| Term | One-liner |
|---|---|
| Overfitting | Model memorises training data, fails on new data |
| Underfitting | Model too simple, misses patterns in training data |
| Bias | Error from wrong assumptions — systematic misfit |
| Variance | Error from sensitivity to training noise |
| Cross-entropy | −Σ y·log(p) — measures distribution divergence |
| Gradient descent | Iteratively move weights in negative gradient direction |
| Backpropagation | Chain rule applied to compute gradients in neural nets |
| Attention | softmax(QK^T/√d)·V — weighted sum over values |
| Embedding | Dense vector representing token/item semantically |
| Transfer learning | Reuse pretrained features for new related task |
| RAG | Retrieve relevant docs → inject as context → generate answer |
| Fine-tuning | Continue training pretrained model on task-specific data |
| LoRA | Add trainable low-rank matrices to frozen pretrained weights |
| RLHF | SFT → reward model on human prefs → PPO to optimise |
| Hallucination | LLM generates confident but factually wrong information |
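
The attention one-liner above can be made concrete with a small NumPy sketch. This is a minimal single-head version with no masking and no learned projections; the shapes and the names `attention` and `softmax` are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n_q, n_k) scaled similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum over the value rows

# Hypothetical toy inputs: 2 queries, 3 keys/values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))
out, weights = attention(Q, K, V)
print(out.shape, weights.sum(axis=1))
```

Each output row is a convex combination of the rows of `V`, which is what "weighted sum over values" refers to.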

| Formula | Name |
|---|---|
| TP/(TP+FP) | Precision |
| TP/(TP+FN) | Recall |
| 2PR/(P+R) | F1 Score |
| 1−SS_res/SS_tot | R² Score |
| mean((y−ŷ)²) | MSE |
| softmax(QK^T/√d)·V | Attention |
| −Σ y·log(p) | Cross-entropy |
| w −= α·∇L(w) | Gradient descent |
| W+B·A | LoRA update |
| ρ·\|w\|+(1−ρ)·w² | ElasticNet penalty (ρ = L1 ratio) |
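
As an illustration of the LoRA update `W+B·A`, the sketch below keeps `W` frozen and routes the input through the two low-rank factors. The layer sizes and the rank `r` are hypothetical, and the scaling factor that LoRA implementations usually apply to `B·A` is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2             # hypothetical sizes; r is the low rank

W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
B = np.zeros((d_out, r))             # trainable, zero-initialised so the update starts at 0
A = rng.normal(size=(r, d_in))       # trainable, small random init

def lora_forward(x, W, B, A):
    # Same result as x @ (W + B @ A).T, without materialising the full update.
    return x @ W.T + (x @ A.T) @ B.T

x = rng.normal(size=(3, d_in))
y = lora_forward(x, W, B, A)

full_params = W.size                 # parameters touched by full fine-tuning
lora_params = A.size + B.size        # parameters trained under this sketch
print(full_params, lora_params)
```

With `B` at zero the layer reproduces the pretrained output exactly, so training starts from the original model's behaviour.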

| Algorithm | Train | Predict |
|---|---|---|
| Linear Regression | O(nd²) | O(d) |
| KNN | O(1) | O(nd) |
| Decision Tree | O(n·d·log n) | O(depth) |
| Random Forest | O(T·n·d·log n) | O(T·depth) |
| SVM | O(n²–n³) | O(sv·d) |
| Self-attention | O(n²·d) | O(n²·d) |
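
The KNN row is the least intuitive one in the table: training costs nothing because the work is deferred to prediction. A brute-force sketch, assuming Euclidean distance and majority voting over integer labels (the function name `knn_predict` and the toy data are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # "Training" is just storing X_train and y_train.
    dists = np.linalg.norm(X_train - x, axis=1)  # n distances, O(d) each -> O(nd)
    # argsort adds O(n log n); np.argpartition would avoid the full sort.
    nearest = np.argsort(dists)[:k]
    votes = np.bincount(y_train[nearest])        # count labels among the k neighbours
    return int(np.argmax(votes))

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.1, 0.1]))
pred_b = knn_predict(X_train, y_train, np.array([5.0, 5.1]))
print(pred_a, pred_b)
```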

| Choice A | vs | Choice B | When to choose B |
|---|---|---|---|
| Accuracy | vs | Interpretability | Regulated domains (medical, legal) |
| Precision | vs | Recall | FN costly (cancer screening) |
| Bias (simple) | vs | Variance (complex) | More data → can increase complexity |
| Fine-tuning | vs | RAG | Knowledge changes frequently |
| LSTM | vs | Transformer | Long contexts, large datasets, parallel training; LSTM can still fit small data or tight compute |
| Ridge | vs | Lasso | Need feature selection |
| GPU training | vs | CPU inference | Small model → CPU saves cost |
| Deep model | vs | Ensemble | Structured/tabular → tree ensemble |
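
The Ridge vs Lasso row can be illustrated with the closed-form solutions that hold in the special case of an orthonormal design (XᵀX = I). The sketch assumes the objectives ½‖y − Xw‖² + (λ/2)‖w‖² for Ridge and ½‖y − Xw‖² + λ‖w‖₁ for Lasso; the coefficient values are made up.

```python
import numpy as np

def ridge_shrink(w_ols, lam):
    # Ridge under an orthonormal design: every coefficient is scaled down,
    # but none becomes exactly zero.
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols, lam):
    # Lasso under an orthonormal design: soft-thresholding, which sets
    # coefficients with |w| <= lam exactly to zero.
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

w_ols = np.array([3.0, -1.5, 0.4, -0.2])  # hypothetical least-squares coefficients
lam = 0.5

w_ridge = ridge_shrink(w_ols, lam)
w_lasso = lasso_shrink(w_ols, lam)
print(w_ridge)
print(w_lasso)
```

Lasso zeroes the two small coefficients while Ridge keeps all four, which is why Lasso doubles as a feature selector.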

| Metric | Task | Key insight |
|---|---|---|
| Accuracy | Classification | Misleading for imbalanced classes |
| AUC-ROC | Binary clf | Threshold-independent ranking quality |
| PR-AUC | Imbalanced | Better than ROC when positives are rare |
| BLEU | Translation | n-gram precision, brevity penalised |
| ROUGE-L | Summarisation | Longest common subsequence recall |
| MAPE | Forecasting | Scale-free %, bad when actuals ≈ 0 |
| Silhouette | Clustering | 1=perfect, 0=overlapping, −1=wrong |
| Perplexity | LLM | exp(mean per-token cross-entropy); lower = better prediction of held-out text |
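
Perplexity is the least self-explanatory metric in the table, so here is a short sketch of how it is computed from the probabilities a model assigned to the tokens that actually occurred. The probability lists are invented for illustration.

```python
import math

def perplexity(token_probs):
    # token_probs: probability the model gave to each observed token.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)  # exponentiated average negative log-likelihood

confident = perplexity([0.9, 0.8, 0.95, 0.85])
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(confident, uniform)
```

A model that spreads probability uniformly over four options scores a perplexity of about 4, which is why perplexity is often read as an effective branching factor.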

| Q | Answer keyword |
|---|---|
| Why scale before SVM? | The margin and kernels such as RBF depend on distances and dot products, so features with large ranges dominate |
| Random Forest vs XGBoost? | RF: parallel bagging, variance reduction. XGB: sequential boosting, bias reduction |
| Why use log loss not MSE for classification? | MSE doesn't penalise confident wrong predictions enough; log loss heavily penalises them |
| Why not always use Adam instead of SGD? | SGD with momentum often reaches better final accuracy for CNNs; Adam converges faster and is the usual choice for NLP |
| Vanishing gradient fix? | ReLU (no saturation for positive inputs), ResNet skip connections, LSTM gates, BatchNorm |
| What is k in k-fold? | Number of equal-sized folds. k=5 or 10 is standard. Each fold serves as test set once. |
| Why use attention instead of RNN? | Parallel (not sequential), captures long-range dependencies equally, no vanishing gradient |
| What is data leakage? | Future/test information leaks into training — e.g. scaling on full data before split |
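
The log loss vs MSE answer is easy to check with arithmetic. The sketch below compares the two losses on a single positive example (y = 1) as the predicted probability gets more confidently wrong; the probability values are arbitrary.

```python
import math

def squared_error(y, p):
    return (y - p) ** 2

def log_loss(y, p):
    # Binary cross-entropy for a single example.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

y = 1
for p in (0.5, 0.1, 0.01, 0.001):
    print(f"p={p:<6} squared error={squared_error(y, p):.3f}  log loss={log_loss(y, p):.3f}")
```

Squared error can never exceed 1 here, while log loss keeps growing without bound as p approaches 0.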

| Snippet | Code |
|---|---|
| Linear Regression (gradient step) | `w -= lr * (2/m) * X.T @ (X@w - y)` |
| Softmax (stabilise) | `e = np.exp(x - x.max())` |
| Softmax (normalise) | `return e / e.sum()` |
| K-Means (assign) | `labels = np.argmin(dist(X,centroids),axis=1)` |
| K-Means (update) | `centroids = [X[labels==k].mean(0) for k in range(K)]` |
| Cosine similarity | `np.dot(a,b) / (np.linalg.norm(a)*np.linalg.norm(b))` |
| Precision | `precision = TP / (TP + FP)` |
| Recall | `recall = TP / (TP + FN)` |
| F1 | `f1 = 2*precision*recall / (precision+recall)` |
| RMSE | `np.sqrt(np.mean((y_true - y_pred)**2))` |
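
A few of the one-liners above need surrounding context to run. The sketch below wraps the softmax, cosine similarity and K-Means lines in functions; `dist` is not defined in the snippets, so it is assumed here to mean pairwise Euclidean distance, and `kmeans_step` and the toy data are illustrative names and values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift by the max to avoid overflow
    return e / e.sum()

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dist(X, centroids):
    # Assumed helper: Euclidean distances, shape (n_points, n_centroids).
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

def kmeans_step(X, centroids):
    # One assignment + update iteration; an empty cluster would yield NaNs.
    labels = np.argmin(dist(X, centroids), axis=1)
    K = len(centroids)
    new_centroids = np.array([X[labels == k].mean(0) for k in range(K)])
    return labels, new_centroids

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])
labels, centroids = kmeans_step(X, centroids)

probs = softmax(np.array([1.0, 2.0, 3.0]))
sim = cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
print(labels, centroids, probs, sim)
```

In practice `kmeans_step` would be called in a loop until the labels stop changing.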