AI/ML Interview Prep — Quick Reference

Concepts · Formulas · Trade-offs · One-liners to memorise

mitraaiprojects.com

Must-Know Definitions (1 line each)

Term	One-liner
Overfitting	Model memorises training data, fails on new data
Underfitting	Model too simple, misses patterns in training data
Bias	Error from wrong assumptions — systematic misfit
Variance	Error from sensitivity to training noise
Cross-entropy	−Σ y·log(p) — measures distribution divergence
Gradient descent	Iteratively move weights in negative gradient direction
Backpropagation	Chain rule applied to compute gradients in neural nets
Attention	softmax(QK^T/√d)·V — weighted sum over values
Embedding	Dense vector representing token/item semantically
Transfer learning	Reuse pretrained features for new related task
RAG	Retrieve relevant docs → inject as context → generate answer
Fine-tuning	Continue training pretrained model on task-specific data
LoRA	Add trainable low-rank matrices to frozen pretrained weights
RLHF	SFT → reward model on human prefs → PPO to optimise
Hallucination	LLM generates confident but factually wrong information
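
The attention one-liner above can be made concrete in a few lines of NumPy. This is a minimal sketch, assuming a single head, no masking and no learned projections; the shapes and the random inputs are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head, without masking."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum over values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries of dimension 4
K = rng.normal(size=(5, 4))   # 5 keys of dimension 4
V = rng.normal(size=(5, 2))   # 5 values of dimension 2
out, weights = attention(Q, K, V)
```

Each output row is a convex combination of the rows of V, which is what "weighted sum over values" means in practice.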

Key Formulas

Formula	Name
`TP/(TP+FP)`	Precision
`TP/(TP+FN)`	Recall
`2PR/(P+R)`	F1 Score
`1−SS_res/SS_tot`	R² Score
`mean((y−ŷ)²)`	MSE
`softmax(QK^T/√d)·V`	Attention
`−Σ y·log(p)`	Cross-entropy
`w −= α·∇L(w)`	Gradient descent
`W+B·A`	LoRA update
`ρ·‖w‖₁+(1−ρ)·‖w‖₂²`	ElasticNet penalty (ρ = L1 mixing ratio, up to an overall scale)
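
The LoRA row can be illustrated with a small NumPy sketch. The sizes, the zero initialisation of B and the function name `lora_forward` are assumptions made for the example; the scaling factor that many implementations apply to B·A is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4            # illustrative sizes; r is the low rank

W = rng.normal(size=(d_out, d_in))    # pretrained weight, kept frozen
B = np.zeros((d_out, r))              # trainable; zero init is an assumption here
A = rng.normal(size=(r, d_in))        # trainable

def lora_forward(x):
    # Same result as (W + B @ A) @ x, without building the full update matrix.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
y = lora_forward(x)

full_params = W.size                  # 64 * 32 = 2048
lora_params = A.size + B.size         # 4 * 32 + 64 * 4 = 384
```

Only A and B would be updated during training, so the trainable parameter count drops from 2048 to 384 in this toy setting.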

Algorithm Complexity

Algorithm	Train	Predict
Linear Regression	O(nd²)	O(d)
KNN	O(1)	O(nd)
Decision Tree	O(n·d·log n)	O(depth)
Random Forest	O(T·n·d·log n)	O(T·depth)
SVM	O(n²–n³)	O(sv·d)
Self-attention (n = sequence length)	O(n²·d)	O(n²·d)
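
To see where the KNN row comes from, here is a brute-force sketch. It assumes Euclidean distance, majority voting and non-negative integer labels; `knn_predict` and the toy data are hypothetical names and values for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # "Training" is just storing X_train and y_train.
    # Prediction computes n distances over d dimensions: O(n*d).
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]         # the sort adds O(n log n)
    votes = np.bincount(y_train[nearest])   # assumes non-negative int labels
    return int(np.argmax(votes))

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.1, 0.1]))
pred_b = knn_predict(X_train, y_train, np.array([5.0, 5.1]))
```

Tree-based indexes can speed up prediction in low dimensions, but the brute-force version is the one the table describes.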

Common Trade-offs to Know

Choice A	vs	Choice B	When to choose B
Accuracy	vs	Interpretability	Regulated domains (medical, legal)
Precision	vs	Recall	FN costly (cancer screening)
Bias (simple)	vs	Variance (complex)	More data → can increase complexity
Fine-tuning	vs	RAG	Knowledge changes frequently
LSTM	vs	Transformer	Most sequence tasks, when data and compute allow
Ridge	vs	Lasso	Need feature selection
GPU training	vs	CPU inference	Small model → CPU saves cost
Deep model	vs	Ensemble	Structured/tabular → tree ensemble
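
The Ridge vs Lasso row ("need feature selection") can be seen in closed form under one simplifying assumption: an orthonormal design matrix (XᵀX = I). With the objectives ½‖y − Xw‖² + λ‖w‖₁ for Lasso and ½‖y − Xw‖² + (λ/2)‖w‖² for Ridge, the solutions are simple functions of the least-squares coefficients. The function names and numbers below are illustrative.

```python
import numpy as np

def lasso_orthonormal(w_ols, lam):
    # Soft-thresholding: shrink by lam and set small coefficients to exactly zero.
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

def ridge_orthonormal(w_ols, lam):
    # Uniform shrinkage: coefficients get smaller but stay non-zero.
    return w_ols / (1.0 + lam)

w_ols = np.array([3.0, -0.4, 0.2, -2.0])   # hypothetical least-squares fit
lam = 0.5
w_lasso = lasso_orthonormal(w_ols, lam)
w_ridge = ridge_orthonormal(w_ols, lam)
```

Lasso zeroes the two weak coefficients, which is the feature-selection behaviour; Ridge keeps all four.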

Evaluation Metrics Quick Ref

Metric	Task	Key insight
Accuracy	Classification	Misleading for imbalanced classes
AUC-ROC	Binary clf	Threshold-independent ranking quality
PR-AUC	Imbalanced	Better than ROC when positives are rare
BLEU	Translation	n-gram precision, brevity penalised
ROUGE-L	Summarisation	Longest common subsequence recall
MAPE	Forecasting	Scale-free %, bad when actuals ≈ 0
Silhouette	Clustering	1=perfect, 0=overlapping, −1=wrong
Perplexity	LLM	exp(mean negative log-likelihood); lower = higher probability assigned to held-out text
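
Perplexity is easy to compute by hand once per-token probabilities are available. A minimal sketch, assuming `token_probs` holds the probability the model assigned to each token that actually occurred (the values are made up):

```python
import math

def perplexity(token_probs):
    # Mean negative log-likelihood over tokens, then exponentiate.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

uniform = perplexity([0.25, 0.25, 0.25, 0.25])   # guessing among 4 options
sharper = perplexity([0.9, 0.8, 0.95, 0.7])      # higher probability on the truth
```

A useful reading: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.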

Viva-Killer Questions with Answers

Q	Answer keyword
Why scale before SVM?	The margin and kernels depend on distances/dot products, so features with large ranges dominate
Random Forest vs XGBoost?	RF: parallel bagging, variance reduction. XGB: sequential boosting, bias reduction
Why use log loss not MSE for classification?	MSE doesn't penalise confident wrong predictions enough; log loss heavily penalises them
Why not always Adam instead of SGD?	SGD with momentum often reaches better final accuracy for CNNs; Adam converges faster and is the usual choice for NLP
Vanishing gradient fix?	ReLU (no saturation for positive inputs), ResNet skip connections, LSTM gates, BatchNorm
What is k in k-fold?	Number of equal-sized folds. k=5 or 10 is standard. Each fold serves as test set once.
Why use attention instead of RNN?	Processes all positions in parallel (not sequentially), connects any two tokens directly for long-range dependencies, no vanishing gradient through time
What is data leakage?	Future/test information leaks into training — e.g. scaling on full data before split
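
The k-fold and data-leakage answers fit together in one sketch: split first, then fit the scaler on the training folds only. The helper names `kfold_indices` and `standardise_without_leakage` are hypothetical, and the data is random.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # Shuffle once, then split into k nearly equal folds.
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def standardise_without_leakage(X, folds, i):
    val_idx = folds[i]
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Fit the scaling statistics on the training folds only.
    mu = X[train_idx].mean(axis=0)
    sigma = X[train_idx].std(axis=0) + 1e-12
    # Reuse the same statistics for the held-out fold.
    return (X[train_idx] - mu) / sigma, (X[val_idx] - mu) / sigma

X = np.random.default_rng(1).normal(loc=10.0, scale=3.0, size=(20, 2))
folds = kfold_indices(len(X), k=5)
X_train, X_val = standardise_without_leakage(X, folds, i=0)
```

Computing `mu` and `sigma` on all 20 rows before splitting would be the leaky version: the held-out fold would influence the preprocessing of the training data.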

Coding Interview Patterns (ML)

Linear Regression:
# One gradient step on MSE; X is (m, d), y is (m,), lr is the learning rate
w -= lr * (2/m) * X.T @ (X@w - y)
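
A full loop around that update might look like the sketch below. The synthetic data, the choice of `true_w`, the learning rate and the iteration count are all assumptions picked so the example converges quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X = np.c_[np.ones(m), rng.normal(size=m)]   # bias column plus one feature
true_w = np.array([1.0, 2.0])               # hypothetical ground truth
y = X @ true_w + 0.01 * rng.normal(size=m)  # small observation noise

w = np.zeros(2)
lr = 0.1
for _ in range(500):
    grad = (2 / m) * X.T @ (X @ w - y)      # gradient of mean squared error
    w -= lr * grad                          # step against the gradient
```

After the loop, `w` should sit close to `true_w`; in an interview it is worth mentioning that the learning rate must be small enough for the iteration to stay stable.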

Softmax:
import numpy as np
def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()
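
The same max-subtraction trick gives a stable cross-entropy directly from logits. A minimal sketch, assuming a single example with an integer class index as the target; the function names are illustrative.

```python
import numpy as np

def log_softmax(x):
    # log(softmax(x)) computed without forming tiny probabilities first.
    shifted = x - x.max()
    return shifted - np.log(np.exp(shifted).sum())

def cross_entropy(logits, target):
    # With a one-hot target, -sum(y * log p) reduces to -log p[target].
    return -log_softmax(logits)[target]

logits = np.array([1000.0, 1001.0, 1002.0])   # a naive np.exp(logits) would overflow
loss_correct = cross_entropy(logits, target=2)
loss_wrong = cross_entropy(logits, target=0)
```

The loss is larger when the target is the class with the lowest logit, which is the "confident and wrong is punished heavily" behaviour of log loss.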

K-Means:
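# X: (n, d) points, centroids: (K, d) array
dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)  # (n, K) distances
labels = np.argmin(dists, axis=1)
centroids = np.array([X[labels == k].mean(0) for k in range(K)])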
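
Wrapped in a loop, the two steps above become a complete (if basic) K-Means. This sketch assumes plain random initialisation from data points, a fixed iteration count instead of a convergence check, and keeps the old centroid when a cluster is empty; the data is synthetic.

```python
import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise from K distinct data points (not k-means++).
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: keep the old centroid if a cluster ends up empty.
        centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),   # blob around (0, 0)
               rng.normal(5, 0.3, size=(30, 2))])  # blob around (5, 5)
labels, centroids = kmeans(X, K=2)
```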

Cosine similarity:
np.dot(a,b) / (np.linalg.norm(a)*np.linalg.norm(b))
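
Cosine similarity is the usual scoring function for the "retrieve relevant docs" step in RAG. The sketch below ranks a handful of toy vectors against a query; `top_k_cosine` is a hypothetical helper, and it assumes no vector has zero norm.

```python
import numpy as np

def top_k_cosine(query, docs, k=2):
    # Normalise so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    D = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = D @ q
    order = np.argsort(-sims)[:k]   # indices of the k most similar rows
    return order, sims[order]

docs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [2.0, 2.0],
                 [-1.0, 0.0]])
query = np.array([3.0, 0.1])
idx, sims = top_k_cosine(query, docs)
```

Because both sides are normalised, vector length does not matter: only direction affects the ranking.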

Precision/Recall:
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1 = 2*precision*recall / (precision+recall)
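
Starting from raw labels instead of pre-counted TP/FP/FN, a version with zero-division guards could look like this. The labels are made up to show an imbalanced case (5 positives out of 100); returning 0.0 when a denominator is zero is a convention chosen for the sketch.

```python
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1] * 5 + [0] * 95
y_pred = [1, 1, 0, 0, 0] + [1] + [0] * 94   # 2 TP, 3 FN, 1 FP

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Accuracy looks strong here because negatives dominate, while recall shows that most positives were missed.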

RMSE:
np.sqrt(np.mean((y_true - y_pred)**2))
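
R² from the formulas table pairs naturally with RMSE. A small sketch with made-up numbers; `r2_score` here is a hand-written helper for illustration, not a library call.

```python
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
r2 = r2_score(y_true, y_pred)
# Predicting the mean everywhere gives R² = 0 by construction.
baseline_r2 = r2_score(y_true, np.full_like(y_true, y_true.mean()))
```

RMSE is in the units of the target, while R² is unitless and compares the model against the predict-the-mean baseline.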