AI/ML Interview Preparation — Mitra AI Projects

🤖 ML Fundamentals

These appear in almost every ML interview. Know these cold.

What is the bias-variance trade-off?

Easy▼

Bias: error from wrong assumptions — underfitting (high bias = model too simple, misses patterns).
Variance: error from sensitivity to training data — overfitting (high variance = model memorises, doesn't generalise).

Trade-off: Increasing model complexity reduces bias but increases variance. Optimal point is between the two extremes.

Fixes: High bias → more features, more complex model, less regularisation. High variance → more data, regularisation, simpler model, dropout.

Explain L1 vs L2 regularisation.

Easy▼

L1 (Lasso): adds λ·Σ|wᵢ| to the loss. Creates sparse models by driving some weights to exactly zero (implicit feature selection). Use when many features are likely irrelevant.

L2 (Ridge): adds λ·Σwᵢ² to the loss. Shrinks all weights toward zero but rarely to exactly zero. Better when most features contribute somewhat. A common default choice.

ElasticNet: combines both. Use when you want sparsity but not L1's instability with correlated features.
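
The sparsity difference shows up even for a single weight. This is a minimal sketch, assuming a one-dimensional quadratic loss 0.5*(w - w0)² around an unpenalised solution w0 and an illustrative penalty strength lam; the function names are hypothetical. Under that assumption the L1 solution is soft-thresholding and the L2 solution is proportional shrinkage.

```python
def l1_solution(w0, lam):
    """Minimiser of 0.5*(w - w0)**2 + lam*abs(w): soft-thresholding."""
    magnitude = max(abs(w0) - lam, 0.0)
    return magnitude if w0 >= 0 else -magnitude

def l2_solution(w0, lam):
    """Minimiser of 0.5*(w - w0)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return w0 / (1.0 + lam)

unpenalised = [3.0, 0.4, -0.2, -2.5]   # illustrative unpenalised weights
lam = 0.5
l1_weights = [l1_solution(w, lam) for w in unpenalised]
l2_weights = [l2_solution(w, lam) for w in unpenalised]
```

Weights whose magnitude is below lam are set exactly to zero by L1, while L2 only scales every weight down by the same factor.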

When would you use precision vs recall vs F1?

Easy▼

Precision = TP/(TP+FP). Use when false positives are costly. Example: spam filter — wrongly marking real email as spam (FP) is bad.

Recall = TP/(TP+FN). Use when false negatives are costly. Example: cancer detection — missing a cancer (FN) is far worse than a false alarm.

F1 = harmonic mean of precision and recall. Use when you need to balance both, or when classes are imbalanced and accuracy is misleading.
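
A small worked example makes the contrast with accuracy concrete. The confusion-matrix counts below are made up for illustration (1,000 samples, 50 of them positive), and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0   # harmonic mean
    return precision, recall, f1

# Hypothetical imbalanced test set: 1,000 samples, 50 positives.
tp, fp, fn, tn = 30, 10, 20, 940
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Here accuracy is 0.97 even though the model misses 40% of the positives (recall 0.60, precision 0.75), which is why F1 is the more honest summary on imbalanced data.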

How does gradient boosting work?

Medium▼

Gradient boosting builds trees sequentially. Each tree corrects the errors of the previous ensemble:

1. Start with a simple prediction (mean of target)
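2. Compute residuals (actual − predicted); for squared-error loss these equal the negative gradient of the loss, which is what gets fitted for other losses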
3. Fit a tree to the residuals
4. Add tree to ensemble with learning rate: F(x) += lr × tree(x)
5. Repeat N times
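
The five steps can be sketched with one-split regression trees (stumps) on a single feature. This is an illustrative toy under squared-error loss, not how any production library implements boosting; `fit_stump` and `gradient_boost` are hypothetical names and the sine data is made up.

```python
import numpy as np

def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:            # keep both sides non-empty
        left = x <= threshold
        left_val, right_val = residuals[left].mean(), residuals[~left].mean()
        sse = (((residuals[left] - left_val) ** 2).sum()
               + ((residuals[~left] - right_val) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, threshold, left_val, right_val)
    _, threshold, left_val, right_val = best
    return lambda q: np.where(q <= threshold, left_val, right_val)

def gradient_boost(x, y, n_trees=50, lr=0.1):
    base = y.mean()                                # step 1: initial prediction
    pred = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_trees):                       # step 5: repeat
        residuals = y - pred                       # step 2: residuals
        stump = fit_stump(x, residuals)            # step 3: fit a tree to them
        pred += lr * stump(x)                      # step 4: shrunken update
        stumps.append(stump)
    return lambda q: base + lr * sum(s(q) for s in stumps)

x = np.linspace(0, 6, 60)
y = np.sin(x)
model = gradient_boost(x, y)
mse_baseline = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - model(x)) ** 2)
```

Each stump is weak on its own; the training error falls because every round fits what the current ensemble still gets wrong.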

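XGBoost improvements: explicit L1/L2 regularisation on leaf weights, 2nd-order gradients (Newton-style updates), sparsity-aware split finding, and parallelised tree construction, alongside row/column subsampling and early stopping. In practice this often makes XGBoost faster and more robust than sklearn's classic GradientBoosting estimators, though results depend on the dataset and tuning.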

What is data leakage and how do you prevent it?

Medium▼

Data leakage occurs when the model is trained on information that would not be available at prediction time. It produces artificially high CV scores that don't hold in production.

Types:
• Train-test contamination: scaler fitted on full data before split
• Target leakage: features that encode the target (e.g. "loan_paid_off" in loan default prediction)
• Time leakage: future data used to predict past

Prevention: Use sklearn Pipeline so all preprocessing fits only on training folds. Always split BEFORE any preprocessing. Review features for implicit target encoding.
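
The "split before preprocessing" rule can be shown with plain NumPy standardisation. This is a minimal sketch on synthetic data; a Pipeline applies the same idea automatically inside each CV fold.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # synthetic features

# Split FIRST, so the test rows never influence any fitted statistic.
X_train, X_test = X[:80], X[80:]

# "Fit" the scaler on the training rows only ...
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

# ... then apply that same transform to both splits.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

# Leaky variant, for contrast: statistics computed on all rows.
leaky_mean = X.mean(axis=0)
```

The leaky mean differs from the training-only mean precisely because it has seen the test rows.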

Explain cross-validation and when you'd use k-fold vs stratified k-fold.

Easy▼

Cross-validation gives a more reliable performance estimate by using all data for both training and validation across k folds.

K-Fold: split into k equal parts, train on k-1, test on 1, rotate. Good for balanced datasets and regression.

Stratified K-Fold: each fold has the same class distribution as the full dataset. Essential for imbalanced classification problems. If 5% positives, each fold has 5% positives.

Time-Series CV: expanding window — never use future data to train. Standard k-fold would cause time leakage.
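
Stratification itself is simple to sketch: group indices by class, then deal each class round-robin across the folds. The helper below is a hypothetical, unshuffled illustration; a real splitter would usually shuffle within each class first.

```python
from collections import defaultdict

def stratified_kfold_indices(labels, k):
    """Return k lists of validation indices with near-equal class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

labels = [1] * 10 + [0] * 190          # 5% positives
folds = stratified_kfold_indices(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Every fold ends up with 40 samples, 2 of them positive, matching the 5% rate of the full dataset.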

How would you handle class imbalance?

Medium▼

Multiple strategies depending on severity:

Algorithm level: class_weight='balanced' in sklearn — adjusts weights so minority class contributes more to loss. Simplest, often sufficient.

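Sampling: Oversample minority (SMOTE generates synthetic examples by interpolating between neighbours), undersample majority (random drop). Use with caution: it can create noisy examples, and it must be applied to the training folds only, never to validation or test data.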

Threshold tuning: default threshold is 0.5 for predict(). Lower it to increase recall for minority class. Use precision-recall curve to find optimal threshold.
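
Threshold tuning can be sketched as a scan over candidate thresholds that keeps the one with the best F1. The labels and scores below are made up, and `best_f1_threshold` is a hypothetical helper; in practice the scan should run on a validation set, not the test set.

```python
def best_f1_threshold(y_true, scores, candidates):
    best_threshold, best_f1 = None, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(preds, y_true) if p == 0 and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1

# Made-up validation labels and predicted probabilities (3 positives in 10).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60]
threshold, f1 = best_f1_threshold(y_true, scores, [0.5, 0.4, 0.3, 0.2])
```

At the default 0.5 only one of the three positives is caught (F1 = 0.5); lowering the threshold to 0.3 catches all three and gives the best F1 of 0.75, while 0.2 admits too many false positives.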

Metric: Don't rely on accuracy. Use F1, AUC-ROC, or PR-AUC for imbalanced datasets; under severe imbalance PR-AUC is usually the most informative.

What is the curse of dimensionality?

Medium▼

As feature count grows, the volume of the space increases exponentially. Data becomes sparse — distances lose meaning, making distance-based methods (KNN, SVM) unreliable.

Effect: More dimensions require exponentially more data to maintain statistical power. With 10 features, need ~10^10 samples for the same density as 1 feature with 10 samples.
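
The loss of distance contrast can be observed directly. This is a minimal sketch, assuming uniformly random points in the unit hypercube; `relative_contrast` is a hypothetical helper measuring how much farther the farthest point is than the nearest one.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_low = relative_contrast(500, 2)       # 2 features
contrast_high = relative_contrast(500, 1000)   # 1,000 features
```

In 2 dimensions the farthest point is many times farther away than the nearest; in 1,000 dimensions all points sit at almost the same distance, so "nearest neighbour" carries little information.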

Fixes: Feature selection (remove irrelevant), PCA/dimensionality reduction, regularisation (implicit feature selection), domain knowledge to select fewer meaningful features.

🧠 Deep Learning

Architecture design, training tricks, and why things work.

Explain the vanishing gradient problem and solutions.

Medium▼

In deep networks, gradients can shrink exponentially during backpropagation, so early layers get near-zero updates and barely learn. The cause is repeated multiplication by factors below 1: sigmoid's derivative is at most 0.25, and both sigmoid and tanh saturate for large inputs.

Solutions:
• ReLU activation: gradient is 1 for positive inputs, doesn't saturate. Default choice for hidden layers.
• Residual connections (ResNet): skip connections provide a gradient highway — gradient can flow directly through identity mapping.
• Batch Normalisation: normalises layer inputs, keeps activations in non-saturating range.
• LSTM/GRU gates: designed specifically to control gradient flow in sequential models.
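
The shrinkage is easy to quantify. This is a deliberately simplified sketch that assumes every layer contributes the same activation derivative and all weights are 1, so the backpropagated factor is just that derivative raised to the depth; the function names are hypothetical.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)               # peaks at 0.25 when z = 0

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

def gradient_scale(derivative, depth, z=0.0):
    """Product of `depth` identical activation derivatives (weights fixed at 1)."""
    return derivative(z) ** depth

sigmoid_scale = gradient_scale(sigmoid_derivative, depth=20)      # 0.25 ** 20
relu_scale = gradient_scale(relu_derivative, depth=20, z=1.0)     # 1.0 ** 20
```

Even in sigmoid's best case the gradient reaching the first of 20 layers is scaled by roughly 1e-12, while ReLU passes it through unchanged for positive inputs.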

What is dropout and how does it prevent overfitting?

Easy▼

Dropout randomly deactivates neurons (with probability p, typically 0.2–0.5) during each training step.

Why it works: Forces the network to learn redundant representations — no single neuron can rely on specific other neurons always being present. Acts as training an ensemble of 2^n subnetworks.

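Training vs inference: During inference, all neurons are active. In the original formulation, outputs are scaled by (1−p) at inference to maintain expected values; most modern frameworks instead use inverted dropout, scaling kept activations by 1/(1−p) during training so inference needs no change.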
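
A minimal NumPy sketch of the inverted variant, which rescales the surviving activations during training so that inference is a no-op. It preserves the same expected values as scaling by (1−p) at inference. The `dropout` function and its signature are illustrative, not a library API.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: drop with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x                                  # inference: identity
    keep_mask = rng.random(x.shape) >= p          # True for neurons that stay on
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)
```

With p = 0.5, each training activation becomes either 0 or 2, so the mean stays close to the original value of 1.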

When to use: After dense layers in fully connected networks. Not typically after conv layers (use spatial dropout instead). Less useful with batch normalisation.

What is Batch Normalisation and where do you place it?

Medium▼

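BN normalises layer inputs per feature over the mini-batch: x̂ = (x − μ_batch) / √(σ²_batch + ε), then scales with learnable γ and shifts with learnable β: y = γ·x̂ + β. At inference it uses running averages of μ and σ² collected during training.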
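
The training-time forward pass can be sketched in a few lines of NumPy. This illustration covers only the batch-statistics path for a 2-D input (batch, features) and omits the running averages used at inference; `batch_norm_train` is a hypothetical name, and the small eps term guards against division by zero.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalise each feature over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=4.0, size=(64, 8))   # batch of 64, 8 features
out = batch_norm_train(x, gamma=np.ones(8), beta=np.zeros(8))
```

With γ = 1 and β = 0 every output feature has approximately zero mean and unit variance, whatever the input scale was.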

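Benefits: Allows higher learning rates. Acts as a mild regulariser. Makes training faster and more stable. The original paper credits reduced internal covariate shift; later work suggests a smoother loss landscape is the main effect.

Placement debate: Original paper: Conv → BN → Activation, which is still the most common order. Some practitioners report good results with Conv → Activation → BN. In Transformers the analogous choice concerns LayerNorm rather than BN: Pre-LN (LayerNorm before each sublayer) is more stable for training very deep models.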

Note: BN is batch-dependent — problematic for small batches. Use LayerNorm for NLP/Transformers, GroupNorm for small-batch image training.

Explain the self-attention mechanism in Transformers.

Hard▼

Self-attention allows every token to attend to every other token in the sequence:

Attention(Q,K,V) = softmax(QK^T / √d_k) × V

Each token creates Q (query), K (key), V (value) projections. Attention weights = how much each token should attend to every other token. Scaled by √d_k to prevent softmax saturation.
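
The formula maps almost line for line onto NumPy. This is a single-head sketch without masking or batching; the random embeddings and projection matrices stand in for learned parameters.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n, n) token-to-token scores
    scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n_tokens, d_model))                # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Row i of `weights` is the distribution describing how much token i attends to every token, and `output[i]` is the corresponding weighted average of the value vectors.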

Multi-head: Run h attention heads in parallel, each with different W_Q, W_K, W_V projections. Each head can focus on different relationship types (syntax, coreference, semantics). Concatenate and project outputs.

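Complexity: O(n²) in sequence length for both compute and memory in the standard implementation. Flash Attention computes the same exact attention in tiles without materialising the full n×n matrix, cutting memory to linear in n and reducing memory traffic; compute remains O(n²).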

What is transfer learning and when does it work?

Easy▼

Transfer learning reuses a model trained on a large dataset as the starting point for a new task, requiring less data and compute.

Works best when: Source and target domains are related (both are images, or both are English text). Pre-trained model has learned useful features (edges, textures for images; grammar, semantics for NLP).

Strategies:
• Feature extraction: freeze all layers, train only the new head. Use with very little data (<1K examples).
• Fine-tuning: unfreeze later layers, train with low learning rate. Better accuracy, needs more data (5K+).
• Full fine-tuning: unfreeze all layers. Needs a large dataset and a low learning rate; otherwise the model tends to overfit and to overwrite its pre-trained features (catastrophic forgetting).

Why ImageNet features transfer: Low-level features (edges, textures) are universal across visual domains.

What is the difference between LSTM and GRU?

Medium▼

LSTM: 3 gates (forget, input, output) + cell state + hidden state. More parameters, more expressive. Better for tasks requiring fine-grained memory control over long sequences.

GRU: 2 gates (reset, update). Merges cell and hidden state. ~25% fewer parameters (3 weight blocks instead of LSTM's 4). Often comparable or better on shorter sequences. Faster to train.
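
The parameter gap can be checked by counting weights. This sketch assumes one bias vector per block (some frameworks use two) and illustrative layer sizes; an LSTM has four weight blocks (three gates plus the candidate cell), a GRU three (two gates plus the candidate state).

```python
def recurrent_params(n_blocks, input_size, hidden_size):
    """Each block has input weights, recurrent weights, and one bias vector."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block

lstm = recurrent_params(4, input_size=128, hidden_size=256)   # forget, input, output + candidate
gru = recurrent_params(3, input_size=128, hidden_size=256)    # reset, update + candidate
reduction = 1 - gru / lstm
```

For equal input and hidden sizes the GRU has exactly three quarters of the LSTM's parameters, a 25% reduction.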

Rule of thumb: Try GRU first (faster). Switch to LSTM if sequence dependencies are very long (>200 steps) or GRU doesn't converge.

Both largely superseded by Transformers: For most NLP tasks today, Transformers outperform both thanks to global attention and parallel training. LSTMs and GRUs are still used for streaming/real-time applications where Transformers are too expensive.

✨ Generative AI & LLMs

High demand in 2025. Every AI Engineer role will ask these.

What is RAG and why is it better than fine-tuning for knowledge-intensive tasks?

Medium▼

RAG retrieves relevant documents at query time and injects them into the prompt, grounding the LLM in real information.

RAG vs Fine-tuning for knowledge tasks:
• RAG: dynamic, updateable (add new docs without retraining), transparent (can cite sources), no training cost
• Fine-tuning: static (knowledge baked in), no source attribution, expensive to update

When to use RAG: Company knowledge bases, current events, legal/medical documents, any domain where information changes frequently.

When to fine-tune: Custom writing style, format/tone, domain-specific reasoning patterns, reducing prompt length for cost.

Explain the RAG pipeline end-to-end.

Medium▼

Indexing phase (one-time):
1. Load documents (PDF, web, database)
2. Chunk: split into ~500 token segments with 50-token overlap
3. Embed: run each chunk through embedding model (text-embedding-3-small)
4. Store: save vectors + metadata in vector DB (ChromaDB, Pinecone)

Query phase (per user question):
1. Embed the user's question
2. Similarity search: find top-k most similar chunks
3. Rerank (optional): cross-encoder reranks for precision
4. Inject: build prompt with retrieved chunks as context
5. Generate: LLM answers grounded in retrieved context
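
The mechanics of chunking, retrieval, and prompt construction can be sketched without external services. Everything here is a toy stand-in: `embed` uses bag-of-words counts instead of a real embedding model, the three "chunks" are invented, and the final LLM call is left out.

```python
import math
from collections import Counter

def chunk(words, size, overlap):
    """Split a word list into windows of `size` items sharing `overlap` items."""
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k):
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "refunds are issued within 14 days of purchase",
    "shipping takes 3 to 5 business days",
    "support is available by email on weekdays",
]
question = "how many days until refunds are issued"
top = retrieve(question, chunks, k=1)
prompt = ("Answer using only this context:\n" + "\n".join(top)
          + "\n\nQuestion: " + question)
windows = chunk(list(range(10)), size=4, overlap=1)
```

The overlap means neighbouring windows share their boundary items (here 3 and 6), so a sentence that straddles a chunk boundary is still retrievable from at least one chunk.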

Key metrics: Context Precision, Context Recall, and Answer Faithfulness (RAGAS framework)

What is hallucination and how do you reduce it?

Easy▼

Hallucination: LLM generates confident but factually incorrect information. Caused by next-token prediction optimisation — plausible-sounding completions, not verified facts.

Reduction strategies:
• RAG: ground responses in retrieved real documents
• Prompt instruction: "Only answer based on the provided context. If not in the context, say 'I don't know.'"
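• Temperature: lower temperature (0.0–0.3) makes responses more deterministic/conservative, though it does not by itself make them factual
• Citations: ask the model to cite sources, which encourages it to reference actual retrieved content; verify the citations, since they can be fabricated too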
• Faithfulness check: RAGAS faithfulness metric, or a second LLM call to verify claims against retrieved context

What is LoRA and how does it reduce fine-tuning cost?

Hard▼

LoRA (Low-Rank Adaptation) adds small trainable rank-decomposition matrices alongside frozen pre-trained weights:

W_new = W_frozen + B × A where B is (m×r) and A is (r×n), r ≪ min(m,n)

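Why it works: Weight updates during fine-tuning tend to have low intrinsic rank, so a small number of dimensions capture most of the adaptation. In practice r=8–16 often comes close to full fine-tuning quality on many tasks.

Cost savings: Only B and A train, typically 0.1–1% of the original parameters. A 7B model's weights alone take ~28GB in FP32 (~14GB in FP16), and full fine-tuning with Adam needs several times that for gradients and optimiser states. With QLoRA (4-bit quantisation + LoRA) the base weights shrink to roughly 3.5GB, so fine-tuning can fit on a single consumer GPU.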

Deployment: Merge B×A back into W at inference — no extra latency. Or hot-swap adapters for multi-task serving.
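
A NumPy sketch of one adapted layer shows both the parameter saving and why merging adds no latency. The sizes are illustrative, and the random "trained" values of B and A are stand-ins for what fine-tuning would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 512, 512, 8
W_frozen = rng.normal(size=(m, n))            # pre-trained weight, never updated
A = rng.normal(scale=0.01, size=(r, n))       # trainable, (r x n)
B = rng.normal(scale=0.01, size=(m, r))       # trainable, (m x r); stand-in for trained values

x = rng.normal(size=(n,))
adapter_out = W_frozen @ x + B @ (A @ x)      # training-time forward: two thin matmuls
W_merged = W_frozen + B @ A                   # deployment: fold the adapter into W
merged_out = W_merged @ x

trainable_fraction = (B.size + A.size) / W_frozen.size
```

For this single 512×512 layer the adapter holds about 3% of the layer's parameters; across a whole model the fraction is lower because adapters are usually attached to only some of the weight matrices.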

Explain how RLHF works.

Hard▼

RLHF (Reinforcement Learning from Human Feedback) aligns LLMs with human preferences in 3 steps:

Step 1 — SFT: Supervised Fine-Tuning on high-quality demonstration data. Teaches the model the desired format and tone.

Step 2 — Reward Model: Humans rate pairs of model outputs (which is better). Train a reward model on these preferences. RM(response) → scalar score.

Step 3 — PPO Fine-Tuning: Use PPO (Proximal Policy Optimisation) to maximise the reward model's score. Add KL divergence penalty to prevent the policy from drifting too far from the SFT baseline.

Alternative: DPO (Direct Preference Optimisation). Skips the separate reward model and the RL loop, optimising directly on preference pairs with a classification-style loss. Simpler, typically more stable, and often comparable in quality.
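
The DPO objective for one preference pair is compact enough to write out. This sketch assumes the inputs are summed log-probabilities of each full response under the trainable policy and the frozen reference (SFT) model; the numbers are invented and beta = 0.1 is only an illustrative setting.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair; inputs are sequence log-probabilities."""
    chosen_margin = policy_chosen - ref_chosen        # policy's shift on the preferred answer
    rejected_margin = policy_rejected - ref_rejected  # policy's shift on the rejected answer
    logit = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-logit))               # equals -log(sigmoid(logit))

at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)   # policy identical to reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)       # policy favours the chosen answer more
```

When the policy equals the reference the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one, with beta playing a role similar to the KL penalty in the PPO step.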

What are embeddings and what makes a good embedding model?

Medium▼

Embeddings map text to dense vectors where semantically similar text is geometrically close. Enable: semantic search, clustering, RAG retrieval, classification.

Good embedding model properties:
• High MTEB score for your task type (check mteb.leaderboard.huggingface.co)
• Appropriate dimensionality: 384d (all-MiniLM-L6-v2, 80MB) vs 1536d (text-embedding-3-small) vs 3072d (text-embedding-3-large). Higher dimensionality is not always better
• Domain-specific fine-tuning if needed — generic models may miss domain terminology
• Speed/cost — local models free, API models have per-token cost

Popular choices: all-MiniLM-L6-v2 (fast, free, local), BAAI/bge-m3 (multilingual), OpenAI text-embedding-3-small (strong, cheap API)

⚙️ ML System Design

Senior-level questions. Show you think about scale, reliability, and monitoring.

Design a recommendation system for an e-commerce platform.

Hard▼

Requirements clarification: Real-time vs batch? Cold start handling? CTR vs purchase optimisation?

Architecture (two-stage):
1. Retrieval (candidate generation): Collaborative filtering (ALS/neural CF) + item embeddings. Return top-500 candidates from millions. Fast, approximate.
2. Ranking: LightGBM or neural network with user features + item features + context. Scores 500 candidates, returns top-20. Slower, more accurate.

Features: User history (purchases, views, dwell time), item attributes, context (device, time, location), collaborative signals (users like you bought X).

Cold start: New user → popularity-based + content-based. New item → content embeddings until behavioral data collected.

Monitoring: CTR, conversion rate, revenue per user, novelty/diversity. A/B test new models. Watch for filter bubbles.

How would you detect and handle model drift in production?

Hard▼

Types of drift:
• Data drift: input feature distribution changes (e.g., suddenly more mobile users)
• Concept drift: relationship between features and target changes (e.g., COVID changed what drives hotel bookings)
• Prediction drift: model output distribution shifts

Detection:
• Statistical tests: KS test, PSI (Population Stability Index), Jensen-Shannon divergence on feature distributions
• Performance monitoring: track model metrics on labelled subset with delayed labels
• Tools: Evidently AI, Fiddler, WhyLabs

Response:
• Alert when PSI > 0.2 or performance drops > 5%
• Trigger retraining pipeline with recent data
• Shadow test new model before replacing production
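
PSI for a single numeric feature can be computed in a few lines. This is a sketch assuming quantile bins derived from the reference (training) sample and a small eps to avoid log(0); the `psi` helper and the synthetic data are illustrative, and the 0.2 alert level is the rule of thumb mentioned above.

```python
import numpy as np

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index using quantile bins from the reference sample."""
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    # searchsorted maps each value to a bin index in 0..n_bins-1,
    # so out-of-range production values fall into the outermost bins.
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(inner_edges, current), minlength=n_bins)
    ref_frac = ref_counts / len(reference) + eps
    cur_frac = cur_counts / len(current) + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)
same_dist = rng.normal(0.0, 1.0, 5000)        # fresh sample, no drift
shifted = rng.normal(1.0, 1.0, 5000)          # mean moved by one standard deviation
psi_stable = psi(train_feature, same_dist)
psi_drifted = psi(train_feature, shifted)
```

The undrifted sample scores well below the alert level, while the shifted sample lands far above 0.2 and would trigger the retraining pipeline.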

How would you serve a real-time ML model at 10,000 requests/second?

Hard▼

Model optimisation: Quantise to INT8 (up to 4x smaller than FP32, often 2–4x faster on supported hardware). Use ONNX Runtime or TensorRT for optimised inference. Batch incoming requests (dynamic batching).

Serving infrastructure: FastAPI + Uvicorn (async, multiple worker processes). Or TorchServe/Triton Inference Server for model-specific optimisations. Containerise with Docker and orchestrate with Kubernetes.

Scaling: The Horizontal Pod Autoscaler (HPA) scales pods based on CPU or custom metrics (request queue depth). Deploy to multiple regions with geo-aware load balancing for global latency reduction; a CDN helps only for cacheable responses.

Latency targets: P50 <10ms, P99 <100ms. Monitor with Prometheus + Grafana. Alert on P99 spikes.

Caching: Cache frequent/repeated predictions in Redis. Feature store (Feast) for pre-computed features. Avoid recomputing expensive features per request.

💻 Coding for ML Interviews

Implement from scratch — shows you understand the math, not just the API.

Implement linear regression from scratch using gradient descent.

Medium▼

import numpy as np

class LinearRegression:
    def __init__(self, lr=0.01, n_iter=1000):
        self.lr, self.n_iter = lr, n_iter

    def fit(self, X, y):
        m, n = X.shape
        self.w = np.zeros(n)  # weights
        self.b = 0             # bias

        for _ in range(self.n_iter):
            y_pred = X @ self.w + self.b       # forward pass
            error  = y_pred - y
            dw = (2/m) * X.T @ error           # gradient w.r.t. w
            db = (2/m) * error.sum()           # gradient w.r.t. b
            self.w -= self.lr * dw             # gradient descent step
            self.b -= self.lr * db

    def predict(self, X):
        return X @ self.w + self.b

Key concepts to explain: MSE loss = mean((y_pred - y)²). Gradient of MSE w.r.t. w = (2/m) * X.T @ (y_pred - y). Gradient descent updates: move in direction of negative gradient.

Implement k-means clustering from scratch.

Medium▼

import numpy as np

def kmeans(X, k, max_iter=100):
    # Random initialisation
    centroids = X[np.random.choice(len(X), k, replace=False)]

    for _ in range(max_iter):
        # Assign: each point to nearest centroid
        dists = np.linalg.norm(X[:,None] - centroids[None,:], axis=2)
        labels = np.argmin(dists, axis=1)

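        # Update: move centroids to cluster means
        # (an empty cluster keeps its previous centroid, avoiding NaN means)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)])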
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return labels, centroids

Explain: Euclidean distance, convergence when centroids don't move. Limitations: requires k upfront, sensitive to initialisation (k-means++ fixes this), assumes spherical clusters.

Implement softmax and cross-entropy loss.

Medium▼

import numpy as np

def softmax(logits):
    # Subtract max for numerical stability (prevents exp overflow)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_true, y_pred_logits):
    # y_true: integer class labels
    # y_pred_logits: raw scores (before softmax)
    probs = softmax(y_pred_logits)
    n = len(y_true)
    # Clip to avoid log(0)
    probs = np.clip(probs[np.arange(n), y_true], 1e-12, 1.0)
    return -np.log(probs).mean()

Key: Always subtract max before exp (numerical stability). Cross-entropy = -log(correct class probability). Gradient of softmax + cross-entropy combined simplifies to: (p - one_hot) / n.
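
The simplified gradient can be verified numerically, which is a useful habit to mention in interviews. This sketch redefines compact versions of the two functions so it runs on its own, then compares (p - one_hot) / n against central finite differences on random logits.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_true, logits):
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(y_true)), y_true]).mean()

def analytic_grad(y_true, logits):
    grad = softmax(logits)
    grad[np.arange(len(y_true)), y_true] -= 1.0     # p - one_hot
    return grad / len(y_true)

def numerical_grad(y_true, logits, h=1e-6):
    grad = np.zeros_like(logits)
    for idx in np.ndindex(*logits.shape):           # perturb one logit at a time
        plus, minus = logits.copy(), logits.copy()
        plus[idx] += h
        minus[idx] -= h
        grad[idx] = (cross_entropy(y_true, plus) - cross_entropy(y_true, minus)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))
y_true = np.array([0, 2, 1, 1, 0])
max_diff = np.abs(analytic_grad(y_true, logits) - numerical_grad(y_true, logits)).max()
```

The two gradients agree to within floating-point noise, and each row of the analytic gradient sums to zero because the softmax probabilities sum to one.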

💬 Behavioral & Project Questions

Use the STAR format: Situation, Task, Action, Result.

Tell me about your most challenging ML project.

Easy▼

Structure your answer:

Situation: "For my final year project, I built a Document Q&A system. The challenge was..."

Task: Explain the specific technical problem — not just what you built, but what was hard. "The challenge was that the LLM kept hallucinating facts not in the document."

Action: What you specifically did. "I implemented a faithfulness check using a second LLM call to verify each answer against retrieved chunks. I also added citation markup so users could see exactly which chunk each answer came from."

Result: Quantify if possible. "This reduced hallucination from 23% on our test set to under 5%. The app now serves accurate, cited responses."

Key: Show depth of thinking, not just feature list. Interviewers want to see how you think about problems.

Why did you choose [tech stack] for your project?

Easy▼

This question tests whether you made deliberate engineering decisions vs just following a tutorial.

Good answer structure:
1. State the alternatives you considered (1–2 others)
2. The specific reason you chose what you did (trade-off)
3. What you'd change with more time/resources

Example: "I chose ChromaDB over Pinecone because it runs locally with no API costs — important for a student project with limited budget. For production, I'd migrate to Pinecone or Weaviate for better scalability and managed infrastructure. I chose FastAPI over Flask because of automatic OpenAPI documentation and async support for concurrent document processing."

What would you improve in your project given more time?

Easy▼

This question is a gift — it shows you think critically about your own work.

Good areas to mention (pick 2-3):
• Evaluation: "I'd build a proper evaluation dataset with ground-truth Q&A pairs to measure retrieval precision and answer faithfulness systematically."
• Performance: "I'd add caching for repeated queries using Redis to reduce API costs and latency."
• Observability: "I'd add logging of every query, retrieved chunks, and response to identify failure patterns."
• Scale: "I'd containerise with Docker and add a task queue (Celery) for background document processing instead of blocking the UI."

Don't say: "I'd add more features." Naming specific technical improvements shows maturity.

How do you stay updated with the fast-moving AI field?

Easy▼

Be specific — vague answers like "I read articles" don't stand out.

Strong answer:
"I follow a few key sources: ArXiv Sanity (Andrej Karpathy's paper aggregator), Hugging Face blog for practical implementation insights, and Twitter/X accounts of ML researchers like Andrej Karpathy and Yann LeCun.

For staying practical: I do at least one Kaggle competition per month to apply what I learn. I also build small experiments — when LoRA came out, I fine-tuned a DistilBERT on a custom dataset to understand it hands-on, not just read about it.

I find that building something yourself, even a toy version, makes concepts stick much better than reading alone."

Interview Formats

What to Expect by Company Type

🇮🇳 Indian IT (TCS, Infosys, Wipro, Cognizant)

MCQ + programming round (HackerRank)
ML fundamentals (bias-variance, metrics, algorithms)
Python basics + SQL
Project explanation (know your project deeply)

🚀 Product Startups (Swiggy, Zomato, Meesho)

ML system design (recommendation, ranking)
Applied ML case studies
LeetCode medium problems (DSA)
Business metric optimisation

🌍 MNCs (Google, Microsoft, Amazon)

LeetCode Hard (for SDE roles)
Deep ML fundamentals + research awareness
System design at scale
Leadership principles / behavioral (Amazon)

🤖 AI-First Companies (Sarvam, Krutrim, Dhruva)

Deep LLM/GenAI knowledge
Fine-tuning, RAG, evaluation experience
Hands-on: live coding with HuggingFace/LangChain
Research awareness (recent papers)
