Linear Regression
The foundation of supervised learning. Linear regression models the relationship between input features and a continuous output by fitting a straight line (or hyperplane) through the data.
Plain-English Explanation
Linear regression assumes that the output (label) is a weighted sum of the input features plus a bias term: ŷ = w₁x₁ + w₂x₂ + … + b. Training finds the weights and bias that minimize the average squared prediction error (mean squared error) across all training examples. Think of it as fitting the "best straight line" through scattered data points.
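To make the weighted-sum idea concrete, here is a minimal NumPy sketch that fits the weights and bias by least squares. The synthetic data, the "true" weights `[3.0, -2.0]`, the bias `0.5`, and all variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic, noise-free data (assumption for illustration): y = 3*x1 - 2*x2 + 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_w = np.array([3.0, -2.0])
true_b = 0.5
y = X @ true_w + true_b

# Append a column of ones so the bias is learned as one more weight
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution: the weights that minimize the mean squared error
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = theta[:2], theta[2]

y_hat = X @ w + b                      # weighted sum of features plus bias
mse = np.mean((y - y_hat) ** 2)
print("weights:", w, "bias:", b, "MSE:", mse)
```

Because the data here has no noise, the recovered weights and bias should match the ones used to generate it; on real data they will only approximate the underlying relationship.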
When to Use / When Not to Use
Use when:
- Output is a continuous number (price, temperature, score)
- You need interpretability (explainable coefficients)
- Features have a roughly linear relationship with the target
- Baseline model before trying complex approaches
Avoid when:
- Output is categorical (use classification instead)
- Features have strong non-linear interactions
- Data has many outliers (OLS is sensitive to them)
- Features are highly correlated (multicollinearity)
Algorithm Variants
| Variant | Key difference | When to use |
|---|---|---|
| Ordinary Least Squares (OLS) | Minimizes MSE directly | Baseline, small datasets |
| Ridge (L2 regularization) | Penalizes large weights | Multicollinearity, many features |
| Lasso (L1 regularization) | Can zero out weights | Feature selection needed |
| ElasticNet | L1 + L2 combined | Many features, some correlated |
| Polynomial Regression | Adds x², x³ terms | Curved relationships |
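As a rough illustration of why Ridge helps with multicollinearity, the sketch below uses the standard closed-form ridge solution (without an intercept) on two nearly identical features. The data and the helper name `ridge_weights` are assumptions for illustration; in practice you would use `sklearn.linear_model.Ridge`.

```python
import numpy as np

def ridge_weights(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X^T X + alpha*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)   # nearly a copy of x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=100)

w_ols = ridge_weights(X, y, alpha=0.0)    # alpha = 0 reduces to plain OLS
w_ridge = ridge_weights(X, y, alpha=1.0)  # the L2 penalty shrinks the weights
print("OLS weights:  ", w_ols)
print("Ridge weights:", w_ridge)
```

With correlated features, OLS weights can swing to large offsetting values, while the penalized solution never has a larger norm than the OLS one.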
Key Metrics
| Metric | Formula | What it tells you |
|---|---|---|
| MAE | mean(abs(y − ŷ)) | Average absolute error; less sensitive to outliers than MSE |
| MSE | mean((y − ŷ)²) | Penalizes large errors more heavily |
| RMSE | √MSE | Same units as the target, so easier to interpret than MSE |
| R² Score | 1 − SS_res/SS_tot | Fraction of variance explained (1.0 = perfect) |
| Adjusted R² | 1 − (1 − R²)(n − 1)/(n − p − 1) | Penalizes the number of features p (n = samples); fairer comparison across models |
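The metrics in the table can be computed by hand on a tiny example, which helps when sanity-checking library output. The four target and prediction values below are made up for illustration, and adjusted R² uses the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1) with an assumed single feature.

```python
import numpy as np

# Small hand-made example (values are illustrative assumptions)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

errors = y_true - y_pred                          # [1, 0, -1, -3]
mae = np.mean(np.abs(errors))                     # (1 + 0 + 1 + 3) / 4
mse = np.mean(errors ** 2)                        # (1 + 0 + 1 + 9) / 4
rmse = np.sqrt(mse)

ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot

n, p = len(y_true), 1                             # p = number of features (assumed 1)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f} adjR2={adj_r2:.3f}")
```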
Code Example
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
# X (feature matrix) and y (continuous target) are assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train OLS
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Evaluate on the held-out test set
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R²: {r2:.3f} RMSE: {rmse:.3f}")
# Coefficients
print("Intercept:", model.intercept_)
print("Weights:", model.coef_)
# With Ridge regularization (scale first so the penalty treats all features equally)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X_train, y_train)
Common Mistakes
⚠ Watch-outs
- Not scaling features when using regularized variants (Ridge and Lasso penalize weight size, so standardize features first, e.g. with StandardScaler)
- Ignoring the residuals plot: always check for patterns in the errors
- Using R² alone without checking absolute error magnitude (MAE or RMSE)
- Assuming linearity without plotting feature vs. target scatter
- Data leakage: fitting the scaler on all data before splitting (fit it on the training set only)
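The data-leakage item is easiest to see in code. This is a minimal NumPy sketch, on synthetic data, of the safe order of operations: split first, learn the scaling statistics from the training rows only, then apply them to both splits. It mirrors what `StandardScaler.fit` on the training set followed by `transform` on the test set does; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))   # synthetic features (assumption)

# Split first (here: a simple 80/20 split on shuffled row indices)
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]
X_train, X_test = X[train_idx], X[test_idx]

# Learn the scaling statistics from the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Apply the same training statistics to both splits
X_train_sc = (X_train - mean) / std
X_test_sc = (X_test - mean) / std
```

Computing `mean` and `std` on all 100 rows instead would let information about the test rows influence preprocessing, which tends to make test scores look better than they should.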
Project Idea
💡 House Price Predictor
Build a Streamlit app that predicts property prices using area, location, and bedroom count. Train on a public dataset (for example Bengaluru Housing or California Housing; note that the Boston Housing dataset was removed from scikit-learn in version 1.2). Add Ridge regularization and compare against OLS. Great for understanding feature importance and model transparency.
Classification
Predict which category a data point belongs to. Classification is one of the most common ML tasks in placement interviews and practical projects.
Plain-English Explanation
Instead of predicting a number (regression), classification predicts a label: spam or not spam, disease or healthy, fraud or legitimate. The model learns a decision boundary that separates classes. Logistic regression is one of the simplest classifiers: despite the name, it is a classification model that outputs the probability of belonging to a class.
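As a small sketch of how logistic regression turns a weighted sum into a probability and then a label, consider the following. The weights, bias, and input values are hypothetical and not fitted to any data.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of the positive class for one example."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical weights for two features (illustrative, not fitted to any data)
w, b = [1.5, -2.0], 0.25
p = predict_proba([2.0, 1.0], w, b)   # score z = 3.0 - 2.0 + 0.25 = 1.25
label = 1 if p >= 0.5 else 0          # 0.5 is the conventional default threshold
print(f"p = {p:.3f}, label = {label}")
```

The decision boundary is the set of points where the score z equals 0, which is where the predicted probability is exactly 0.5.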
When to Use / When Not to Use
Use when:
- Output is a category (yes/no, A/B/C)
- You need probability estimates
- Multi-class output is required
Avoid when:
- Target is continuous (use regression)
- Extreme class imbalance without handling
- Very few labeled samples per class
Key Metrics
| Metric | Formula | When to prioritize |
|---|---|---|
| Accuracy | (TP+TN)/total | Balanced classes |
| Precision | TP/(TP+FP) | Cost of false positives is high (spam filter) |
| Recall | TP/(TP+FN) | Cost of false negatives is high (cancer detection) |
| F1 Score | 2·(P·R)/(P+R) | Imbalanced classes, need balance of P and R |
| ROC-AUC | Area under ROC curve | Threshold-independent performance |
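The formulas in the table can be checked on a small imbalanced example. This is a plain-Python sketch; the helper name `classification_metrics` and the toy labels are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Imbalanced toy labels: 2 positives among 10 examples
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
m = classification_metrics(y_true, y_pred)
print(m)
```

Here accuracy looks comfortable at 0.8 while precision, recall, and F1 are all 0.5, which is the gap the table warns about for imbalanced classes.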
Code Example
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_sc, y_train)
y_pred = clf.predict(X_test_sc)
y_proba = clf.predict_proba(X_test_sc)[:, 1]
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
Common Mistakes
⚠ Watch-outs
- Using accuracy alone on imbalanced datasets (use F1 or AUC instead)
- Not applying SMOTE or class_weight when classes are skewed
- Confusing precision and recall: always ask which error is costlier
- Choosing the threshold at 0.5 without evaluating the full ROC curve
Project Idea
💡 Placement Predictor
Build a logistic regression model to predict whether a student will get placed based on CGPA, internships, projects, and branch. Add a confusion matrix visualization and deploy on Streamlit.
Tree-Based Learning
Decision trees, random forests, and their variants. One of the most frequently interview-tested families of ML algorithms.
Plain-English Explanation
A decision tree splits data by asking yes/no questions about features. At each node, it picks the split that best separates the classes (measured by Gini impurity or entropy). Random Forest builds many trees on random subsets of data and features, then averages their outputs (or takes a majority vote for classification), reducing variance through ensemble averaging.
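To show what "picks the split that best separates the classes" means, here is a minimal plain-Python sketch that scores candidate thresholds on one numeric feature by weighted Gini impurity. The function names and the toy data are assumptions for illustration; real tree implementations search over all features and are far more optimized.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 = pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try each midpoint between sorted unique values; return (threshold, weighted Gini)."""
    uniq = sorted(set(values))
    best = (None, float("inf"))
    for lo, hi in zip(uniq, uniq[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Toy feature (hours studied) and labels (illustrative assumption)
hours = [1, 2, 3, 6, 7, 8]
passed = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_threshold(hours, passed)
print("split at", threshold, "with weighted Gini", impurity)
```

A full tree repeats this search recursively on each side of the chosen split until a stopping rule such as `max_depth` is reached.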
Key Metrics & Hyperparameters
| Parameter | Effect | Typical range |
|---|---|---|
| max_depth | Controls tree depth, limits overfitting | 3–15 |
| n_estimators (RF) | More trees = lower variance, more compute | 100–500 |
| min_samples_split | Min samples to allow a split | 2–20 |
| max_features | Features considered per split | sqrt(n), log2(n) |
| criterion | Split quality measure | gini, entropy |
Code Example
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
rf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
rf.fit(X_train, y_train)
# Feature importance (feature_names is assumed to hold the column names, e.g. list(X_train.columns) for a DataFrame)
fi = pd.Series(rf.feature_importances_, index=feature_names)
fi.sort_values(ascending=False).plot(kind='bar')
Common Mistakes
⚠ Watch-outs
- Single decision trees overfit easily: prefer ensembles (RF or Boosting), or constrain the tree with max_depth and min_samples_split
- Assuming feature importance from RF handles correlated features well (it does not)
- Forgetting that RF can still overfit if max_depth is unconstrained
Boosting
XGBoost, LightGBM, and gradient boosting. A family of algorithms that frequently wins Kaggle competitions on tabular data. It sequentially trains weak learners, each correcting the previous one's errors.
Plain-English Explanation
Boosting trains trees sequentially. Each new tree focuses on the data points the previous ensemble got wrong. Gradient Boosting does this by fitting each tree to the residual errors (more generally, to the negative gradient of the loss). XGBoost and LightGBM add regularization, histogram-based splits, and speed optimizations on top of this idea.
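The "fit each tree to the residual errors" step can be sketched with one-split regression stumps in NumPy. This is a simplified illustration for squared-error regression on a synthetic 1-D problem, not how XGBoost or LightGBM are implemented; the helper `fit_stump` and all data are assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (minimizes squared error)."""
    best = None
    for t in np.unique(x)[:-1]:                  # skip the max so both sides are non-empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_val, right_val = best
    return lambda q: np.where(q <= t, left_val, right_val)

rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)    # synthetic target (assumption)

learning_rate = 0.1
pred = np.full_like(y, y.mean())                 # start from a constant prediction
mse_history = [np.mean((y - pred) ** 2)]
for _ in range(50):
    residual = y - pred                          # what the ensemble still gets wrong
    stump = fit_stump(x, residual)               # next weak learner fits the residuals
    pred = pred + learning_rate * stump(x)       # take a small corrective step
    mse_history.append(np.mean((y - pred) ** 2))

print(f"train MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

The training error keeps shrinking as stumps are added, which is also why a validation set is needed to decide when to stop.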
Algorithm Comparison
| Algorithm | Speed | Memory | Best for |
|---|---|---|---|
| Gradient Boosting (sklearn) | Slow | Medium | Small-medium datasets, baseline |
| XGBoost | Fast | Higher | Structured data, Kaggle |
| LightGBM | Fastest | Low | Large datasets, categorical features |
| CatBoost | Fast | Medium | Many categorical features |
Code Example
import xgboost as xgb
from sklearn.model_selection import cross_val_score
# X and y are assumed to be loaded already (binary labels encoded as 0/1)
model = xgb.XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss'  # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted
)
# 5-fold cross-validated ROC-AUC
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
⚠ Watch-outs
- A low learning_rate requires more n_estimators: tune the two together
- Early stopping on a validation set prevents overfitting and saves compute
- XGBoost handles missing values natively: don't impute blindly before using it
Support Vector Machines
SVMs find the maximum-margin hyperplane that separates classes. They are powerful for high-dimensional data and work well when classes are clearly separable.
Plain-English Explanation
SVM tries to find the widest possible "street" (margin) between two classes. The data points that lie on the edge of the margin are called support vectors. The kernel trick lets SVMs work in higher-dimensional spaces without explicitly computing coordinates in those dimensions, which enables non-linear decision boundaries.
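A small numeric check can make the kernel trick less abstract. The sketch below compares a degree-2 polynomial kernel, computed in the original 2-D space, with the dot product of an explicit 3-D feature map. The two input vectors are arbitrary illustrative values.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed directly in the original 2-D space."""
    return (x @ z) ** 2

def explicit_features(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly_kernel(x, z)                             # one dot product in 2-D
k_explicit = explicit_features(x) @ explicit_features(z)   # same value via the 3-D map
print(k_implicit, k_explicit)
```

Both routes give the same number, but the kernel never builds the higher-dimensional vectors. For the RBF kernel the implicit feature space is infinite-dimensional, so the explicit route is not even available.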
Kernel Options
| Kernel | When to use |
|---|---|
| Linear | Linearly separable data, text classification |
| RBF (Radial Basis Function) | Non-linear data, most common default |
| Polynomial | Polynomial feature interactions |
| Sigmoid | Neural network-like, less common |
Code Example
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale', probability=True))
])
pipe.fit(X_train, y_train)
⚠ Watch-outs
- Kernel SVMs are slow on large datasets (training time grows at least quadratically with the number of samples): consider XGBoost or Random Forest when n is above roughly 50k
- Always scale features before SVM: it is margin-based and very sensitive to scale
- Hyperparameter C controls the trade-off between margin width and misclassifications
Clustering & Unsupervised Learning
Find natural groups in data without labels. Used in customer segmentation, document grouping, image compression, and anomaly detection preprocessing.
Algorithm Comparison
| Algorithm | Must specify K? | Handles noise? | Best for |
|---|---|---|---|
| K-Means | Yes | No | Compact spherical clusters |
| DBSCAN | No | Yes | Arbitrary shapes, outlier detection |
| Hierarchical | No (cut the dendrogram afterwards) | Partially | Dendrogram visualization |
| GMM | Yes | Partially (soft assignments) | Overlapping clusters, soft assignment |
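For intuition about what K-Means does internally, here is a minimal NumPy sketch of Lloyd's algorithm (assign each point to its nearest centroid, then recompute centroids). It uses plain random initialization rather than the k-means++ scheme scikit-learn defaults to, and the two synthetic blobs are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(30, 2))
X = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(X, k=2)
print("centroids:\n", centroids)
```

Because every point is assigned to some centroid, an outlier always lands in a cluster and pulls its centroid, which is the "Handles noise? No" entry in the table.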
Code Example
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Elbow method to choose K: plot inertia against k and look for the bend
inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)
# Silhouette score for validation
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)
print(f"Silhouette: {score:.3f}")  # ranges from -1 to 1; higher is better
Anomaly Detection
Identify unusual data points: fraud, system failures, manufacturing defects. Most anomaly problems are highly imbalanced.
Key Approaches
| Method | How it works | Best for |
|---|---|---|
| Isolation Forest | Anomalies are easier to isolate in random trees | Tabular data, general use |
| Local Outlier Factor | Compares density to k nearest neighbors | Cluster-based outliers |
| Z-Score | Flags points beyond n standard deviations | Gaussian, simple baselines |
| Autoencoder | High reconstruction error = anomaly | Complex patterns, images |
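The Z-Score row is the simplest of these methods to write out by hand. The sketch below flags points more than three standard deviations from the mean; the synthetic sensor readings, the injected anomaly, and the helper name `zscore_outliers` are assumptions for illustration.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Boolean mask of points more than `threshold` standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

rng = np.random.default_rng(0)
readings = rng.normal(loc=50.0, scale=2.0, size=200)   # synthetic sensor data (assumption)
readings[10] = 90.0                                    # inject one obvious anomaly

mask = zscore_outliers(readings)
print("anomalous indices:", np.flatnonzero(mask))
```

Note that the anomaly itself inflates the mean and standard deviation, so this baseline becomes less sensitive as the share of anomalies grows. Median-based variants are a common workaround.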
Code Example
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)
# -1 = anomaly, 1 = normal
anomalies = X[labels == -1]
Naive Bayes & LDA
Probabilistic classifiers grounded in Bayes' theorem. Fast, interpretable, and surprisingly effective for text classification and small datasets.
Plain-English Explanation
Naive Bayes applies Bayes' theorem with the "naive" assumption that all features are independent given the class. Despite this oversimplification, it often works well for text (bag-of-words). LDA (Linear Discriminant Analysis) finds the linear combination of features that best separates classes, and it also doubles as a dimensionality reduction tool.
Variants
| Variant | Input type | Best for |
|---|---|---|
| GaussianNB | Continuous features | Real-valued features |
| MultinomialNB | Count features | Text, word frequency |
| BernoulliNB | Binary features | Binary feature vectors |
| LDA | Continuous | Multi-class + dimensionality reduction |
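To show how the independence assumption is used in practice, here is a minimal from-scratch sketch of Gaussian Naive Bayes in NumPy. The function names and the two synthetic classes are assumptions for illustration; `sklearn.naive_bayes.GaussianNB` is the practical choice.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class prior, feature means and feature variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # A small constant keeps the variance strictly positive
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params

def predict_gaussian_nb(params, x):
    """Pick the class with the highest log prior plus summed per-feature log likelihood."""
    best_class, best_score = None, -np.inf
    for c, (prior, mean, var) in params.items():
        # Independence assumption: the joint log likelihood is a sum over features
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        score = np.log(prior) + log_lik
        if score > best_score:
            best_class, best_score = c, score
    return best_class

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), rng.normal(4.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

params = fit_gaussian_nb(X, y)
pred_low = predict_gaussian_nb(params, np.array([0.2, -0.1]))
pred_high = predict_gaussian_nb(params, np.array([3.8, 4.3]))
print(pred_low, pred_high)
```

Working in log space avoids numerical underflow when many per-feature likelihoods are multiplied together.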
Time Series Forecasting
Predict future values based on historical sequences. Used in stock prices, sales forecasting, energy demand, and inventory planning.
Key Concepts
| Concept | Meaning |
|---|---|
| Stationarity | Mean and variance don't change over time. ARIMA expects the series to be stationary after differencing. |
| Seasonality | Repeating patterns at fixed intervals (weekly, monthly) |
| Trend | Long-term upward or downward movement in the data |
| Autocorrelation (ACF) | Correlation of the series with its own past values |
| Partial Autocorrelation (PACF) | Direct correlation at each lag, removing intermediate lags |
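A short NumPy sketch can tie several of these concepts together: a synthetic series with trend and period-12 seasonality, first differencing to remove the trend, and the sample autocorrelation at seasonal lags. The series and the helper `autocorr` are assumptions for illustration; `statsmodels` provides full ACF/PACF tooling.

```python
import numpy as np

def autocorr(series, lag):
    """Sample autocorrelation at a given lag (lag >= 1)."""
    s = np.asarray(series, dtype=float)
    s = s - s.mean()
    return np.sum(s[lag:] * s[:-lag]) / np.sum(s ** 2)

t = np.arange(120)
rng = np.random.default_rng(0)
# Synthetic monthly-style series (assumption): linear trend + period-12 seasonality + noise
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=1.0, size=t.size)

diffed = np.diff(series)     # first difference removes the linear trend

print("ACF lag 12 (differenced):", round(autocorr(diffed, 12), 3))
print("ACF lag 6  (differenced):", round(autocorr(diffed, 6), 3))
```

After differencing, the autocorrelation is strongly positive at the seasonal lag and strongly negative at half the period, which is the kind of pattern that points toward a seasonal model such as SARIMA.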
Algorithm Family
| Model | Best for |
|---|---|
| ARIMA | Stationary or differenced series, no seasonal component |
| SARIMA | Seasonal patterns with ARIMA |
| Prophet (Meta) | Business time series with holidays and multiple seasonalities |
| LSTM/GRU | Long sequences, complex non-linear patterns |
| XGBoost with lags | Tabular approach to time series |
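The "XGBoost with lags" row relies on reshaping a series into a supervised table. The sketch below shows one way to build lag features and a time-ordered split in NumPy; the helper `make_lag_features` and the six-point series are assumptions for illustration.

```python
import numpy as np

def make_lag_features(series, n_lags):
    """Turn a 1-D series into (X, y): each row of X holds the previous n_lags values."""
    s = np.asarray(series, dtype=float)
    X = np.column_stack([s[i : len(s) - n_lags + i] for i in range(n_lags)])
    y = s[n_lags:]
    return X, y

series = [10, 12, 13, 15, 18, 21]
X, y = make_lag_features(series, n_lags=3)
# Row 0 of X is [10, 12, 13] and its target y[0] is 15

# Time-ordered split: never shuffle, so the model is tested only on later points
split = 2
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]
print(X, y)
```

Any tabular regressor can then be trained on `X_train` and `y_train`. A random shuffle before splitting would leak future values into training.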
Project Idea
💡 Sales Forecast Dashboard
Build an Inventory Forecasting Dashboard using ARIMA and Prophet on retail sales data. Show actual vs. predicted with confidence intervals. Add Streamlit UI and a CSV upload feature. This maps directly to project-03 on this platform.
Hyperparameter Optimization
Systematically search for the best model configuration. It replaces manual tuning with automated, principled search.
Search Strategies
| Strategy | How it works | Speed | Quality |
|---|---|---|---|
| Grid Search | Try every combination in a defined grid | Slow | Exhaustive |
| Random Search | Sample random combinations from distributions | Faster | Often better than grid |
| Bayesian Optimization | Builds a surrogate model of the objective | Efficient | Best for expensive models |
| Optuna / Hyperopt | Libraries implementing the Tree-structured Parzen Estimator (TPE), a Bayesian-style method | Very efficient | Strong results in practice |
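The difference between the first two strategies can be sketched in plain Python with the same trial budget. The `objective` function below is a hypothetical stand-in for a cross-validated score, and the parameter ranges are illustrative assumptions.

```python
import itertools
import random

def objective(lr, depth):
    """Stand-in for a cross-validated score (hypothetical; peaks at lr=0.1, depth=6)."""
    return -((lr - 0.1) ** 2) * 100 - ((depth - 6) ** 2) * 0.01

# Grid search: evaluate every combination on a fixed grid
grid_lr = [0.01, 0.05, 0.3]
grid_depth = [3, 6, 9]
grid_trials = [(objective(lr, d), lr, d) for lr, d in itertools.product(grid_lr, grid_depth)]
best_grid = max(grid_trials)

# Random search: same budget (9 trials), but values are sampled from ranges
rng = random.Random(0)
random_trials = []
for _ in range(len(grid_trials)):
    lr = 10 ** rng.uniform(-2, -0.5)      # log-uniform between 0.01 and about 0.32
    depth = rng.randint(3, 12)            # inclusive on both ends
    random_trials.append((objective(lr, depth), lr, depth))
best_random = max(random_trials)

print("grid best:  ", best_grid)
print("random best:", best_random)
```

The grid only ever tries three distinct learning rates, while random search tries a different one in every trial. That is the usual argument for random search when only a few hyperparameters matter.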
Code Example: Optuna
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score
# X and y are assumed to be loaded already
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
    }
    model = xgb.XGBClassifier(**params)
    # Mean 3-fold ROC-AUC is the value Optuna tries to maximize
    score = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()
    return score
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)