ML Models at a Glance

When to use · Key params · Metrics · sklearn code

mitraaiprojects.com

Regression Models

Model	When to Use	Key Params
Linear Reg	Linear relationships, interpretability	—
Ridge (L2)	Multicollinearity, many features	alpha
Lasso (L1)	Feature selection, sparse data	alpha
ElasticNet	Correlated + irrelevant features	alpha, l1_ratio
Polynomial	Curved relationships (add features)	degree
SVR	Small-medium data, non-linear	C, kernel, epsilon
Random Forest	Non-linear, robust to outliers	n_estimators
XGBoost	Tabular, often best accuracy	learning_rate, n_estimators, max_depth
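
The "feature selection" note for Lasso versus the shrinkage behaviour of Ridge can be seen on synthetic data. This is a minimal sketch: the dataset, the `alpha=1.0` values and the variable names are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 20 features, only 5 of which drive the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Scale first: the L1/L2 penalties are sensitive to feature scale.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)

ridge_coef = ridge[-1].coef_
lasso_coef = lasso[-1].coef_

# Ridge shrinks coefficients but keeps them non-zero;
# Lasso can set some of them to exactly zero.
n_zero_ridge = int(np.sum(ridge_coef == 0))
n_zero_lasso = int(np.sum(lasso_coef == 0))
print("zero coefficients, Ridge:", n_zero_ridge)
print("zero coefficients, Lasso:", n_zero_lasso)
```

How many coefficients Lasso zeroes out depends on `alpha`; in practice it would be chosen by cross-validation (for example with `LassoCV`).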

Regression Metrics

Metric	Formula	Best for
MAE	mean(|y-ŷ|)	Outlier-robust
MSE	mean((y-ŷ)²)	Penalises large errors
RMSE	√MSE	Same units as target
R²	1−SS_res/SS_tot	Explained variance
MAPE	mean(|y-ŷ|/|y|)	Scale-free % (undefined when y = 0)
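
The formulas in the table can be checked by computing each metric by hand with NumPy and comparing against `sklearn.metrics`. The four-point `y_true` / `y_pred` arrays are made-up illustration values.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

# Toy values (assumption), chosen only to make the arithmetic easy to follow.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

err = y_true - y_pred
mae = np.mean(np.abs(err))
mse = np.mean(err ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
# MAPE as a fraction, matching sklearn; multiply by 100 for a percentage.
mape = np.mean(np.abs(err) / np.abs(y_true))

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f} MAPE={mape:.3f}")

# The hand-written formulas agree with the library implementations.
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
assert np.isclose(mape, mean_absolute_percentage_error(y_true, y_pred))
```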

Classification Models

Model	Strengths	Weaknesses	Key Params
Logistic Reg	Fast, interpretable, calibrated	Linear boundary only	C, max_iter
Decision Tree	Interpretable, no scaling	Overfits easily	max_depth, min_samples_split, min_samples_leaf
Random Forest	Robust, feature importances, handles missing values (recent scikit-learn versions only)	Slow, black box	n_estimators, max_depth
XGBoost	Best tabular accuracy, regularised	Many params	learning_rate, n_estimators, max_depth
LightGBM	Fastest boosting, big data	Overfits small data	num_leaves, min_data
SVM	Great for small data, non-linear via kernel	Slow on large data	C, kernel, gamma
KNN	Simple, no training, non-linear	Slow prediction, scaling needed	n_neighbors, metric
Naive Bayes	Fast, text, small data	Feature independence assumption	var_smoothing (GaussianNB), alpha (MultinomialNB)
MLP	Complex patterns, flexible	Black box, slow	hidden_layer_sizes, learning_rate_init
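
One way to act on the table is to cross-validate a few candidates on the same data before committing to one. The sketch below assumes scikit-learn's bundled breast-cancer dataset and default-ish hyperparameters; the `candidates` dictionary and its keys are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale-sensitive models (logistic regression, KNN) get a scaler;
# tree ensembles do not need one.
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold accuracy per model.
results = {name: cross_val_score(est, X, y, cv=5).mean()
           for name, est in candidates.items()}

for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} accuracy = {score:.3f}")
```

The ranking on one small dataset says little about other problems; the point is the workflow, not the winner.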

Classification Metrics

Metric	Formula	Use when
Accuracy	correct/total	Balanced classes
Precision	TP/(TP+FP)	FP costly (spam)
Recall	TP/(TP+FN)	FN costly (cancer)
F1	2PR/(P+R)	Imbalanced
AUC-ROC	Area under ROC	Ranking quality
PR-AUC	Area under P-R	Highly imbalanced
MCC	(TP·TN−FP·FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN))	Very imbalanced
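
The count-based formulas above can be reproduced directly from a confusion matrix. This is a small sketch on made-up binary labels; the arrays are illustration values, and the results are compared against `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

# Toy labels (assumption): 4 positives, 6 negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / np.sqrt(
    float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"acc={accuracy:.3f} P={precision:.3f} R={recall:.3f} "
      f"F1={f1:.3f} MCC={mcc:.3f}")

# The hand-written formulas agree with the library implementations.
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
assert np.isclose(mcc, matthews_corrcoef(y_true, y_pred))
```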

Unsupervised Models

Model	Use	Key Params
K-Means	Spherical clusters	n_clusters
DBSCAN	Arbitrary shape, outliers	eps, min_samples
Hierarchical	Dendrogram, variable K	n_clusters, linkage
PCA	Dimensionality reduction	n_components
t-SNE	Visualisation only	perplexity
Iso. Forest	Anomaly detection	contamination
LOF	Local density anomaly	n_neighbors
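
The "spherical clusters" versus "arbitrary shape" distinction shows up on two interleaved half-moons. This is a sketch under assumed settings: the `eps=0.2` and `min_samples=5` values were picked for this synthetic data and would need tuning elsewhere.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped clusters (synthetic, with ground-truth labels).
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # -1 marks outliers

# Adjusted Rand index: 1.0 means the clustering matches the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN  ARI: {ari_dbscan:.2f}")
```

K-Means splits the plane with a straight boundary, so it cuts across both crescents, while DBSCAN follows the dense regions. The ground-truth labels are used here only for scoring; real unsupervised problems usually do not have them.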

sklearn Cheat Sheet

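# LogisticRegression is a stand-in: swap in any estimator from the tables above
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline

# X, y: your feature matrix and labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Pipeline (prevents leakage: the scaler is fit on training data only)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Cross-validation ('roc_auc' as written assumes binary labels)
cv_scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(f"AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")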

Hyperparameter Tuning

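from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
import optuna  # third-party: pip install optuna

# Grid Search (exhaustive over the listed values)
grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

# Optuna (Bayesian-style search; the default sampler is TPE)
def objective(trial):
    C = trial.suggest_float('C', 0.01, 10, log=True)
    model = SVC(C=C)
    # Tune on the training split only, so the test set stays unseen
    return cross_val_score(model, X_train, y_train, cv=3).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)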
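
Tuning can also run over a whole `Pipeline`, so the scaler is re-fit inside every CV fold. A minimal sketch, assuming the bundled breast-cancer dataset and an illustrative grid; parameters of a pipeline step are addressed as `<step name>__<param>`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("model", SVC())])

# "model__C" targets the C parameter of the step named "model".
param_grid = {
    "model__C": [0.1, 1, 10],
    "model__gamma": ["scale", 0.01],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("best CV AUC:", round(grid.best_score_, 3))

# score() uses the same scorer (ROC AUC) on the held-out test set.
test_auc = grid.score(X_test, y_test)
print("test AUC:", round(test_auc, 3))
```

The grid here is deliberately tiny; wider ranges are usually better explored with a randomized or Bayesian search.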

Feature Engineering Quick Ref

Transform	Code
Log	`np.log1p(df["col"])`
Sqrt	`np.sqrt(df["col"])`
Box-Cox	`stats.boxcox(df["col"])[0]` (values must be > 0; `from scipy import stats`)
Bin	`pd.cut(df["col"],5)`
Interaction	`df["a"]*df["b"]`
Polynomial	`PolynomialFeatures(2)`
TF-IDF	`TfidfVectorizer()`
Label enc	`LabelEncoder()`
One-hot	`pd.get_dummies(df, columns=["col"])`
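
Several of the transforms above applied to one small DataFrame. The column names (`income`, `age`, `city`) and the values are hypothetical, chosen only to show the calls working together.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: one skewed numeric, one plain numeric, one categorical.
df = pd.DataFrame({
    "income": [20_000.0, 35_000.0, 50_000.0, 120_000.0, 400_000.0],
    "age": [22, 35, 47, 51, 63],
    "city": ["A", "B", "A", "C", "B"],
})

# Log: compresses the long right tail (log1p also handles zeros).
df["income_log"] = np.log1p(df["income"])

# Box-Cox: returns (transformed values, fitted lambda); input must be > 0.
df["income_boxcox"], lam = stats.boxcox(df["income"])

# Bin: 3 equal-width bins, returned as integer bin indices.
df["age_bin"] = pd.cut(df["age"], 3, labels=False)

# Interaction: product of two columns.
df["income_x_age"] = df["income"] * df["age"]

# One-hot: replaces "city" with one indicator column per category.
df = pd.get_dummies(df, columns=["city"])

# Polynomial: degree-2 terms of two columns (a, b, a^2, a*b, b^2).
poly = PolynomialFeatures(2, include_bias=False).fit_transform(
    df[["income_log", "age"]])

print(df.columns.tolist())
print(poly.shape)
```

In a real project the fitted transforms (Box-Cox lambda, bin edges, dummy columns) should come from the training split only and then be reused on the test split.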