| Model | When to Use | Key Param |
|---|---|---|
| Linear Reg | Linear relationships, interpretability | — |
| Ridge (L2) | Multicollinearity, many features | alpha |
| Lasso (L1) | Feature selection, sparse data | alpha |
| ElasticNet | Correlated + irrelevant features | alpha, l1_ratio |
| Polynomial | Curved relationships (add features) | degree |
| SVR | Small to medium datasets, non-linear | C, kernel, epsilon |
| Random Forest | Non-linear, robust to outliers | n_estimators |
| XGBoost | Tabular data, often top accuracy | learning_rate, n_estimators, max_depth |
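
As a rough illustration of the Ridge and Lasso rows, the sketch below fits both, plus plain linear regression, on synthetic data where only three of ten features carry signal. The data, the `alpha` values, and the variable names are assumptions chosen for the demo, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Synthetic ground truth (assumed): only the first three features matter
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}
n_zero = {}
for name, model in models.items():
    model.fit(X, y)
    # Count coefficients that are exactly zero (i.e. features dropped)
    n_zero[name] = int(np.sum(model.coef_ == 0))
    print(name, np.round(model.coef_, 2), "zero coefs:", n_zero[name])
```

Lasso typically sets the irrelevant coefficients to exactly zero, which is why it doubles as feature selection; Ridge only shrinks them towards zero.
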

| Metric | Formula | Best for |
|---|---|---|
| MAE | mean(\|y-ŷ\|) | Outlier-robust |
| MSE | mean((y-ŷ)²) | Penalises large errors |
| RMSE | √MSE | Same units as target |
| R² | 1−SS_res/SS_tot | Explained variance |
| MAPE | mean(\|y-ŷ\|/\|y\|) | Scale-free % |
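
The formulas above can be checked by hand on a tiny example. This is a minimal NumPy sketch; the four target and prediction values are made up for illustration.

```python
import numpy as np

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 2.0, 8.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])

err = y_true - y_pred                            # [1, 0, -2, 1]
mae = np.mean(np.abs(err))                       # mean absolute error
mse = np.mean(err ** 2)                          # mean squared error
rmse = np.sqrt(mse)                              # back in the target's units
ss_res = np.sum(err ** 2)                        # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
mape = np.mean(np.abs(err) / np.abs(y_true))     # undefined if any y_true is 0

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f} MAPE={mape:.3f}")
# MAE=1.000 MSE=1.500 RMSE=1.225 R2=0.714 MAPE=0.365
```

The same values are available from `sklearn.metrics` (`mean_absolute_error`, `mean_squared_error`, `r2_score`, `mean_absolute_percentage_error`).
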

| Model | Strengths | Weaknesses | Key Params |
|---|---|---|---|
| Logistic Reg | Fast, interpretable, calibrated | Linear boundary only | C, max_iter |
| Decision Tree | Interpretable, no scaling | Overfits easily | max_depth, min_samples_split |
| Random Forest | Robust, handles missing values, feature importances | Slow, black box | n_estimators, max_depth |
| XGBoost | Often best tabular accuracy, regularised | Many params | learning_rate, n_estimators, max_depth |
| LightGBM | Fastest boosting, big data | Overfits small data | num_leaves, min_data_in_leaf |
| SVM | Great for small data, non-linear via kernel | Slow on large data | C, kernel, gamma |
| KNN | Simple, no training, non-linear | Slow prediction, scaling needed | n_neighbors, metric |
| Naive Bayes | Fast, text, small data | Feature independence assumption | var_smoothing |
| MLP | Complex patterns, flexible | Black box, slow | hidden_layer_sizes, learning_rate_init |
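
To make the "linear boundary only" weakness concrete, the sketch below compares logistic regression with an RBF-kernel SVM on a synthetic two-moons dataset. The dataset, noise level, and hyperparameters are assumptions picked for the demo; on other data the ranking may differ.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data whose class boundary is curved
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

candidates = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression()),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
}
# Mean 5-fold accuracy per model
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```
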

| Metric | Formula | Use when |
|---|---|---|
| Accuracy | correct/total | Balanced classes |
| Precision | TP/(TP+FP) | FP costly (spam) |
| Recall | TP/(TP+FN) | FN costly (cancer) |
| F1 | 2·P·R/(P+R) | Imbalanced |
| AUC-ROC | Area under ROC curve | Ranking quality |
| PR-AUC | Area under P-R curve | Highly imbalanced |
| MCC | (TP·TN − FP·FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Very imbalanced |
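
A worked example helps show why accuracy alone misleads on imbalanced data. The confusion-matrix counts below are hypothetical (6% positives); the formulas are the ones from the table.

```python
import math

# Hypothetical confusion-matrix counts for an imbalanced binary problem
tp, fp, fn, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"acc={accuracy:.3f} P={precision:.3f} R={recall:.3f} F1={f1:.3f} MCC={mcc:.3f}")
# acc=0.970 P=0.800 R=0.667 F1=0.727 MCC=0.715
```

Accuracy looks excellent at 0.97 even though a third of the positives are missed, which recall, F1, and MCC all expose.
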

| Model | Use | Key Params |
|---|---|---|
| K-Means | Spherical clusters | n_clusters |
| DBSCAN | Arbitrary shape, outliers | eps, min_samples |
| Hierarchical | Dendrogram, variable K | n_clusters, linkage |
| PCA | Dimensionality reduction | n_components |
| t-SNE | Visualisation only | perplexity |
| Isolation Forest | Anomaly detection | contamination |
| LOF | Local density anomaly | n_neighbors |
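
The K-Means and DBSCAN rows differ most visibly in how they treat outliers. This sketch uses three synthetic blobs plus one far-away point; the centres, `eps`, and `min_samples` are assumed values that suit this toy data only.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs (100 points each) ...
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.5, random_state=0)
# ... plus a single obvious outlier appended as the last row
X = np.vstack([X, [[20.0, 20.0]]])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print("K-Means label of outlier:", km_labels[-1])   # always one of 0, 1, 2
print("DBSCAN label of outlier:", db_labels[-1])    # -1 marks noise
print("DBSCAN clusters found:", len(set(db_labels) - {-1}))
```

K-Means must assign every point to some cluster, so the outlier is absorbed into one of them; DBSCAN labels it `-1` (noise) and does not need the number of clusters up front.
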
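
from sklearn.linear_model import LogisticRegression  # swap in any sklearn estimator
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline

# X, y: feature matrix and labels, assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Pipeline (prevents leakage: the scaler is fit on training data only)
params = {'C': 1.0, 'max_iter': 1000}  # example values
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(**params)),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Cross-validation ('roc_auc' as written assumes a binary target)
cv_scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(f"AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")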
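
To see what "prevents leakage" buys you, here is a sketch on pure-noise data where any score above chance is an artefact. It swaps the scaler for feature selection (`SelectKBest`), because selection makes the leak far more visible than scaling does; the data sizes and `k=20` are assumptions for the demo.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

# Leaky: features are chosen using ALL rows, including later validation folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Leak-free: the selector is refit on the training part of every fold
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:    {leaky:.3f}")
print(f"pipeline CV accuracy: {honest:.3f}")
```

The leaky estimate looks far better than chance although there is nothing to learn; the pipeline estimate should stay close to 0.5.
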

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
import optuna

# Grid Search
grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

# Optuna (Bayesian; the default sampler is TPE)
def objective(trial):
    C = trial.suggest_float('C', 0.01, 10, log=True)
    model = SVC(C=C)
    # Tune on the training split only, so the test set stays untouched
    return cross_val_score(model, X_train, y_train, cv=3).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)
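
When the model sits inside a pipeline, the search should wrap the whole pipeline so preprocessing is refit per fold. A minimal sketch, assuming the iris dataset and an arbitrary small grid; the step names `scaler` and `model` are just labels chosen here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("model", SVC())])
# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {"model__C": [0.1, 1, 10], "model__gamma": ["scale", 0.1]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))
# The refit best pipeline is scored once on the held-out split
print("held-out accuracy:", round(grid.score(X_test, y_test), 3))
```
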

| Transform | Code |
|---|---|
| Log | np.log1p(df["col"]) |
| Sqrt | np.sqrt(df["col"]) |
| Box-Cox (values > 0) | stats.boxcox(df["col"])[0] |
| Bin | pd.cut(df["col"],5) |
| Interaction | df["a"]*df["b"] |
| Polynomial | PolynomialFeatures(2) |
| TF-IDF | TfidfVectorizer() |
| Label enc | LabelEncoder() |
| One-hot | pd.get_dummies(df["col"]) |
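
A few of these transforms applied to a toy frame, as a quick sanity check of what each one returns. The column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data: a skewed numeric column and a categorical column
df = pd.DataFrame({
    "income": [0.0, 9.0, 99.0, 999.0],
    "city": ["a", "b", "a", "c"],
})

# log(1 + x): compresses the long tail and is defined at x = 0
df["income_log"] = np.log1p(df["income"])
# Three equal-width bins, returned as integer codes instead of intervals
df["income_bin"] = pd.cut(df["income"], 3, labels=False)
# One indicator column per category
dummies = pd.get_dummies(df["city"], prefix="city")

out = pd.concat([df, dummies], axis=1)
print(out)
```

Equal-width binning on skewed data puts most rows in the first bin, as here; `pd.qcut` is the usual alternative when roughly equal bin counts are wanted.
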