| Layer | Use | Key Params |
|---|---|---|
| nn.Linear(in, out) | Fully connected | in_features, out_features |
| nn.Conv2d(in, out, k) | 2D convolution | in_channels, out_channels, kernel_size, stride, padding |
| nn.MaxPool2d(k) | Spatial downsampling | kernel_size, stride |
| nn.AvgPool2d(k) | Average pooling | kernel_size, stride |
| nn.BatchNorm2d(ch) | Normalise per channel across the batch | num_features |
| nn.LayerNorm(d) | Normalise per sample across features | normalized_shape |
| nn.Dropout(p) | Regularisation | p (default 0.5) |
| nn.Embedding(n, d) | Learned token embeddings | num_embeddings, embedding_dim |
| nn.LSTM(in, h, L) | Sequence modelling | input_size, hidden_size, num_layers |
| nn.GRU(in, h, L) | Lighter sequence modelling | same as LSTM; 3 gates instead of 4, so about 25% fewer params |
| nn.MultiheadAttention | Self- and cross-attention | embed_dim, num_heads |
| nn.TransformerEncoder | Stack of encoder layers | encoder_layer, num_layers |
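
The shape and parameter arithmetic behind a few of these layers can be sketched in plain Python. The helper names `conv2d_out_size` and `rnn_param_count` are made up for this sketch, and the formulas assume dilation 1, a single direction, and biases enabled.

```python
def conv2d_out_size(n: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Output height or width of a convolution along one axis (dilation 1)."""
    return (n + 2 * padding - kernel_size) // stride + 1


def rnn_param_count(input_size: int, hidden_size: int, num_layers: int, gates: int) -> int:
    """Trainable parameters of a stacked, unidirectional recurrent layer with biases.

    gates=4 corresponds to an LSTM, gates=3 to a GRU. Each gate block has an
    input weight matrix, a hidden weight matrix, and two bias vectors.
    """
    total = 0
    for layer in range(num_layers):
        # Layers after the first consume the previous layer's hidden state
        in_dim = input_size if layer == 0 else hidden_size
        total += gates * (hidden_size * in_dim + hidden_size * hidden_size + 2 * hidden_size)
    return total


same_padding = conv2d_out_size(32, kernel_size=3, stride=1, padding=1)
stem = conv2d_out_size(224, kernel_size=7, stride=2, padding=3)
lstm_params = rnn_param_count(10, 20, num_layers=1, gates=4)
gru_params = rnn_param_count(10, 20, num_layers=1, gates=3)
print(same_padding, stem, lstm_params, gru_params, gru_params / lstm_params)
```

A GRU has three gate blocks where an LSTM has four, so for the same sizes it carries three quarters of the parameters.
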

| Activation | Formula | Use |
|---|---|---|
| ReLU | max(0, x) | Default for hidden layers: cheap, and the gradient does not vanish for x > 0 |
| LeakyReLU | x if x > 0 else 0.01x | Mitigates the dying-ReLU problem |
| GELU | x·Φ(x), Φ = standard normal CDF | Transformers; smoother than ReLU |
| Sigmoid | 1/(1+e^(−x)) | Binary output layer; saturates, so avoid in hidden layers |
| Tanh | (e^x − e^(−x))/(e^x + e^(−x)) | Zero-centred; RNNs |
| Softmax | e^(x_i) / Σ_j e^(x_j) | Multi-class output (probabilities) |
| Swish (SiLU) | x·sigmoid(x) | EfficientNet; smooth, non-monotonic |
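
The formulas in the table translate almost directly into code. The following is a minimal scalar sketch using only the standard library, meant to show the arithmetic rather than replace the framework versions. The numerically stable forms of sigmoid and softmax are a choice made for this sketch.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, slope: float = 0.01) -> float:
    return x if x > 0 else slope * x


def gelu(x: float) -> float:
    # Phi(x), the standard normal CDF, written with the error function
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x: float) -> float:
    # Branch on the sign so exp() never receives a large positive argument
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    return x * sigmoid(x)


def softmax(xs: list[float]) -> list[float]:
    # Subtracting the max leaves the result unchanged but avoids overflow
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1000.0, 1001.0, 1002.0])
print(relu(-2.0), leaky_relu(-2.0), sigmoid(0.0), round(sum(probs), 6))
```

Tanh is available as `math.tanh`. The non-monotonic shape of Swish shows up for negative inputs: the function dips below zero and then climbs back towards it as x becomes more negative.
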

| Loss | Task | PyTorch |
|---|---|---|
| MSE | Regression | nn.MSELoss() |
| MAE / L1 | Regression (robust to outliers) | nn.L1Loss() |
| BCE | Binary classification | nn.BCEWithLogitsLoss() (takes raw logits) |
| Cross-Entropy | Multi-class | nn.CrossEntropyLoss() (takes raw logits) |
| KL Divergence | Matching distributions | nn.KLDivLoss() (input as log-probabilities) |
| Contrastive | Similarity learning | custom |
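
As a rough illustration of what the first four losses compute, here is a plain-Python sketch for single examples. The function names are chosen for this sketch. The two classification losses take raw logits and use the usual stable rearrangements, so no separate sigmoid or softmax step is needed.

```python
import math


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def bce_with_logits(logit: float, label: float) -> float:
    # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit)
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def cross_entropy(logits, target_index: int) -> float:
    # -log(softmax(logits)[target_index]), computed via log-sum-exp
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


print(mse([1.0, 2.0], [1.0, 4.0]), mae([1.0, 2.0], [1.0, 4.0]))
print(bce_with_logits(0.0, 1.0), cross_entropy([0.0, 0.0], 0))
```

With a logit of 0 the model is maximally uncertain, so both classification losses come out at ln 2 for two classes.
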

| Optimiser | Best for | Key Params |
|---|---|---|
| Adam | Default for most problems | lr=3e-4, betas=(0.9, 0.999) |
| AdamW | Transformers; decoupled weight decay | lr=3e-4, weight_decay=0.01 |
| SGD+momentum | CNNs; often better final accuracy | lr=0.01, momentum=0.9 |
| RMSprop | RNNs, non-stationary objectives | lr=1e-3 |
| Lion | Large models; memory efficient (one state buffer per parameter) | lr=3e-5 |
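
To show what the key parameters in the table actually control, the update rules for SGD with momentum and for Adam/AdamW can be written out for a single scalar weight. This is a simplified sketch with made-up function names, not the library implementation. It omits options such as dampening, Nesterov momentum and AMSGrad.

```python
import math


def sgd_momentum_step(w, grad, buf, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update; returns the new weight and momentum buffer."""
    buf = momentum * buf + grad
    return w - lr * buf, buf


def adamw_step(w, grad, m, v, t, lr=3e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
    """One Adam update at step t (1-based); weight_decay > 0 gives the AdamW variant.

    The decay shrinks the weight directly instead of being added to the gradient.
    """
    beta1, beta2 = betas
    w = w * (1.0 - lr * weight_decay)          # decoupled weight decay
    m = beta1 * m + (1.0 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


w_sgd, buf = sgd_momentum_step(1.0, 2.0, 0.0)
w_adam, m, v = adamw_step(1.0, 2.0, 0.0, 0.0, t=1)
w_adamw, _, _ = adamw_step(1.0, 2.0, 0.0, 0.0, t=1, weight_decay=0.01)
print(w_sgd, w_adam, w_adamw)
```

On the first step the bias-corrected Adam update has magnitude close to `lr` regardless of the gradient's scale, which is one reason Adam's learning rates are typically much smaller than SGD's.
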

| Technique | Effect | Where |
|---|---|---|
| Dropout | Randomly zeroes activations during training | Dense/FC layers |
| L2 / Weight Decay | Penalises large weights | optimizer(..., weight_decay=0.01) |
| BatchNorm | Normalises activations; batch statistics add mild noise | After conv/linear |
| Data Augmentation | More effective training data | Image transforms |
| Early Stopping | Stops before overfitting | Monitor val_loss |
| Gradient Clipping | Prevents exploding gradients | clip_grad_norm_(model.parameters(), max_norm=1.0) |
| Label Smoothing | Softens targets | CrossEntropyLoss(label_smoothing=0.1) |
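
Three of these techniques are short enough to sketch in plain Python: gradient clipping by global norm, label smoothing, and early stopping on `val_loss`. All names below are hypothetical helpers written for this sketch. The early-stopping rule (stop after `patience` epochs without a new best) is one common convention, not the only one.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads]


def smooth_labels(target_index, num_classes, smoothing=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = smoothing / num_classes
    return [uniform + (1.0 - smoothing if k == target_index else 0.0)
            for k in range(num_classes)]


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without a new best val_loss."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


clipped = clip_by_global_norm([3.0, 4.0])
targets = smooth_labels(2, num_classes=4)
stopper = EarlyStopping(patience=2)
decisions = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.96, 0.97]]
print(clipped, targets, decisions)
```

Gradients already inside the norm budget pass through unchanged, and the smoothed target still sums to 1.
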
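
```python
import torch
import torch.nn as nn

# MyModel, train_loader, val_loader and num_epochs are assumed to be defined elsewhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MyModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    # Training pass
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        out = model(X)
        loss = criterion(out, y)
        loss.backward()
        # Clip the global gradient norm to 1.0 before the update
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    scheduler.step()  # once per epoch; T_max counts scheduler steps

    # Validation pass: eval mode for dropout/BatchNorm, no gradient tracking
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for X, y in val_loader:
            out = model(X.to(device))
            val_loss += criterion(out, y.to(device)).item()
    val_loss /= len(val_loader)  # mean loss per batch
```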
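
The `CosineAnnealingLR` schedule used in the loop follows a simple closed form when stepped from 0 to `T_max`. A standalone sketch of that formula, with the loop's `lr=3e-4` and `T_max=50` as defaults, looks like this. The function name and the `eta_min=0.0` floor are assumptions of this sketch.

```python
import math


def cosine_annealing_lr(step, base_lr=3e-4, t_max=50, eta_min=0.0):
    """Learning rate after `step` scheduler steps, for 0 <= step <= t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))


schedule = [cosine_annealing_lr(s) for s in range(51)]
print(schedule[0], schedule[25], schedule[50])
```

The rate starts at the base value, reaches half of it at the midpoint, and decays to `eta_min` at step `T_max`, so `T_max` should normally match the number of times `scheduler.step()` is called.
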

| Architecture | Year | Innovation | Use today |
|---|---|---|---|
| ResNet-50 | 2015 | Skip connections (residual blocks) | Transfer learning backbone |
| VGG-16 | 2014 | Deep stacks of small 3×3 convolutions | Feature extraction |
| EfficientNet | 2019 | Compound scaling (width/depth/res) | Edge deployment |
| BERT | 2018 | Bidirectional masked LM pre-training | Text classification, NER |
| GPT-2/3/4 | 2019+ | Causal LM pre-training at scale | Text generation |
| ViT | 2020 | Image patches as tokens | Image classification |
| Stable Diffusion | 2022 | Latent diffusion model | Image generation |
| LLaMA | 2023 | Open-weights GPT-class model | Local LLM, fine-tuning |
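
For the ViT row, "image patches as tokens" comes down to a little arithmetic: the image is cut into a grid of non-overlapping patches, and each patch is flattened into one vector. The sketch below assumes a 224×224 RGB input and 16×16 patches, a common configuration that is not taken from the table. The function name is hypothetical.

```python
def vit_token_shape(image_size=224, patch_size=16, channels=3):
    """Number of patch tokens and the flattened length of each patch."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side, patch_size * patch_size * channels


num_patches, patch_dim = vit_token_shape()
print(num_patches, patch_dim)
```

Each flattened patch is then linearly projected to the model's embedding width before entering the transformer encoder.
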