Deep Learning Layers & Architectures

Layers · Activations · Optimisers · Architectures · PyTorch code
mitraaiprojects.com

Core Layers (PyTorch)

Layer | Use | Key Params
nn.Linear(in, out) | Fully connected | in_features, out_features
nn.Conv2d(in, out, k) | 2D convolution | in_channels, out_channels, kernel_size, stride, padding
nn.MaxPool2d(k) | Spatial downsampling | kernel_size, stride
nn.AvgPool2d(k) | Average pooling | kernel_size
nn.BatchNorm2d(ch) | Normalise per channel (across the batch) | num_features
nn.LayerNorm(d) | Normalise per sample | normalized_shape
nn.Dropout(p) | Regularisation | p (default 0.5)
nn.Embedding(n, d) | Learned token embeddings | num_embeddings, embedding_dim
nn.LSTM(in, h, L) | Sequence modelling | input_size, hidden_size, num_layers
nn.GRU(in, h, L) | Efficient sequence modelling | same as LSTM; about 25% fewer params (3 gate blocks vs 4)
nn.MultiheadAttention | Self-attention | embed_dim, num_heads
nn.TransformerEncoder | Stack of encoder layers | encoder_layer, num_layers
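
To show how several of these layers fit together, here is a minimal sketch of a small image classifier. The class name `TinyCNN`, the channel counts, and the 32×32 input size are illustrative assumptions, not part of the cheatsheet.

```python
import torch
from torch import nn


class TinyCNN(nn.Module):
    """Hypothetical example: conv -> batchnorm -> relu -> pool -> dropout -> linear."""

    def __init__(self, in_ch=3, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=3, padding=1),  # padding=1 keeps H x W
            nn.BatchNorm2d(16),                               # one mean/var per channel
            nn.ReLU(),
            nn.MaxPool2d(2),                                  # halves H and W
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),
            nn.Linear(16 * 16 * 16, num_classes),  # 16 channels x 16 x 16 after pooling
        )

    def forward(self, x):
        return self.classifier(self.features(x))


model = TinyCNN()
x = torch.randn(4, 3, 32, 32)  # batch of 4 RGB images, 32 x 32 (assumed size)
logits = model(x)
print(logits.shape)  # torch.Size([4, 10])
```

The `Linear` input size depends on the assumed 32×32 input; a different resolution would need a different value or an adaptive pooling layer before `Flatten`.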

Activation Functions

Activation | Formula | Use
ReLU | max(0, x) | Default for hidden layers; fast, gradient does not vanish for x > 0
LeakyReLU | x if x > 0 else 0.01x | Mitigates the dying ReLU problem
GELU | x·Φ(x) | Transformers; smoother than ReLU
Sigmoid | 1 / (1 + e^(−x)) | Binary output only (saturates)
Tanh | (e^x − e^(−x)) / (e^x + e^(−x)) | Zero-centred, RNNs
Softmax | e^(x_i) / Σ_j e^(x_j) | Multi-class output (probabilities)
Swish | x·sigmoid(x) | EfficientNet; non-monotonic
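
As a sanity check, the formulas in the table can be written out by hand and compared with PyTorch's built-in functions. This is only a sketch: the input grid is arbitrary, Φ is taken to be the standard normal CDF, and Swish is compared against `F.silu`, which is PyTorch's name for x·sigmoid(x).

```python
import math

import torch
import torch.nn.functional as F

# Arbitrary test points; float64 keeps rounding differences negligible.
x = torch.linspace(-3.0, 3.0, steps=13, dtype=torch.float64)

relu = torch.clamp(x, min=0.0)                        # max(0, x)
leaky = torch.where(x > 0, x, 0.01 * x)               # x if x > 0 else 0.01x
phi = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))     # standard normal CDF
gelu = x * phi                                        # x * Phi(x)
sigmoid = 1.0 / (1.0 + torch.exp(-x))
tanh = (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))
softmax = torch.exp(x) / torch.exp(x).sum()           # over the whole vector
swish = x * sigmoid

checks = {
    "relu": torch.allclose(relu, F.relu(x)),
    "leaky_relu": torch.allclose(leaky, F.leaky_relu(x, negative_slope=0.01)),
    "gelu": torch.allclose(gelu, F.gelu(x)),
    "sigmoid": torch.allclose(sigmoid, torch.sigmoid(x)),
    "tanh": torch.allclose(tanh, torch.tanh(x)),
    "softmax": torch.allclose(softmax, F.softmax(x, dim=0)),
    "swish": torch.allclose(swish, F.silu(x)),
}
print(checks)
```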

Loss Functions

Loss | Task | PyTorch
MSE | Regression | nn.MSELoss()
MAE / L1 | Regression (robust) | nn.L1Loss()
BCE | Binary classification | nn.BCEWithLogitsLoss()
Cross-Entropy | Multi-class | nn.CrossEntropyLoss()
KL Divergence | Distributions | nn.KLDivLoss()
Contrastive | Similarity learning | custom
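
A common source of bugs is what each loss expects as input. The sketch below, using random tensors as stand-in data, illustrates that `nn.BCEWithLogitsLoss` and `nn.CrossEntropyLoss` take raw logits (no sigmoid or softmax beforehand), while `nn.KLDivLoss` takes log-probabilities as input and probabilities as target.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)

# Binary classification: raw logits in, float targets in {0, 1}.
logits_bin = torch.randn(8)
targets_bin = torch.randint(0, 2, (8,)).float()
bce = nn.BCEWithLogitsLoss()(logits_bin, targets_bin)
bce_manual = F.binary_cross_entropy(torch.sigmoid(logits_bin), targets_bin)

# Multi-class: raw logits of shape (batch, classes), integer class targets.
logits_mc = torch.randn(8, 5)
targets_mc = torch.randint(0, 5, (8,))
ce = nn.CrossEntropyLoss()(logits_mc, targets_mc)
ce_manual = F.nll_loss(F.log_softmax(logits_mc, dim=1), targets_mc)

# KL divergence: input is log q, target is p; "batchmean" averages over the batch.
log_q = F.log_softmax(torch.randn(8, 5), dim=1)
p = F.softmax(torch.randn(8, 5), dim=1)
kl = nn.KLDivLoss(reduction="batchmean")(log_q, p)
kl_manual = (p * (p.log() - log_q)).sum(dim=1).mean()

print(float(bce), float(ce), float(kl))
```

In each pair the built-in loss and the hand-written version should agree up to floating-point rounding.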

Optimisers

Optimiser | Best for | Key Params
Adam | Default; most problems | lr=3e-4, betas=(0.9, 0.999)
AdamW | Transformers; decoupled weight decay | lr=3e-4, weight_decay=0.01
SGD+momentum | CNNs; often better final accuracy | lr=0.01, momentum=0.9
RMSProp | RNNs, non-stationary objectives | lr=1e-3
Lion | Large models; memory efficient | lr=3e-5
LR scheduling: CosineAnnealingLR, ReduceLROnPlateau, OneCycleLR (all in torch.optim.lr_scheduler)
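
To illustrate how a scheduler changes the learning rate, here is a small sketch that pairs AdamW with `CosineAnnealingLR` and records the rate at the start of each epoch. The single dummy parameter and `T_max=10` are assumptions chosen to keep the example short.

```python
import torch

# A dummy parameter stands in for model.parameters().
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

lrs = []
for epoch in range(10):
    lrs.append(optimizer.param_groups[0]["lr"])  # LR used during this epoch
    optimizer.step()   # no gradients were computed, so nothing is updated
    scheduler.step()   # advance the schedule once per epoch

final_lr = optimizer.param_groups[0]["lr"]
print([f"{lr:.2e}" for lr in lrs], f"{final_lr:.2e}")
```

The rate starts at 3e-4, passes through half that value midway, and reaches roughly zero after `T_max` steps. `ReduceLROnPlateau` is driven differently: its `step` takes the monitored metric, for example `scheduler.step(val_loss)`.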

Regularisation Techniques

Technique | Effect | Where
Dropout | Random deactivation of units | Dense/FC layers
L2 / Weight Decay | Penalise large weights | optimizer(..., weight_decay=0.01)
BatchNorm | Normalise activations; batch statistics add mild noise | After conv/linear
Data Augmentation | More effective training data | Image transforms
Early Stopping | Stop before overfitting | Monitor val_loss
Gradient Clipping | Prevent exploding gradients | clip_grad_norm_(model.parameters(), 1.0)
Label Smoothing | Soften targets | CrossEntropyLoss(label_smoothing=0.1)
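
Early stopping has no built-in PyTorch class, so it is usually written by hand. The helper below is one possible sketch; the class name `EarlyStopping`, the `patience` and `min_delta` parameters, and the loss history are all hypothetical.

```python
class EarlyStopping:
    """Hypothetical helper: stop when val_loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: remember it and reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1      # no improvement this epoch
        return self.bad_epochs >= self.patience


# Made-up validation losses, one per epoch.
history = [0.90, 0.70, 0.65, 0.66, 0.67, 0.60]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_loss in enumerate(history):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # 4 0.65
```

With `patience=2` the loop stops at epoch index 4, so the later improvement to 0.60 is never seen. A larger patience trades extra training time for a lower risk of stopping too soon.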

PyTorch Training Loop Template

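import torch
from torch import nn

# MyModel, train_loader and val_loader are assumed to be defined elsewhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_epochs = 50  # example value

model = MyModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
# Anneal the learning rate over the whole run, stepping once per epoch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    # Training pass
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        out = model(X)
        loss = criterion(out, y)
        loss.backward()
        # Clip the global gradient norm to 1.0 before the update.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    scheduler.step()

    # Validation pass: eval mode, no gradient tracking
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for X, y in val_loader:
            X, y = X.to(device), y.to(device)
            out = model(X)
            val_loss += criterion(out, y).item()
    val_loss /= len(val_loader)  # mean loss per validation batch
    print(f"epoch {epoch + 1}: val_loss={val_loss:.4f}")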
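
The template relies on names defined elsewhere: `MyModel`, `train_loader`, `val_loader`, plus `device` and `num_epochs` if they are not already set. For a quick dry run, a synthetic setup along these lines could supply them. The two-layer model, the random data, and all sizes are placeholders, not a recommended configuration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_epochs = 5  # placeholder value for a dry run


class MyModel(nn.Module):
    """Placeholder classifier: 20 input features, 3 classes."""

    def __init__(self, in_features=20, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.net(x)


# Random features and labels standing in for a real dataset.
X_all = torch.randn(200, 20)
y_all = torch.randint(0, 3, (200,))
train_ds, val_ds = random_split(TensorDataset(X_all, y_all), [160, 40])
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=32)

# One forward pass to confirm shapes before running the full loop.
xb, yb = next(iter(train_loader))
out = MyModel().to(device)(xb.to(device))
print(out.shape)  # torch.Size([32, 3])
```

Because the labels are random, the loss will not fall meaningfully; the point is only to check that the loop runs end to end.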

Famous Architectures

Architecture | Year | Innovation | Use today
ResNet-50 | 2015 | Skip connections (residual blocks) | Transfer learning backbone
VGG-16 | 2014 | Very deep 3×3 convolutions | Feature extraction
EfficientNet | 2019 | Compound scaling (width/depth/res) | Edge deployment
BERT | 2018 | Bidirectional masked LM pre-training | Text classification, NER
GPT-2/3/4 | 2019+ | Causal LM pre-training at scale | Text generation
ViT | 2020 | Image patches as tokens | Image classification
Stable Diffusion | 2022 | Latent diffusion model | Image generation
LLaMA | 2023 | Open-weights GPT-class model | Local LLM, fine-tuning
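
Two of the innovations in the table are compact enough to sketch in code: the skip connection from ResNet and the patches-as-tokens idea from ViT. Both classes below are simplified illustrations with assumed names and sizes. The residual block is the basic two-convolution variant rather than the bottleneck block used in ResNet-50, and the patch embedding omits the class token and position embeddings.

```python
import torch
import torch.nn.functional as F
from torch import nn


class BasicResidualBlock(nn.Module):
    """Simplified residual block: output = relu(f(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # skip connection: add the input back


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to a token."""

    def __init__(self, in_ch=3, patch_size=8, embed_dim=64):
        super().__init__()
        # A conv with kernel_size == stride applies one linear map per patch.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, D)


feature_map = torch.randn(2, 16, 8, 8)
block_out = BasicResidualBlock(16)(feature_map)
print(block_out.shape)  # torch.Size([2, 16, 8, 8])

images = torch.randn(2, 3, 32, 32)
tokens = PatchEmbedding()(images)
print(tokens.shape)  # torch.Size([2, 16, 64])
```

The token sequence has the same layout as embedded text tokens, which is what lets a standard Transformer encoder process images.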