foundations
What is RL
Agent, environment, reward, policy: the RL framework
RL vs Supervised Learning
| | Supervised | Reinforcement |
|---|---|---|
| Feedback | Immediate label | Delayed reward |
| Data | Fixed labelled dataset | Collected by agent |
| Goal | Minimise prediction error | Maximise cumulative reward |
| Examples | Classification, regression | Games, robotics, LLM alignment |
Core Components
Agent: The decision maker.
Environment: What the agent interacts with.
State (s): Current situation.
Action (a): What the agent does.
Reward (r): Feedback signal.
Policy (π): Maps states to actions.
Value function: Expected cumulative future reward from a state.
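To make the components above concrete, here is a minimal sketch of the agent-environment loop. The `CoinFlipEnv` environment, the `copy_policy` function and the `run_episode` helper are hypothetical names invented for this illustration; they are not part of any library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the state is a random bit, and the
    agent is rewarded for choosing the action equal to that bit."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0
        self.state = 0

    def reset(self):
        self.steps = 0
        self.state = self.rng.randint(0, 1)
        return self.state

    def step(self, action):
        # Reward 1 when the action matches the current state, else 0.
        reward = 1.0 if action == self.state else 0.0
        self.steps += 1
        self.state = self.rng.randint(0, 1)
        done = self.steps >= self.episode_length
        return self.state, reward, done


def copy_policy(state):
    """A fixed policy: map each state to the action with the same value."""
    return state


def run_episode(env, policy, gamma=0.99):
    """Roll out one episode and return the discounted cumulative reward."""
    state = env.reset()
    total, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                   # agent decides
        state, reward, done = env.step(action)   # environment responds
        total += discount * reward
        discount *= gamma
    return total
```

Calling `run_episode(CoinFlipEnv(episode_length=3), copy_policy, gamma=1.0)` collects a reward of 1 on each of the three steps, because this policy always matches the state.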
tabular rl
Q-Learning
Learn Q(s,a) values through trial and error in discrete spaces
Bellman Equation
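For reference, the Bellman optimality equation and the sampled update derived from it can be written as follows. The symbols alpha (learning rate) and gamma (discount factor) match the code in q_update.py; s' and a' for the next state and next action are notation assumed here.

```latex
Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \right]

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

The second line moves the current estimate a fraction alpha of the way towards a one-sample estimate of the right-hand side of the first.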
q_update.py
import numpy as np

# Assumes Q is an (n_states, n_actions) NumPy array and that state, action,
# reward, next_state, epsilon and n_actions come from the surrounding loop.

# Q-Learning update rule
Q[state, action] += alpha * (
    reward + gamma * np.max(Q[next_state]) - Q[state, action]
)
# alpha = learning rate (0.1)
# gamma = discount factor (0.99)

# Choose action: epsilon-greedy
if np.random.rand() < epsilon:
    action = np.random.randint(n_actions)  # explore
else:
    action = np.argmax(Q[state])  # exploit
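The snippet above shows a single update step. As a rough end-to-end sketch, the same rule can be run on a tiny corridor environment. The corridor itself, the function name `train_q_learning` and the `max_steps` cap are assumptions made for illustration, and plain Python lists stand in for the NumPy array so the example runs without extra dependencies.

```python
import random


def train_q_learning(n_states=5, episodes=500, max_steps=100,
                     alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor.

    States are 0..n_states-1; action 0 moves left, action 1 moves right.
    Reaching the last state ends the episode with reward 1.
    """
    n_actions = 2
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    goal = n_states - 1

    for episode in range(episodes):
        state = 0
        for step in range(max_steps):  # cap so every episode terminates
            # Epsilon-greedy action selection, ties broken at random
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)
            else:
                best = max(Q[state])
                action = rng.choice(
                    [a for a in range(n_actions) if Q[state][a] == best]
                )

            # Corridor dynamics: walls at both ends
            if action == 0:
                next_state = max(0, state - 1)
            else:
                next_state = min(goal, state + 1)
            done = next_state == goal
            reward = 1.0 if done else 0.0

            # Q-learning update; no bootstrapping from a terminal state
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])

            state = next_state
            if done:
                break
    return Q
```

After training, the learned values should prefer moving right in every non-terminal state, since that is the shortest route to the only reward.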
neural q-learning
Deep Q Networks
Neural network approximates Q-values for large state spaces
DQN Key Innovations
| Innovation | Purpose |
|---|---|
| Experience Replay | Break correlation between consecutive samples |
| Target Network | Stable Q-value targets; updated every N steps |
| Epsilon-greedy | Balance exploration vs exploitation |
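The first two rows of the table can be sketched in a few lines of plain Python. The `ReplayBuffer` class and `td_target` function below are illustrative names, not the stable-baselines3 implementation, and `next_q_values` is assumed to come from a separate, periodically synced copy of the network.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal ordering
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Regression target for one transition.

    next_q_values are assumed to be the target network's Q-values for the
    next state, so the target does not shift with every gradient step.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

In a training loop, each environment step would be pushed into the buffer, and each gradient step would draw a random batch and regress the online network's Q-value for the taken action towards `td_target`.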
DQN Architecture
dqn.py
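# Run on Kaggle GPU with stable-baselines3
import gymnasium as gym
from stable_baselines3 import DQN

# Create the environment explicitly so it can be reused after training
env = gym.make('CartPole-v1')
model = DQN('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=100_000)

# Query the trained policy for a greedy action
obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)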
direct policy optimisation
Policy Gradients
Directly optimise π(a|s); handles continuous actions
REINFORCE Algorithm
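The loop in reinforce.py calls a `compute_returns` helper that is not shown. One plausible sketch, assuming the helper takes a list of per-step rewards and returns the discounted return from each step onwards:

```python
def compute_returns(rewards, gamma=0.99):
    """Discounted return G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one after it
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


print(compute_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

This version returns a plain list; a PyTorch training loop would typically convert it to a tensor, and often normalise it, before computing the loss.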
reinforce.py
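import torch

# Policy gradient update (collect_episode, compute_returns, policy, optimizer defined elsewhere)
for episode in range(num_episodes):
    states, actions, rewards = collect_episode(policy)
    returns = compute_returns(rewards, gamma=0.99)
    # Assumption: policy(states) returns a torch.distributions object
    log_probs = policy(states).log_prob(actions)
    # Update: reinforce actions that led to high returns
    loss = -torch.mean(returns * log_probs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Actor-Critic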
Actor: learns π(a|s), i.e. what action to take.
Critic: learns V(s), i.e. how good the current state is.
Advantage A(s,a) = Q(s,a) - V(s): how much better was this action?
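In practice Q(s,a) is usually not learned separately. A common one-step estimate, sketched below, substitutes r + gamma * V(s') for Q(s,a), so only the critic's V is needed. The function name `td_advantage` and its arguments are assumptions for illustration.

```python
def td_advantage(reward, value, next_value, done, gamma=0.99):
    """One-step advantage estimate: r + gamma * V(s') - V(s).

    value and next_value are the critic's estimates for the current and
    next state; a terminal next state contributes no future value.
    """
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


print(td_advantage(1.0, 0.5, 2.0, done=False, gamma=0.5))  # 1.5
```

A positive result means the action turned out better than the critic expected, so the actor's probability of taking it is pushed up; a negative result pushes it down.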
proximal policy optimisation
PPO
One of the most widely used practical RL algorithms: stable and versatile
Why PPO?
Stability: Clips policy ratio to prevent catastrophic updates.
Simplicity: First-order optimisation (no second-order TRPO math).
Versatility: Works for discrete and continuous actions.
Used in: ChatGPT/InstructGPT RLHF, OpenAI Five, robotic manipulation.
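The clipping mentioned under Stability can be illustrated for a single sample. Here `ratio` stands for the new policy's probability of the action divided by the old policy's, and the default `clip_eps=0.2` is an assumed value taken from common practice; the function name is hypothetical.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate for one sample:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, raising the ratio beyond 1 + clip_eps earns no extra objective, so the optimiser has no incentive to move the policy further in one update. With a negative advantage, the same holds once the ratio drops below 1 - clip_eps.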
PPO with stable-baselines3
ppo_train.py
# pip install stable-baselines3[extra] gymnasium
from stable_baselines3 import PPO
import gymnasium as gym
env = gym.make('CartPole-v1')
model = PPO('MlpPolicy', env, verbose=1, learning_rate=3e-4)
model.learn(total_timesteps=500_000)
model.save('ppo_cartpole')
applications
Real-World RL
Recommendations, RLHF, robotics, resource management
Real-World Applications
| Domain | RL Application |
|---|---|
| LLM Alignment | RLHF: optimise models towards human preferences (ChatGPT, Claude) |
| Recommendations | Optimise long-term engagement (YouTube, TikTok) |
| Robotics | Walking, manipulation (Boston Dynamics, DeepMind) |
| Ad bidding | Optimise bids across millions of auctions |
| HVAC control | Reported reductions of around 30% or more in cooling energy in Google data centres |
| Chip design | Google AlphaChip generates chip floorplans reported to match or beat human-designed layouts |
RLHF Pipeline
Step 1: Supervised fine-tuning (SFT) on human demonstrations.
Step 2: Human raters compare model outputs. Train a reward model.
Step 3: PPO fine-tunes the LLM to maximise the reward model's score.
Result: Helpful, harmless, honest AI assistants.
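Step 2 is often implemented with a pairwise preference loss: the reward model should score the output the rater preferred above the one they rejected. The sketch below assumes that formulation, -log(sigmoid(chosen - rejected)), which the text above does not spell out; `preference_loss` is a hypothetical name and the scores are plain floats rather than model outputs.

```python
import math


def preference_loss(score_chosen, score_rejected):
    """Pairwise loss -log(sigmoid(chosen - rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss is small when the preferred output already scores well above the rejected one, and grows roughly linearly with the gap when the ranking is wrong. Step 3 then treats the trained reward model's score as the reward signal for PPO.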