Reinforcement Learning
foundations

What is RL

Agent, environment, reward, policy: the RL framework

RL vs Supervised Learning

Aspect | Supervised | Reinforcement
Feedback | Immediate label | Delayed reward
Data | Fixed labelled dataset | Collected by agent
Goal | Minimise prediction error | Maximise cumulative reward
Examples | Classification, regression | Games, robotics, LLM alignment

Core Components

Agent: The decision maker.
Environment: What the agent interacts with.
State (s): Current situation.
Action (a): What the agent does.
Reward (r): Feedback signal.
Policy (π): Maps states to actions.
Value function: Expected cumulative (discounted) future reward from a state when following a given policy.
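
The components above interact in a simple loop: the agent observes a state, picks an action from its policy, and the environment answers with a reward and the next state. The sketch below illustrates that loop on a made-up toy problem. `CoinFlipEnv`, `random_policy`, and `run_episode` are hypothetical names introduced only for this illustration; they are not part of the course material or of any library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip at each of 5 steps."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # state: the current step index

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0  # reward: feedback signal
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done


def random_policy(state, rng):
    """Policy: maps a state to an action (this one ignores the state)."""
    return rng.randint(0, 1)


def run_episode(env, policy, seed=1):
    rng = random.Random(seed)
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state, rng)             # agent decides
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward                  # cumulative reward
    return total_reward


env = CoinFlipEnv()
episode_return = run_episode(env, random_policy)
print(episode_return)
```

A learning algorithm would replace `random_policy` with a policy that improves from the rewards it observes.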

tabular rl

Q-Learning

Learn Q(s,a) values through trial and error in discrete spaces

Bellman Equation
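
For reference, the update rule in q_update.py can be read as a sampled, incremental form of the Bellman optimality equation. The notation here is the standard textbook one and is an assumption, since the page does not write the equation out: s' is the next state, a' ranges over actions available in s', α is the learning rate, and γ is the discount factor.

```latex
% Bellman optimality equation for the action-value function
Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \right]

% Q-learning update: move Q(s, a) towards the sampled target
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```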

q_update.py
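# Q-Learning: epsilon-greedy action selection and update rule
import numpy as np

n_states, n_actions = 16, 4   # example table size (assumed for illustration)
alpha = 0.1                   # learning rate
gamma = 0.99                  # discount factor
epsilon = 0.1                 # exploration rate (example value)
Q = np.zeros((n_states, n_actions))

def choose_action(state):
    # Choose action: epsilon-greedy
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)  # explore
    return int(np.argmax(Q[state]))          # exploit

def q_update(state, action, reward, next_state):
    # Move Q(s, a) towards the TD target r + gamma * max_a' Q(s', a')
    Q[state, action] += alpha * (
        reward + gamma * np.max(Q[next_state]) - Q[state, action]
    )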
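
To see the update rule and epsilon-greedy selection working together, here is a minimal sketch of a full training loop. It uses only the standard library and a made-up five-state chain in which the agent starts at the left end and is rewarded for reaching the right end. The environment, the constants, and the function names `step` and `train` are assumptions for illustration only.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (hypothetical toy chain)
ACTIONS = [0, 1]      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1


def step(state, action):
    """Toy dynamics: reward 1.0 on reaching the right end, otherwise 0."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection (ties are broken at random)
            if rng.random() < EPSILON or Q[state][0] == Q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if Q[state][0] > Q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Q-learning update; do not bootstrap from a terminal state
            target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
            Q[state][action] += ALPHA * (target - Q[state][action])
            state = next_state
    return Q


Q = train()
greedy_policy = [0 if q[0] > q[1] else 1 for q in Q[:-1]]
print(greedy_policy)  # [1, 1, 1, 1]: move right in every non-terminal state
```

The `done` check on the target is not in the snippet above; it is a common addition so that terminal states contribute no future value.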

neural q-learning

Deep Q Networks

Neural network approximates Q-values for large state spaces

DQN Key Innovations

Innovation | Purpose
Experience Replay | Break correlation between consecutive samples
Target Network | Stable Q-value targets, updated every N steps
Epsilon-greedy | Balance exploration vs exploitation
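
As an illustration of the first row, experience replay can be sketched with a bounded queue and uniform random sampling. This is a minimal standard-library version, not the implementation used by any particular DQN library; the class name `ReplayBuffer` and its methods are assumptions for this example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and new experience,
        # which breaks the correlation between consecutive steps
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

batch = buffer.sample(batch_size=2)
print(len(buffer), len(batch))  # 3 2
```

Because the capacity is 3, only the three most recent transitions (states 2, 3, and 4) remain available for sampling.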

DQN Architecture

dqn.py
# Run on Kaggle GPU with stable-baselines3
import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make('CartPole-v1')
model = DQN('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=100_000)

# Query the trained policy for a greedy action
obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)

direct policy optimisation

Policy Gradients

Directly optimise π(a|s); handles continuous actions

REINFORCE Algorithm

reinforce.py
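# Policy gradient update (REINFORCE)
# Assumes policy, optimizer, num_episodes, collect_episode and compute_returns are defined elsewhere.
import torch

for episode in range(num_episodes):
    # log_probs: log pi(a_t|s_t) for each action taken, kept on the autograd graph
    states, actions, rewards, log_probs = collect_episode(policy)
    returns = torch.as_tensor(compute_returns(rewards, gamma=0.99))  # discounted returns G_t

    # Update: reinforce actions that led to high returns
    loss = -torch.mean(returns * log_probs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()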
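
The snippet above calls `compute_returns` without defining it. One plausible version, shown here as an assumption rather than the course's own helper, computes the discounted return G_t = r_t + gamma * G_{t+1} by walking the reward list backwards.

```python
def compute_returns(rewards, gamma=0.99):
    """Discounted return for every time step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running  # G_t = r_t + gamma * G_{t+1}
        returns.append(running)
    returns.reverse()
    return returns


example_returns = compute_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example_returns)  # [1.75, 1.5, 1.0]
```

This version returns a plain Python list, so a PyTorch training loop would convert it to a tensor before multiplying by the log-probabilities.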

Actor-Critic

Actor: learns π(a|s), i.e. what action to take.
Critic: learns V(s), i.e. how good the current state is.
Advantage A(s,a) = Q(s,a) - V(s): how much better was this action?
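
Since the critic learns V(s) rather than Q(s,a), the advantage is usually estimated instead of computed exactly. A common one-step estimate replaces Q(s,a) with r + gamma * V(s'). The sketch below shows that estimate; the function name `td_advantage` and its arguments are hypothetical.

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step advantage estimate: A(s, a) ~ r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else gamma * value_next  # no future value after a terminal step
    return reward + bootstrap - value_s


advantage = td_advantage(reward=1.0, value_s=2.0, value_next=3.0, gamma=0.5)
print(advantage)  # 0.5
```

A positive value suggests the action did better than the critic expected from that state, so the actor would increase its probability.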

proximal policy optimisation

PPO

One of the most widely used practical RL algorithms: stable and versatile

Why PPO?

Stability: Clips policy ratio to prevent catastrophic updates.
Simplicity: First-order optimisation (no second-order TRPO math).
Versatility: Works for discrete and continuous actions.
Used in: ChatGPT/InstructGPT RLHF, OpenAI Five, robotic manipulation.
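
The clipping mentioned under Stability can be made concrete with the per-sample clipped surrogate objective. In this sketch, `ratio` stands for the new policy's probability of the action divided by the old policy's probability, and `clip_eps=0.2` is a commonly used default rather than a value given on this page. The function name is hypothetical.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = ppo_clipped_objective(ratio=1.5, advantage=1.0)
penalty_kept = ppo_clipped_objective(ratio=0.5, advantage=-1.0)
print(gain_capped, penalty_kept)
```

In the first call the objective is capped near 1.2 instead of 1.5, so there is no extra incentive to move the policy further in one update. Taking the minimum keeps the estimate pessimistic, which is what limits large policy changes.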

PPO with stable-baselines3

ppo_train.py
# pip install stable-baselines3[extra] gymnasium
from stable_baselines3 import PPO
import gymnasium as gym

env = gym.make('CartPole-v1')
model = PPO('MlpPolicy', env, verbose=1, learning_rate=3e-4)
model.learn(total_timesteps=500_000)
model.save('ppo_cartpole')

applications

Real-World RL

Recommendations, RLHF, robotics, resource management

Real-World Applications

Domain | RL Application
LLM Alignment | RLHF: optimise for human preferences (ChatGPT, Claude)
Recommendations | Optimise long-term engagement (YouTube, TikTok)
Robotics | Walking, manipulation (Boston Dynamics, DeepMind)
Ad bidding | Optimise bids across millions of auctions
HVAC control | Roughly 30% or more cooling-energy reduction reported for Google data centres
Chip design | Google AlphaChip generates chip layouts reported to match or beat human-designed ones

RLHF Pipeline

Step 1: Supervised fine-tuning (SFT) on human demonstrations.
Step 2: Human raters compare model outputs. Train a reward model.
Step 3: PPO fine-tunes the LLM to maximise the reward model's score.
Intended result: AI assistants that are more helpful, harmless, and honest.
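
Step 2 turns pairwise comparisons into a training signal for the reward model. The page does not say which loss is used; a common choice, assumed here, is a Bradley-Terry style pairwise loss that pushes the score of the preferred output above the score of the rejected one. The function name and arguments are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


agrees = pairwise_preference_loss(2.0, 0.0)     # reward model agrees with the rater
disagrees = pairwise_preference_loss(0.0, 2.0)  # reward model disagrees with the rater
print(agrees < disagrees)  # True
```

The loss is small when the reward model ranks the pair the same way the human rater did and grows as the ranking is reversed.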

Reinforcement Learning Complete!

Pass all topic quizzes to earn your certificate.
