Reinforcement Learning — Mitra AI Projects

foundations

What is RL

Agent, environment, reward, policy — the RL framework

RL vs Supervised Learning

Aspect	Supervised	Reinforcement
Feedback	Immediate label	Delayed reward
Data	Fixed labelled dataset	Collected by agent
Goal	Minimise prediction error	Maximise cumulative reward
Examples	Classification, regression	Games, robotics, LLM alignment

Core Components

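Agent: The decision maker.

Environment: What the agent interacts with.

State (s): The current situation as observed by the agent.

Action (a): What the agent does.

Reward (r): Scalar feedback signal received after an action.

Policy (π): Maps states to actions (or to a probability distribution over actions).

Value function: Expected cumulative future reward from a state when following a given policy.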
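To make the components concrete, here is a minimal sketch of the agent-environment loop in plain Python. The `CorridorEnv` class, its reward values, and `random_policy` are hypothetical illustrations, not part of any library or of the course notebook.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, reaching 4 ends the episode."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = move left, action 1 = move right (clamped to the corridor)
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1  # small step cost, bonus at the goal
        return self.state, reward, done


def random_policy(state):
    """Policy: maps a state to an action (this one ignores the state)."""
    return random.choice([0, 1])


def run_episode(env, policy, max_steps=1000):
    """Run one episode and return (cumulative reward, number of steps)."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(state)                   # agent acts
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


random.seed(0)
total, steps = run_episode(CorridorEnv(), random_policy)
print(f"episode finished in {steps} steps, return = {total:.1f}")
```

Swapping `random_policy` for `lambda s: 1` (always move right) reaches the goal in four steps, which shows how the policy alone determines the cumulative reward.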

tabular rl

Q-Learning

Learn Q(s,a) values through trial and error in discrete spaces

Bellman Equation
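For reference, the Bellman optimality equation for action values and the sample-based Q-learning update derived from it can be written as follows. This is standard textbook notation; the transition distribution P is an assumed symbol that the page itself does not name.

```latex
Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\right]

Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]
```

Here α is the learning rate and γ the discount factor. The bracketed term in the second line is the temporal-difference error.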

q_update.py

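import numpy as np

# Assumes Q is a NumPy array of shape (n_states, n_actions), and that
# epsilon, n_actions, state, reward and next_state come from the training loop.
# alpha = learning rate (0.1)
# gamma = discount factor (0.99)

# Choose action: epsilon-greedy
if np.random.rand() < epsilon:
    action = np.random.randint(n_actions)  # explore: random action
else:
    action = np.argmax(Q[state])           # exploit: best known action

# Q-Learning update rule, applied after the environment returns
# reward and next_state for the chosen action
Q[state, action] += alpha * (
    reward + gamma * np.max(Q[next_state]) - Q[state, action]
)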
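As a rough end-to-end illustration, the sketch below runs the same update rule inside a complete training loop on a tiny corridor problem. It uses plain Python lists instead of NumPy so it runs without dependencies. The environment, the constants `N_STATES` and `EPSILON`, and the helper names are all hypothetical.

```python
import random

N_STATES, N_ACTIONS = 5, 2      # hypothetical 5-cell corridor; 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1


def step(state, action):
    """Toy dynamics: reaching the last cell gives reward 1 and ends the episode."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy(q_row):
    """Index of the largest Q-value, breaking ties at random."""
    best = max(q_row)
    return random.choice([a for a, q in enumerate(q_row) if q == best])


def train(episodes=500, seed=0):
    random.seed(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if random.random() < EPSILON:
                action = random.randrange(N_ACTIONS)   # explore
            else:
                action = greedy(Q[state])              # exploit
            next_state, reward, done = step(state, action)
            # Terminal transitions have no future value to bootstrap from
            target = reward if done else reward + GAMMA * max(Q[next_state])
            Q[state][action] += ALPHA * (target - Q[state][action])
            state = next_state
    return Q


Q = train()
policy = [greedy(row) for row in Q[:-1]]   # greedy action in each non-terminal state
print(policy)  # [1, 1, 1, 1]: always move right
```

One detail the bare update rule leaves implicit is the terminal case: when the episode ends, the target is just the reward, with no bootstrap from the next state.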

neural q-learning

Deep Q Networks

Neural network approximates Q-values for large state spaces

DQN Key Innovations

Innovation	Purpose
Experience Replay	Break correlation between consecutive samples
Target Network	Stable Q-value targets — updated every N steps
Epsilon-greedy	Balance exploration vs exploitation
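The three mechanisms in the table can each be sketched in a few lines of plain Python. This is an illustrative outline only: the class and function names, the linear epsilon schedule, and all numeric values are assumptions, and network weights are stood in for by a plain dictionary.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest are dropped when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def epsilon_at(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from start to end (values are illustrative)."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


TARGET_UPDATE_EVERY = 1_000   # hypothetical value for N


def maybe_sync_target(step, online_params, target_params):
    """Copy the online weights into the target network every N steps."""
    if step % TARGET_UPDATE_EVERY == 0:
        target_params.clear()
        target_params.update(online_params)
```

In a training loop, each environment step would push one transition, sample a minibatch once the buffer holds enough data, and call `maybe_sync_target` with the current step count.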

DQN Architecture

dqn.py

# Run on Kaggle GPU with stable-baselines3
import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make('CartPole-v1')
model = DQN('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=100_000)

# Query the trained policy for a greedy action on a fresh observation
obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)

direct policy optimisation

Policy Gradients

Directly optimise π(a|s) — handles continuous actions

REINFORCE Algorithm

reinforce.py

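import torch

# Policy gradient update. collect_episode and compute_returns are helpers
# not shown; collect_episode is assumed to also return the log-probabilities
# of the sampled actions, and optimizer to be a torch.optim optimiser.
for episode in range(num_episodes):
    states, actions, rewards, log_probs = collect_episode(policy)
    returns = compute_returns(rewards, gamma=0.99)

    # Update: reinforce actions that led to high returns
    loss = -torch.mean(returns * log_probs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()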
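The snippet relies on a `compute_returns` helper that the page does not show. One plausible version, sketched here in plain Python, accumulates discounted rewards backwards through the episode. The `normalise` helper is an optional extra, included as an assumption about common practice rather than as part of the original code.

```python
def compute_returns(rewards, gamma=0.99):
    """Discounted return G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def normalise(values, eps=1e-8):
    """Optional variance-reduction step: zero mean, unit standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / (std + eps) for v in values]


print(compute_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

To use the result in a PyTorch loss, the list would first be converted with `torch.tensor(...)` so that it can be multiplied with the log-probabilities.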

Actor-Critic

Actor: learns π(a|s), i.e. which action to take.

Critic: learns V(s), i.e. how good the current state is.

Advantage A(s,a) = Q(s,a) - V(s): how much better the chosen action was than the policy's average behaviour in that state.
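In practice Q(s,a) is usually not learned separately. A common approach estimates the advantage from the critic alone via the one-step temporal-difference error, r + γV(s') - V(s). A minimal sketch follows, with a hypothetical function name and example values:

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    # No bootstrap from the next state when the episode has ended
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


# The critic predicted V(s) = 1.0; the step gave reward 1.0 and led to V(s') = 2.0
advantage = td_advantage(reward=1.0, value_s=1.0, value_next=2.0, gamma=0.5)
print(advantage)  # 1.0
```

A positive value means the action turned out better than the critic expected, so the actor increases its probability. A negative value pushes the probability down.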

proximal policy optimisation

PPO

One of the most widely used practical RL algorithms: stable and versatile

Why PPO?

Stability: Clips the policy probability ratio to prevent catastrophic updates.

Simplicity: First-order optimisation (no second-order TRPO math).

Versatility: Works for discrete and continuous actions.

Used in: ChatGPT/InstructGPT RLHF, OpenAI Five, robotic manipulation.
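The clipping mentioned under stability can be illustrated for a single sample. The sketch below computes the clipped surrogate objective, min(r·A, clip(r, 1-ε, 1+ε)·A), where r is the ratio of new to old action probability. The function name and the default ε = 0.2 are assumptions for illustration.

```python
import math


def ppo_clip_objective(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Clipped surrogate for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(new_log_prob - old_log_prob)   # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy makes the action 1.5x more likely. With a positive advantage,
# the objective credits the increase only up to the clip limit of 1.2.
print(ppo_clip_objective(math.log(1.5), 0.0, advantage=2.0))
```

Because the objective stops improving once the ratio leaves the interval [1-ε, 1+ε] in the favourable direction, the optimiser has no incentive to move the policy far from the one that collected the data.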

PPO with stable-baselines3

ppo_train.py

# pip install stable-baselines3[extra] gymnasium
from stable_baselines3 import PPO
import gymnasium as gym

env = gym.make('CartPole-v1')
model = PPO('MlpPolicy', env, verbose=1, learning_rate=3e-4)
model.learn(total_timesteps=500_000)
model.save('ppo_cartpole')

applications

Real-World RL

Recommendations, RLHF, robotics, resource management

Real-World Applications

Domain	RL Application
LLM Alignment	RLHF: fine-tune models towards human preferences (ChatGPT, Claude)
Recommendations	Optimise long-term engagement (YouTube, TikTok)
Robotics	Walking, manipulation (Boston Dynamics, DeepMind)
Ad bidding	Optimise bids across millions of auctions
HVAC control	30%+ reduction in cooling energy reported for Google data centres
Chip design	Google AlphaChip generates chip floorplans reported to match or beat human-designed layouts

RLHF Pipeline

Step 1: Supervised fine-tuning (SFT) on human demonstrations.

Step 2: Human raters compare model outputs, and a reward model is trained on these comparisons.

Step 3: PPO fine-tunes the LLM to maximise the reward model's score, typically with a KL penalty that keeps it close to the SFT model.

Goal: Helpful, harmless, honest AI assistants.
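Step 2 is commonly trained with a pairwise preference loss of the Bradley-Terry form, -log σ(r_chosen - r_rejected), which pushes the reward model to score the human-preferred output higher. A minimal sketch for a single comparison, with a hypothetical function name and scalar rewards standing in for model outputs:

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


print(preference_loss(2.0, 0.0))   # small loss: the preferred output already scores higher
print(preference_loss(0.0, 2.0))   # larger loss: the ranking is reversed
```

The loss depends only on the difference between the two scores, so the reward model learns a relative ranking rather than an absolute scale.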
