mlops & tools · production-minded learners

MLOps & Tools

Airflow orchestration, Spark and Big Data, MLflow experiment tracking, Kubernetes for ML, SageMaker, Docker, and CI/CD for models. Mapped to AIML-Engineering-Lab patterns and spark-learning-guide.

053 · airflow patterns

Apache Airflow

Workflow orchestration for data pipelines and ML jobs. Define pipelines as Directed Acyclic Graphs (DAGs) in Python. Schedule, monitor, and retry jobs from a web UI.

Core Concepts

Concept | Meaning
DAG | Directed Acyclic Graph: your pipeline definition
Task | A single unit of work (PythonOperator, BashOperator, etc.)
Operator | Template for a task type: Python, Bash, SQL, HTTP, etc.
Scheduler | Parses DAGs and triggers runs based on schedule intervals
XCom | Cross-communication: pass small values between tasks
Sensor | Waits for a condition (file exists, API ready, time elapsed)
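
To make the DAG, Task, and XCom ideas concrete without installing Airflow, here is a minimal sketch in plain Python. It is not how Airflow is implemented; it only assumes that a pipeline is a set of named callables plus a map of upstream dependencies. The names `run_pipeline`, `tasks`, and `upstream` are hypothetical and chosen for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, upstream):
    """Run callables in dependency order and hand each one its upstream results.

    tasks:    {task_id: callable(inputs) -> value}
    upstream: {task_id: iterable of task_ids that must finish first}
    """
    # static_order() yields every task after all of its predecessors,
    # and raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(upstream).static_order())
    results = {}
    for task_id in order:
        # Rough analogue of XCom: small return values passed downstream.
        inputs = {dep: results[dep] for dep in upstream.get(task_id, ())}
        results[task_id] = tasks[task_id](inputs)
    return order, results


# Hypothetical three-step pipeline: extract -> train -> eval.
tasks = {
    "extract": lambda inputs: [3, 1, 2],
    "train": lambda inputs: sorted(inputs["extract"]),
    "eval": lambda inputs: len(inputs["train"]),
}
upstream = {"extract": (), "train": {"extract"}, "eval": {"train"}}

order, results = run_pipeline(tasks, upstream)
print(order)
print(results["eval"])
```

A real scheduler adds what this sketch leaves out: persistence, retries, parallel execution of independent tasks, and time-based triggering.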

Code Example

ml_pipeline_dag.py
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def train_model(**ctx):
    # Placeholder: load data, train, and save the model.
    print("Training...")


def evaluate_model(**ctx):
    # Placeholder: load the saved model and compute metrics.
    print("Evaluating...")


with DAG(
    "ml_pipeline",
    schedule="0 2 * * *",  # daily at 02:00 (Airflow 2.4+; older versions use schedule_interval)
    start_date=datetime(2026, 1, 1),
    catchup=False,  # do not backfill every missed run since start_date
) as dag:
    train = PythonOperator(task_id="train", python_callable=train_model)
    eval_ = PythonOperator(task_id="eval", python_callable=evaluate_model)
    train >> eval_  # eval runs only after train succeeds

spark-learning-guide

Apache Spark & Big Data

Distributed data processing for datasets too large for a single machine. Spark keeps intermediate results in memory across a cluster where it can, which can make it 10–100× faster than Hadoop MapReduce for iterative algorithms that would otherwise re-read data from disk on every pass.

Core Abstractions

Abstraction | What it is
RDD | Resilient Distributed Dataset: the original Spark abstraction
DataFrame | Distributed table with schema, preferred for most use cases
Dataset | Typed DataFrame, available in JVM languages (Scala/Java)
Spark SQL | SQL interface on top of DataFrames
MLlib | Spark's distributed ML library
Structured Streaming | Real-time stream processing with the DataFrame API

Code Example

spark_job.py
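from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.appName("MLPipeline").getOrCreate()

# Reading is lazy: nothing is loaded until an action such as show() runs.
# Note: outside Amazon EMR, S3 paths usually need the s3a:// scheme.
df = spark.read.parquet("s3://bucket/data/")

# Average of positive scores per category, highest average first.
result = (
    df.filter(col("score") > 0)
    .groupBy("category")
    .agg(avg("score").alias("avg_score"))
    .orderBy("avg_score", ascending=False)
)

result.show(10)  # action: triggers the job and prints the first 10 rows
spark.stop()  # release cluster resources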
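
As a rough mental model of how a filter, groupBy, and average like the one above can run across a cluster, the sketch below does the same computation in plain Python over two hand-made "partitions". It is an illustration only, not Spark's actual execution engine; `partial_aggregate`, `merge`, and the sample rows are hypothetical.

```python
from collections import defaultdict


def partial_aggregate(partition):
    """Per-partition step: filter rows, then keep (sum, count) per category."""
    acc = defaultdict(lambda: [0.0, 0])
    for category, score in partition:
        if score > 0:  # same predicate as filter(col("score") > 0)
            acc[category][0] += score
            acc[category][1] += 1
    return dict(acc)


def merge(partials):
    """Cluster-wide step: combine partial sums and counts, then divide."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for category, (score_sum, count) in partial.items():
            total[category][0] += score_sum
            total[category][1] += count
    return {category: s / n for category, (s, n) in total.items()}


# Two partitions, as if the rows lived on two different workers.
partitions = [
    [("a", 4.0), ("b", 1.0), ("a", -1.0)],
    [("a", 2.0), ("b", 3.0)],
]

avg_by_category = merge(partial_aggregate(p) for p in partitions)
ranked = sorted(avg_by_category.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

The design point is that an average cannot be merged directly, but a (sum, count) pair can, so each worker only ships a small summary rather than its raw rows.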

experiment tracking

MLflow — Experiment Tracking

Track, compare, and reproduce ML experiments. MLflow records parameters, metrics, artifacts, and code versions — so you always know which run produced the best model.

MLflow Components

Component | What it does
Tracking | Log parameters, metrics, and artifacts for each run
Projects | Package ML code for reproducible runs
Models | Standard model packaging format for multiple deployment targets
Registry | Centralized model store with versioning and staging

Code Example

mlflow_tracking.py
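import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); replace with the real house-price dataset.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

alpha = 0.5
mlflow.set_experiment("house-price-predictor")

with mlflow.start_run() as run:
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("model", "ridge")

    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

    mlflow.log_metric("rmse", rmse)
    mlflow.sklearn.log_model(model, "model")  # stored under the run's artifacts
    print(f"Run ID: {run.info.run_id}")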

kubernetes for ml

Kubernetes for ML

Container orchestration at scale. Kubernetes manages your ML workloads across clusters — scheduling training jobs, scaling inference services, and ensuring reliability.

Key Objects

Object | Role
Pod | Smallest deployable unit: one or more containers
Deployment | Manages replicas of a Pod, handles rolling updates
Service | Exposes Pods via a stable IP or DNS name
ConfigMap / Secret | Inject config and credentials without rebuilding images
Job / CronJob | Run batch tasks (model training) once or on a schedule
PersistentVolume | Durable storage for datasets and model artifacts
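
To show what a Job object for model training looks like, the sketch below builds a `batch/v1` Job manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The helper `training_job_manifest`, the image name, and the resource sizes are hypothetical values picked for illustration.

```python
import json


def training_job_manifest(name, image, command, cpu="2", memory="4Gi", backoff_limit=2):
    """Return a Kubernetes batch/v1 Job manifest for a one-off training run."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # How many times to retry a failed Pod before marking the Job failed.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    # Jobs require Never or OnFailure; Always is not allowed.
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,
                            "command": command,
                            "resources": {"requests": {"cpu": cpu, "memory": memory}},
                        }
                    ],
                }
            },
        },
    }


# Hypothetical image and entry point.
manifest = training_job_manifest(
    name="train-model",
    image="registry.example.com/ml/trainer:latest",
    command=["python", "train.py"],
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running `kubectl apply -f` on it would submit the Job; wrapping the same Pod template in a CronJob would run it on a schedule instead.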

aws sagemaker

AWS SageMaker

Fully managed ML platform — from training to deployment. SageMaker handles infrastructure provisioning, model hosting, batch transform, and A/B testing in production.

Core Workflow

S3 (data) → Training Job → Model Registry → Endpoint → Your App

Key Features

Managed Spot Training · Autopilot (AutoML) · Feature Store · Pipelines · Studio · JumpStart
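
The first hop of the workflow (S3 data into a Training Job) with Managed Spot Training enabled can be sketched as the request body for the SageMaker `CreateTrainingJob` API. The code below only builds and validates the dictionary; it makes no AWS call. The helper name, bucket, role ARN, image URI, and instance type are all hypothetical placeholders.

```python
def spot_training_request(job_name, image_uri, role_arn, train_s3_uri, output_s3_uri,
                          max_runtime_s=3600, max_wait_s=7200):
    """Build a CreateTrainingJob request that uses Managed Spot Training."""
    # With spot training, the total wait (including interruptions) must be
    # at least as long as the allowed training runtime.
    if max_wait_s < max_runtime_s:
        raise ValueError("max_wait_s must be >= max_runtime_s")
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {"TrainingImage": image_uri, "TrainingInputMode": "File"},
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30},
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s, "MaxWaitTimeInSeconds": max_wait_s},
        "EnableManagedSpotTraining": True,
    }


# Placeholder identifiers; none of these resources exist.
request = spot_training_request(
    job_name="house-price-train-001",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/trainer:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_s3_uri="s3://example-bucket/data/train/",
    output_s3_uri="s3://example-bucket/models/",
)
print(request["StoppingCondition"])

# With credentials configured, the request could be submitted with:
#   boto3.client("sagemaker").create_training_job(**request)
```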


docker for ml

Docker for ML

Package your ML models and their dependencies into containers. Docker eliminates "works on my machine" problems and enables consistent deployment across dev, staging, and production.

ML Dockerfile Pattern

Dockerfile
# Use Python 3.11 slim base
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first (for layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy app code
COPY . .

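# Copy model artifact (redundant if COPY . . above already brought in models/;
# kept so the image's dependency on the artifact is explicit)
COPY models/model.pkl /app/models/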

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

ci/cd for models

CI/CD for ML Models

Automate the path from code commit to deployed model. ML-specific CI/CD adds model performance gates, data validation, and rollback triggers on top of standard software CI/CD.

ML CI/CD Pipeline

Git Push → Run Tests → Train Model → Eval Gate → Deploy

Eval Gate: block deployment if new model is worse than baseline by more than X%
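
One way the eval gate could be written is sketched below. It assumes a single scalar metric for the new model and for the baseline, and treats X as a relative percentage of the baseline. The function names and the example numbers are hypothetical.

```python
def passes_gate(new_score, baseline_score, max_drop_pct, higher_is_better=True):
    """Return True if the new model is not worse than baseline by more than max_drop_pct."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero for a relative comparison")
    if higher_is_better:  # e.g. accuracy, AUC
        worse_by = baseline_score - new_score
    else:  # e.g. RMSE, log loss
        worse_by = new_score - baseline_score
    drop_pct = worse_by / abs(baseline_score) * 100
    return drop_pct <= max_drop_pct


def gate_exit_code(new_score, baseline_score, max_drop_pct, higher_is_better=True):
    """Map the gate result to a process exit code: 0 lets CI continue, 1 blocks deploy."""
    return 0 if passes_gate(new_score, baseline_score, max_drop_pct, higher_is_better) else 1


# Accuracy dropped from 0.91 to 0.90: about 1.1% worse, inside a 2% tolerance.
print(gate_exit_code(0.90, 0.91, max_drop_pct=2.0))

# RMSE rose from 10.0 to 10.5: 5% worse, outside a 2% tolerance.
print(gate_exit_code(10.5, 10.0, max_drop_pct=2.0, higher_is_better=False))
```

In an evaluation script, passing the result to `sys.exit(...)` makes the CI step fail whenever the gate returns 1, so later steps such as deployment do not run.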

GitHub Actions ML Example

.github/workflows/ml-pipeline.yml
name: ML Pipeline
on: [push]
jobs:
  train-and-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -r requirements.txt
      - run: python train.py
      - run: python evaluate.py  # exits 1 if model below threshold
      - run: python deploy.py     # pushes to SageMaker or Docker
