MLOps and Tools Course — Mitra AI Projects

053 · airflow patterns

Apache Airflow

Workflow orchestration for data pipelines and ML jobs. Define pipelines as Directed Acyclic Graphs (DAGs) in Python. Schedule, monitor, and retry jobs from a web UI.

Core Concepts

Concept	Meaning
DAG	Directed Acyclic Graph — your pipeline definition
Task	A single unit of work in a DAG, created by instantiating an operator (PythonOperator, BashOperator, etc.)
Operator	Template for a task type — Python, Bash, SQL, HTTP, etc.
Scheduler	Parses DAGs and triggers runs based on schedule intervals
XCom	Cross-communication — pass small values between tasks
Sensor	Waits for a condition (file exists, API ready, time elapsed)
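
To make the DAG, Task, and XCom rows concrete, here is a minimal pure-Python sketch that does not use Airflow itself. It orders tasks with the standard library's `graphlib.TopologicalSorter` and passes return values through a plain dict standing in for XCom. The task names, the `run_dag` helper, and the `xcom` dict are hypothetical and chosen only for illustration; real Airflow persists XComs in its metadata database.

```python
from graphlib import TopologicalSorter

def run_dag(tasks, upstream):
    """Run callables in dependency order; return (order, xcom).

    tasks: task_id -> callable taking the xcom dict
    upstream: task_id -> set of task_ids that must finish first
    """
    # static_order() raises CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(upstream).static_order())
    xcom = {}  # hypothetical stand-in for Airflow's XCom store
    for task_id in order:
        xcom[task_id] = tasks[task_id](xcom)
    return order, xcom

tasks = {
    "extract": lambda xcom: [3, 1, 2],
    "train": lambda xcom: sorted(xcom["extract"]),
    "eval": lambda xcom: len(xcom["train"]),
}
upstream = {"extract": set(), "train": {"extract"}, "eval": {"train"}}

order, xcom = run_dag(tasks, upstream)
print(order)         # ['extract', 'train', 'eval']
print(xcom["eval"])  # 3
```

Each task only reads values written by tasks that ran before it, which is the guarantee the dependency graph provides.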

Code Example

ml_pipeline_dag.py

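from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def train_model(**ctx):
    # Placeholder: load data, train, and save the model.
    print("Training...")


def evaluate_model(**ctx):
    # Placeholder: load the saved model and compute metrics.
    print("Evaluating...")


# The `schedule` argument requires Airflow 2.4+ (older versions use schedule_interval).
with DAG(
    "ml_pipeline",
    schedule="0 2 * * *",  # cron expression: daily at 02:00
    start_date=datetime(2026, 1, 1),
    catchup=False,  # do not backfill runs between start_date and today
) as dag:
    train = PythonOperator(task_id="train", python_callable=train_model)
    eval_ = PythonOperator(task_id="eval", python_callable=evaluate_model)

    train >> eval_  # eval runs only after train succeeds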
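
The cron expression `0 2 * * *` means minute 0, hour 2, every day. As a rough illustration of what that implies for run times, the sketch below computes the next 02:00 after a given moment using only the standard library. `next_daily_run` is a hypothetical helper, not part of Airflow, and it ignores time zones and Airflow's data-interval semantics.

```python
from datetime import datetime, timedelta

def next_daily_run(now, hour=2, minute=0):
    """Return the first datetime strictly after `now` at hour:minute."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

print(next_daily_run(datetime(2026, 1, 1, 1, 30)))  # 2026-01-01 02:00:00
print(next_daily_run(datetime(2026, 1, 1, 2, 0)))   # 2026-01-02 02:00:00
```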

Interactive Notebook

Notebook: Airflow

DAG simulation, operators, dependencies

Quiz

Test your understanding -- 10 questions, 70% to pass.

Apache Spark & Big Data

Core Abstractions

Abstraction	What it is
RDD	Resilient Distributed Dataset — the original Spark abstraction
DataFrame	Distributed table with schema — preferred for most use cases
Dataset	Typed DataFrame — JVM languages (Scala/Java)
Spark SQL	SQL interface on top of DataFrames
MLlib	Spark's distributed ML library
Structured Streaming	Real-time stream processing with DataFrame API

Code Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

spark = SparkSession.builder.appName("MLPipeline").getOrCreate()

df = spark.read.parquet("s3://bucket/data/")

result = (
    df.filter(col("score") > 0)
    .groupBy("category")
    .agg(avg("score").alias("avg_score"))
    .orderBy("avg_score", ascending=False)
)
result.show(10)
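
To see what the query computes without a Spark cluster, the following sketch applies the same four steps (filter, group, average, sort descending) to a small in-memory list. The sample rows and the `avg_score_by_category` helper are made up for illustration; this is ordinary single-machine Python, not Spark.

```python
from collections import defaultdict

def avg_score_by_category(rows):
    """Mirror filter(score > 0) -> groupBy(category) -> avg -> orderBy desc."""
    groups = defaultdict(list)
    for row in rows:
        if row["score"] > 0:  # same condition as filter(col("score") > 0)
            groups[row["category"]].append(row["score"])
    averages = [(cat, sum(vals) / len(vals)) for cat, vals in groups.items()]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)

rows = [  # hypothetical sample data
    {"category": "a", "score": 4.0},
    {"category": "a", "score": 2.0},
    {"category": "b", "score": 5.0},
    {"category": "b", "score": -1.0},  # dropped by the filter
    {"category": "c", "score": 0.0},   # dropped: not strictly positive
]
print(avg_score_by_category(rows))  # [('b', 5.0), ('a', 3.0)]
```

Spark runs the same logic in parallel across partitions, which is why the DataFrame version scales to data that does not fit on one machine.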

MLflow: Experiment Tracking

MLflow Components

Component	What it does
Tracking	Log parameters, metrics, and artifacts for each run
Projects	Package ML code for reproducible runs
Models	Standard model packaging format for multiple deployment targets
Registry	Centralized model store with versioning and staging

Code Example

import mlflow
import mlflow.sklearn

# Assumes model, X_train, y_train, X_test, y_test, and compute_rmse
# are defined earlier in the script.
mlflow.set_experiment("house-price-predictor")

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_param("model", "ridge")

    model.fit(X_train, y_train)
    rmse = compute_rmse(model, X_test, y_test)

    mlflow.log_metric("rmse", rmse)
    mlflow.sklearn.log_model(model, "model")
    print(f"Run ID: {mlflow.active_run().info.run_id}")
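
The example calls `compute_rmse`, which the source does not define. One plausible definition, assuming `model` exposes a scikit-learn style `predict` method, is sketched below. `MeanModel` is a hypothetical stand-in used only so the snippet runs without scikit-learn.

```python
import math

def compute_rmse(model, X_test, y_test):
    """Root mean squared error of model.predict(X_test) against y_test."""
    preds = model.predict(X_test)
    squared_errors = [(p - y) ** 2 for p, y in zip(preds, y_test)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

class MeanModel:
    """Hypothetical stand-in: always predicts a fixed value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]

rmse = compute_rmse(MeanModel(2.0), [[0], [1]], [1.0, 3.0])
print(rmse)  # 1.0
```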

Kubernetes for ML

Key Objects

Object	Role
Pod	Smallest deployable unit — one or more containers
Deployment	Manages replicas of a Pod, handles rolling updates
Service	Exposes Pods via a stable IP or DNS name
ConfigMap / Secret	Inject config and credentials without rebuilding images
Job / CronJob	Run batch tasks (model training) once or on a schedule
PersistentVolume	Durable storage for datasets and model artifacts
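
As a sketch of the Job / CronJob row, the snippet below builds a CronJob manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The job name, image, and command are placeholders, not values from the course.

```python
import json

def training_cronjob(name, image, schedule):
    """Build a batch/v1 CronJob manifest that runs one training container."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,  # standard cron syntax
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    "name": name,
                                    "image": image,
                                    "command": ["python", "train.py"],
                                }
                            ],
                            # Job pods must use Never or OnFailure here.
                            "restartPolicy": "OnFailure",
                        }
                    }
                }
            },
        },
    }

manifest = training_cronjob(
    name="nightly-train",                    # placeholder name
    image="registry.example.com/ml:latest",  # placeholder image
    schedule="0 2 * * *",
)
print(json.dumps(manifest, indent=2))
```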

Docker for ML

ML Dockerfile Pattern

# Use Python 3.11 slim base
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first (for layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy app code
COPY . .

# Copy model artifact
COPY models/model.pkl /app/models/

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

CI/CD for ML Models

GitHub Actions ML Example

name: ML Pipeline
on: [push]
jobs:
  train-and-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -r requirements.txt
      - run: python train.py
      - run: python evaluate.py   # exits 1 if model below threshold
      - run: python deploy.py     # pushes to SageMaker or Docker
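
The comment on `evaluate.py` says it exits with status 1 when the model falls below a threshold. Because a failing step stops the job by default, that non-zero exit is what keeps `deploy.py` from running. A minimal sketch of such a gate follows. The metric values, the threshold, and the `gate_exit_code` helper are assumptions; the course does not show the real script.

```python
def gate_exit_code(metric, threshold):
    """Return the process exit code: 0 if metric meets threshold, else 1."""
    return 0 if metric >= threshold else 1

# Hypothetical values; a real evaluate.py would compute the metric
# on a held-out set and finish with sys.exit(gate_exit_code(...)).
print(gate_exit_code(0.93, threshold=0.90))  # 0
print(gate_exit_code(0.85, threshold=0.90))  # 1
```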
