Apache Airflow
Workflow orchestration for data pipelines and ML jobs. Define pipelines as Directed Acyclic Graphs (DAGs) in Python. Schedule, monitor, and retry jobs from a web UI.
Core Concepts
| Concept | Meaning |
|---|---|
| DAG | Directed Acyclic Graph — your pipeline definition |
| Task | A single unit of work (PythonOperator, BashOperator, etc.) |
| Operator | Template for a task type — Python, Bash, SQL, HTTP, etc. |
| Scheduler | Parses DAGs and triggers runs based on schedule intervals |
| XCom | Cross-communication — pass small values between tasks |
| Sensor | Waits for a condition (file exists, API ready, time elapsed) |
Code Example
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def train_model(**ctx):
# load data, train, save model
print("Training...")
def evaluate_model(**ctx):
print("Evaluating...")
with DAG("ml_pipeline",
schedule="0 2 * * *", # daily at 2am
start_date=datetime(2026, 1, 1)) as dag:
train = PythonOperator(task_id="train", python_callable=train_model)
eval_ = PythonOperator(task_id="eval", python_callable=evaluate_model)
train >> eval_ # train runs first