I wanted an end-to-end, locally reproducible streaming system that:
- Pulls synthetic user data from a public API,
- Transports it via a durable queue,
- Processes it with a streaming engine, and
- Persists it in a queryable store.
Everything runs entirely on WSL2 + Docker, nine containers in total: the Airflow webserver, scheduler, and Postgres metadata database; the Kafka broker, ZooKeeper, and Control Center; a Spark master and worker; and Cassandra.
I chose a well-known “starter” architecture for this: Airflow → Kafka → Spark Structured Streaming → Cassandra.
This project runs like a small real-time mail system.
On a schedule, Airflow calls a public “random user” API to generate realistic test records and drops each one as a message onto Kafka, which acts as a durable queue so nothing is lost if individual components start or stop.
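To make the producer side concrete, here is a minimal sketch of what that Airflow task could look like, using the TaskFlow API, `requests`, and `kafka-python`. The API URL (randomuser.me), topic name (`users_created`), broker address (`broker:9092`), and the exact fields kept are illustrative assumptions, not necessarily the project's real values.

```python
# Sketch only: topic, broker host, schedule, and field selection are assumptions.
import json

import pendulum
import requests
from airflow.decorators import dag, task
from kafka import KafkaProducer


@dag(schedule="@hourly", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def user_ingest():
    @task
    def fetch_and_publish():
        # Pull one synthetic user record from the public API.
        resp = requests.get("https://randomuser.me/api/", timeout=10)
        resp.raise_for_status()
        user = resp.json()["results"][0]

        # Keep only the fields the downstream Spark job expects.
        record = {
            "first_name": user["name"]["first"],
            "last_name": user["name"]["last"],
            "email": user["email"],
            "phone": user["phone"],
        }

        # Publish the record as JSON onto the Kafka topic.
        producer = KafkaProducer(
            bootstrap_servers="broker:9092",
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )
        producer.send("users_created", record)
        producer.flush()

    fetch_and_publish()


user_ingest()
```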
Spark Structured Streaming continuously reads those messages, cleans and structures the fields (name, email, phone, etc.), and writes the results into Cassandra, a database well suited to this kind of high-volume write workload.
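The consumer side is conceptually a single PySpark job. The sketch below assumes the same topic and broker as above, an illustrative keyspace/table (`spark_streams.created_users`) that may not match the project's real schema, and that the Kafka source and spark-cassandra-connector packages are on the Spark classpath (e.g. via `--packages`).

```python
# Sketch only: keyspace/table and column names mirror the producer sketch above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Requires spark-sql-kafka and the spark-cassandra-connector on the classpath.
spark = (
    SparkSession.builder.appName("user_stream")
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
    StructField("phone", StringType()),
])

# Read raw Kafka bytes and unpack the JSON payload into typed columns.
users = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "users_created")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Continuously append the parsed rows to the Cassandra table.
query = (
    users.writeStream.format("org.apache.spark.sql.cassandra")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .option("checkpointLocation", "/tmp/users_checkpoint")
    .start()
)
query.awaitTermination()
```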
To keep it frictionless, the Spark job auto-creates the Cassandra keyspace/table on first run, and the Airflow task ensures the Kafka topic exists before producing, so a fresh clone “just works.”
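A sketch of that bootstrap logic, using kafka-python's admin client and the DataStax cassandra-driver, might look like the following; in the project itself the table DDL runs inside the Spark job and the topic check inside the Airflow task, and all names are the same illustrative assumptions as above.

```python
# Sketch only: idempotent bootstrap so a fresh clone can run without manual setup.
from cassandra.cluster import Cluster
from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError


def ensure_topic(topic: str = "users_created") -> None:
    # Create the topic up front so the very first produce cannot fail.
    admin = KafkaAdminClient(bootstrap_servers="broker:9092")
    try:
        admin.create_topics(
            [NewTopic(name=topic, num_partitions=1, replication_factor=1)]
        )
    except TopicAlreadyExistsError:
        pass  # Already there; nothing to do.
    finally:
        admin.close()


def ensure_cassandra_schema() -> None:
    # "IF NOT EXISTS" makes this safe to run on every start.
    cluster = Cluster(["cassandra"])
    session = cluster.connect()
    session.execute(
        "CREATE KEYSPACE IF NOT EXISTS spark_streams "
        "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
    )
    session.execute(
        "CREATE TABLE IF NOT EXISTS spark_streams.created_users ("
        "email text PRIMARY KEY, first_name text, last_name text, phone text)"
    )
    cluster.shutdown()
```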
You can verify the flow end-to-end by triggering the Airflow task, watching messages appear in the Kafka Control Center UI, and checking that row counts grow in Cassandra.
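For the Cassandra end of that check, a few lines of Python are enough (a `SELECT count(*)` in cqlsh works just as well), assuming the illustrative keyspace/table above and that the container's 9042 port is published to the host.

```python
# Sketch only: quick row-count check against the illustrative table above.
from cassandra.cluster import Cluster

cluster = Cluster(["localhost"])  # assumes Cassandra's 9042 port is published by Docker
session = cluster.connect("spark_streams")
row = session.execute("SELECT count(*) FROM created_users").one()
print(f"rows in created_users: {row[0]}")
cluster.shutdown()
```

Re-running this after a few Airflow triggers should show the count climbing.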
The result: a fully working local pipeline that ingests synthetic user events, streams them through Kafka, transforms and parses them in Spark, and lands structured rows in Cassandra.