I wanted an end-to-end, locally reproducible streaming system that:
- Pulls synthetic user data from a public API,
- Transports it via a durable queue,
- Processes it with a streaming engine, and
- Persists it in a queryable store.
Everything runs entirely on WSL2 + Docker, nine containers in total: the Airflow webserver, scheduler, and Postgres metadata database; the Kafka broker, ZooKeeper, and Control Center; a Spark master and worker; and Cassandra.
I chose a well-known “starter” architecture for this: Airflow → Kafka → Spark Structured Streaming → Cassandra.
This project runs like a small real-time mail system.
On a schedule, Airflow calls a public “random user” API to generate realistic test records and drops each one as a message onto Kafka, which acts as a durable queue so nothing is lost if individual components start or stop.
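To make the producer side concrete, here is a minimal sketch of what that Airflow task could look like, using the TaskFlow API, `requests`, and `kafka-python`. The API URL (randomuser.me), topic name (`users_created`), broker address (`broker:9092`), and the exact fields kept are illustrative assumptions, not necessarily the project's real values.

```python
# Sketch only: topic, broker host, schedule, and field selection are assumptions.
import json

import pendulum
import requests
from airflow.decorators import dag, task
from kafka import KafkaProducer


@dag(schedule="@hourly", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def user_ingest():
    @task
    def fetch_and_publish():
        # Pull one synthetic user record from the public API.
        resp = requests.get("https://randomuser.me/api/", timeout=10)
        resp.raise_for_status()
        user = resp.json()["results"][0]

        # Keep only the fields the downstream Spark job expects.
        record = {
            "first_name": user["name"]["first"],
            "last_name": user["name"]["last"],
            "email": user["email"],
            "phone": user["phone"],
        }

        # Publish the record as JSON onto the Kafka topic.
        producer = KafkaProducer(
            bootstrap_servers="broker:9092",
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )
        producer.send("users_created", record)
        producer.flush()

    fetch_and_publish()


user_ingest()
```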
Spark Structured Streaming continuously reads those messages, cleans and structures the fields (name, email, phone, etc.), and writes the results into Cassandra, a database well suited to this kind of high-volume write workload.
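The consumer side is conceptually a single PySpark job. The sketch below assumes the same topic and broker as above, an illustrative keyspace/table (`spark_streams.created_users`) that may not match the project's real schema, and that the Kafka source and spark-cassandra-connector packages are on the Spark classpath (e.g. via `--packages`).

```python
# Sketch only: keyspace/table and column names mirror the producer sketch above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Requires spark-sql-kafka and the spark-cassandra-connector on the classpath.
spark = (
    SparkSession.builder.appName("user_stream")
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
    StructField("phone", StringType()),
])

# Read raw Kafka bytes and unpack the JSON payload into typed columns.
users = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "users_created")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Continuously append the parsed rows to the Cassandra table.
query = (
    users.writeStream.format("org.apache.spark.sql.cassandra")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .option("checkpointLocation", "/tmp/users_checkpoint")
    .start()
)
query.awaitTermination()
```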
To keep it frictionless, the Spark job auto-creates the Cassandra keyspace/table on first run, and the Airflow task ensures the Kafka topic exists before producing, so a fresh clone “just works.”
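A sketch of that bootstrap logic, using kafka-python's admin client and the DataStax cassandra-driver, might look like the following; in the project itself the table DDL runs inside the Spark job and the topic check inside the Airflow task, and all names are the same illustrative assumptions as above.

```python
# Sketch only: idempotent bootstrap so a fresh clone can run without manual setup.
from cassandra.cluster import Cluster
from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError


def ensure_topic(topic: str = "users_created") -> None:
    # Create the topic up front so the very first produce cannot fail.
    admin = KafkaAdminClient(bootstrap_servers="broker:9092")
    try:
        admin.create_topics(
            [NewTopic(name=topic, num_partitions=1, replication_factor=1)]
        )
    except TopicAlreadyExistsError:
        pass  # Already there; nothing to do.
    finally:
        admin.close()


def ensure_cassandra_schema() -> None:
    # "IF NOT EXISTS" makes this safe to run on every start.
    cluster = Cluster(["cassandra"])
    session = cluster.connect()
    session.execute(
        "CREATE KEYSPACE IF NOT EXISTS spark_streams "
        "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
    )
    session.execute(
        "CREATE TABLE IF NOT EXISTS spark_streams.created_users ("
        "email text PRIMARY KEY, first_name text, last_name text, phone text)"
    )
    cluster.shutdown()
```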
You can verify the flow end-to-end by triggering the Airflow task, watching messages appear in the Kafka Control Center UI, and checking that row counts grow in Cassandra.
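For the Cassandra end of that check, a few lines of Python are enough (a `SELECT count(*)` in cqlsh works just as well), assuming the illustrative keyspace/table above and that the container's 9042 port is published to the host.

```python
# Sketch only: quick row-count check against the illustrative table above.
from cassandra.cluster import Cluster

cluster = Cluster(["localhost"])  # assumes Cassandra's 9042 port is published by Docker
session = cluster.connect("spark_streams")
row = session.execute("SELECT count(*) FROM created_users").one()
print(f"rows in created_users: {row[0]}")
cluster.shutdown()
```

Re-running this after a few Airflow triggers should show the count climbing.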
The result: a fully working local pipeline that ingests synthetic user events, streams them through Kafka, transforms and parses them in Spark, and lands structured rows in Cassandra.