Scrounged
pullfirst.com runs on scrapes of 50+ permit systems that change without notice. Ops is the control plane that makes that survivable: 50+ jobs scheduled, chained, retried, and audited from one dashboard. Nothing runs by hand.
What it runs
- 50+ scraper, import, and sync jobs, each with typed parameters and its own cron cadence.
- A chainer fires downstream imports the moment upstream collection lands: scrape finishes, import starts, nobody watches it happen.
- Retry policies decide what a failure means before a human has to. Scrapes resume from checkpoints, so a source that dies mid-run costs a resume, not a dataset.
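The chaining and checkpoint behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual Ops code: `Job`, `run_with_retry`, `run_chain`, and the checkpoint dict layout are all hypothetical names, and the real retry policies are presumably richer than a fixed attempt count.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Job:
    """Hypothetical job record; the real system also has typed params and cron."""
    name: str
    run: Callable[[dict], None]          # receives a mutable checkpoint dict
    max_attempts: int = 3                # assumed retry policy: fixed attempt count
    downstream: List[str] = field(default_factory=list)


def run_with_retry(job: Job, checkpoint: dict) -> bool:
    """Retry the job, reusing the same checkpoint so each attempt resumes."""
    for _ in range(job.max_attempts):
        try:
            job.run(checkpoint)
            return True
        except Exception:
            continue  # progress already recorded in the checkpoint is kept
    return False


def run_chain(jobs: Dict[str, Job], start: str) -> List[str]:
    """Run `start`, then fire each downstream job only after upstream succeeds."""
    completed, queue = [], [start]
    while queue:
        job = jobs[queue.pop(0)]
        if run_with_retry(job, checkpoint={}):
            completed.append(job.name)
            queue.extend(job.downstream)
    return completed


# Demo: a scrape that dies once at page 2, then resumes from its checkpoint.
pages_fetched: List[int] = []
failures = {"left": 1}


def scrape(cp: dict) -> None:
    for page in range(cp.get("next_page", 0), 4):
        if page == 2 and failures["left"] > 0:
            failures["left"] -= 1
            raise ConnectionError("source died mid-run")
        pages_fetched.append(page)
        cp["next_page"] = page + 1       # checkpoint after each page


def import_rows(cp: dict) -> None:
    cp["imported"] = len(pages_fetched)


jobs = {
    "scrape": Job("scrape", scrape, downstream=["import"]),
    "import": Job("import", import_rows),
}
result = run_chain(jobs, "scrape")
```

In this sketch the second attempt starts at page 2 rather than page 0, so no page is fetched twice, and the import only starts once the scrape has finished.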
The dashboard
One screen over the whole fleet: every job and every run, with logs streaming live over Server-Sent Events (SSE). Every run keeps its parameters, logs, and outcome.
- Materialization tracking: every table traces back to the run that built it, staleness on display.
- One briefing endpoint summarizes the fleet: what ran, what failed, what’s stale. The first thing checked every morning.
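One way to picture the briefing is as a small summary computed from run records and materialization records. The sketch below is an assumption-laden illustration: `Run`, `Materialization`, the per-table `max_age` freshness budget, and the shape of the returned dict are hypothetical, and the real endpoint reads from Postgres rather than in-memory lists.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List


@dataclass
class Run:
    run_id: str      # hypothetical identifier
    job: str
    ok: bool


@dataclass
class Materialization:
    table: str
    built_by_run: str    # the run that built this table
    built_at: datetime
    max_age: timedelta   # assumed per-table freshness budget


def briefing(runs: List[Run], tables: List[Materialization], now: datetime) -> Dict:
    """Summarize what ran, what failed, and what is stale, stale first."""
    stale = sorted(m.table for m in tables if now - m.built_at > m.max_age)
    return {
        "stale": stale,
        "failed": [r.job for r in runs if not r.ok],
        "ran": len(runs),
    }


now = datetime(2024, 1, 10, 8, 0, tzinfo=timezone.utc)
runs = [Run("r1", "scrape_a", True), Run("r2", "scrape_b", False)]
tables = [
    Materialization("permits_a", "r1", now - timedelta(hours=2), timedelta(days=1)),
    Materialization("permits_b", "r0", now - timedelta(days=3), timedelta(days=1)),
]
summary = briefing(runs, tables, now)
```

Here `permits_b` is reported as stale because its last successful build is three days old against a one-day budget, and `built_by_run` is what lets a stale table be traced back to a specific run.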
Shipping to production
The pipeline runs locally against a local Postgres; pullfirst.com reads from managed Postgres in the cloud. A branch-swap sync moves finished datasets between them: copy into a fresh branch of the production database, then swap. The site never reads a half-written import.
How it’s built
Python end to end: Flask API, cron scheduling, SSE streaming, Postgres state with materialized views behind the briefing. The dashboard is a Preact app, bundled and served by the ops server itself.
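For readers unfamiliar with SSE, the wire format is simple enough to sketch with the standard library alone. The helper names `sse_frame` and `stream_log` and the `log` / `end` event names are assumptions for illustration, not the ops server's actual code.

```python
from typing import Iterable, Iterator, Optional


def sse_frame(data: str, event: Optional[str] = None) -> str:
    """Format one Server-Sent Events message.

    Each line of the payload gets its own `data:` field, and a blank
    line terminates the message.
    """
    lines = []
    if event:
        lines.append(f"event: {event}")
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


def stream_log(log_lines: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per log line, then a closing frame."""
    for line in log_lines:
        yield sse_frame(line, event="log")
    yield sse_frame("done", event="end")


frames = list(stream_log(["fetching page 1", "fetched 40 rows"]))
```

In a Flask app, a generator like this can be returned as `Response(stream_log(...), mimetype="text/event-stream")`, which is how Flask streams a response body chunk by chunk.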
Underneath sits the ETL layer the jobs drive: collection, normalization, the address grammar, entity resolution, and imports. The tooling is local, the data is production: these are the same runs that build pullfirst.com.
The hard parts
- Sources break silently. A jurisdiction redesigns its portal and the scrape returns plausible-looking nothing. Hence audit trails, staleness tracking, and a briefing that leads with what’s stale.
- The whole fleet runs on one desktop. Checkpoints and resumable runs mean a crash mid-scrape is an inconvenience, not an incident.
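A scrape that returns "plausible-looking nothing" is hard to catch because the run itself succeeds. One possible guard, shown here purely as an illustration and not as something the source describes, is to compare a run's row count against the median of recent runs. The function name, the `min_ratio` threshold, and the choice of median are all assumptions.

```python
from statistics import median
from typing import Sequence


def looks_silently_broken(
    recent_counts: Sequence[int],
    new_count: int,
    min_ratio: float = 0.5,   # assumed threshold: flag drops below half the baseline
) -> bool:
    """Flag a run whose row count collapses relative to recent history."""
    if not recent_counts:
        # No history to compare against: only an empty result is suspicious.
        return new_count == 0
    baseline = median(recent_counts)
    return new_count < baseline * min_ratio


history = [410, 395, 420]
flag_empty = looks_silently_broken(history, 0)
flag_normal = looks_silently_broken(history, 402)
```

A check like this would complement, not replace, the audit trail and staleness tracking: it catches the run that finished cleanly but collected almost nothing.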