Scrounged

2026 · self-hosted · runs pullfirst.com

pullfirst.com runs on scrapes of 50+ permit systems that change without notice. Ops is the control plane that makes that survivable: 50+ jobs scheduled, chained, retried, and audited from one dashboard. Nothing runs by hand.

What it runs

50+ scraper, import, and sync jobs, each with typed parameters and its own cron cadence.
A chainer fires downstream imports the moment upstream collection lands: scrape finishes, import starts, nobody watches it happen.
Retry policies decide what a failure means before a human has to. Scrapes resume from checkpoints, so a source that dies mid-run costs a resume, not a dataset.
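
The resume behaviour can be sketched roughly as follows. This is a minimal illustration, not the actual Ops code: `run_with_checkpoint`, the JSON checkpoint file, and treating `ConnectionError` as the retryable failure are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoint(pages, fetch, checkpoint_path, max_attempts=3):
    """Fetch pages in order, saving progress so a retry resumes instead of restarting."""
    path = Path(checkpoint_path)
    # Load earlier progress if a previous attempt left a checkpoint behind.
    state = json.loads(path.read_text()) if path.exists() else {"next": 0, "rows": []}
    for attempt in range(1, max_attempts + 1):
        try:
            while state["next"] < len(pages):
                state["rows"].extend(fetch(pages[state["next"]]))
                state["next"] += 1
                path.write_text(json.dumps(state))  # checkpoint after every page
            return state["rows"], attempt
        except ConnectionError:
            # Hypothetical policy: connection failures are retryable, up to a limit.
            if attempt == max_attempts:
                raise
    return state["rows"], max_attempts


calls = []


def flaky_fetch(page):
    """Stand-in source that dies once on page 2."""
    calls.append(page)
    if page == 2 and calls.count(2) == 1:
        raise ConnectionError("source died mid-run")
    return [f"permit-{page}"]


with tempfile.TemporaryDirectory() as tmp:
    rows, attempts = run_with_checkpoint([1, 2, 3], flaky_fetch, Path(tmp) / "job.ckpt.json")
```

In this run, page 1 is fetched once: the second attempt picks up at page 2, which is the point of the checkpoint.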

[Figure: PullFirst ops fleet timeline, 30 days of scraper and import runs across every job, with per-job health badges.]
fig. 1 · the fleet: every job, every run, 30 days

The dashboard

One screen over the whole fleet. Every job, every run, logs streaming live over SSE. Every run keeps its parameters, logs, and outcome.
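
For readers unfamiliar with SSE, the wire format behind live log streaming looks roughly like this. The sketch uses only the standard library; `sse_event`, `stream_log`, and the `log` and `done` event names are hypothetical, not the dashboard's actual protocol.

```python
def sse_event(data, event=None, event_id=None):
    """Serialize one message in the text/event-stream wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # A multi-line payload needs one "data:" field per line.
    lines.extend(f"data: {part}" for part in str(data).split("\n"))
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def stream_log(log_lines):
    """Yield each log line as an SSE frame, then a closing 'done' event."""
    for number, line in enumerate(log_lines, start=1):
        yield sse_event(line, event="log", event_id=number)
    yield sse_event("finished", event="done")


frames = list(stream_log(["fetching page 1", "wrote 40 rows"]))
```

In a Flask app, a generator like `stream_log` can be returned as `Response(stream_log(...), mimetype="text/event-stream")`, and the browser's `EventSource` consumes it.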

Materialization tracking: every table traces back to the run that built it, staleness on display.
One briefing endpoint summarizes the fleet: what ran, what failed, what’s stale. The first thing checked every morning.
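
One way to picture the briefing is as a pure function over run records and the table-to-run mapping. This is a sketch under assumptions: the record fields, the 24-hour staleness threshold, and the table and job names are invented for illustration, and the real briefing sits on Postgres materialized views rather than Python lists.

```python
from datetime import datetime, timedelta, timezone


def briefing(runs, tables, now, max_age=timedelta(hours=24)):
    """Summarize the fleet, leading with what is stale."""
    by_id = {run["id"]: run for run in runs}
    recent = [run for run in runs if now - run["finished_at"] <= max_age]
    # A table is stale when the run that built it finished too long ago.
    stale = sorted(
        table for table, run_id in tables.items()
        if now - by_id[run_id]["finished_at"] > max_age
    )
    return {
        "stale": stale,
        "failed": sorted({run["job"] for run in recent if run["outcome"] == "failed"}),
        "ran": sorted({run["job"] for run in recent}),
    }


now = datetime(2026, 1, 15, 8, 0, tzinfo=timezone.utc)
runs = [
    {"id": "r0", "job": "import_b", "outcome": "ok", "finished_at": now - timedelta(days=3)},
    {"id": "r1", "job": "scrape_a", "outcome": "ok", "finished_at": now - timedelta(hours=2)},
    {"id": "r2", "job": "import_a", "outcome": "failed", "finished_at": now - timedelta(hours=1)},
]
tables = {"permits_a": "r1", "permits_b": "r0"}  # table -> run that built it
summary = briefing(runs, tables, now)
```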

[Figure: a finished scraper run in the ops dashboard, showing outcome, parameters, delta against the previous run, and its full stored logs.]
fig. 2 · one run: parameters, delta vs previous, stored logs

Shipping to production

The pipeline runs locally against a local Postgres; pullfirst.com reads from managed Postgres in the cloud. A branch-swap sync moves finished datasets between them: copy into a fresh branch of the production database, then swap. The site never reads a half-written import.
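
The copy-then-swap idea can be illustrated with a toy in-memory model. `BranchedDatabase` is a stand-in invented for this sketch, not a real Postgres client or any provider's branching API, and the branch name is hypothetical.

```python
class BranchedDatabase:
    """Toy stand-in for a database with named branches and one live pointer."""

    def __init__(self, rows):
        self.branches = {"main": list(rows)}
        self.live = "main"

    def read(self):
        """What the site would see: a copy of the live branch only."""
        return list(self.branches[self.live])


def branch_swap_sync(db, dataset, branch_name, on_row=None):
    """Copy the dataset into a fresh branch, then repoint readers in one step."""
    staging = db.branches[branch_name] = []
    for row in dataset:
        staging.append(row)  # readers still see the old live branch
        if on_row:
            on_row(db.read())  # record what a reader sees mid-copy
    previous, db.live = db.live, branch_name  # the swap
    return previous


db = BranchedDatabase(["old-1", "old-2"])
seen_during_copy = []
previous = branch_swap_sync(
    db, ["new-1", "new-2", "new-3"], "import-2026-01-15", on_row=seen_during_copy.append
)
```

Every read taken during the copy returns the old dataset in full. The new rows only become visible after the swap, and if the copy fails partway the live pointer never moves.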

How it’s built

Python end to end: Flask API, cron scheduling, SSE streaming, Postgres state with materialized views behind the briefing. The dashboard is a Preact app, bundled and served by the ops server itself.

Underneath sits the ETL layer the jobs drive: collection, normalization, the address grammar, entity resolution, imports. Local tooling, production data: these are the same runs that build pullfirst.com.

The hard parts

Sources break silently. A jurisdiction redesigns its portal and the scrape returns plausible-looking nothing. Hence audit trails, staleness tracking, and a briefing that leads with what’s stale.
The whole fleet runs on one desktop. Checkpoints and resumable runs mean a crash mid-scrape is an inconvenience, not an incident.
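
A check for the "plausible-looking nothing" case could compare each run's row count against the previous run, in the spirit of the delta shown in fig. 2. This is a sketch only: `classify_delta`, its labels, and the 50% threshold are assumptions, not the rules Ops actually applies.

```python
def classify_delta(row_count, previous_count, min_ratio=0.5):
    """Label a finished run by its row count relative to the previous run."""
    if previous_count > 0 and row_count == 0:
        return "empty"  # the scrape succeeded but returned nothing
    if previous_count > 0 and row_count < previous_count * min_ratio:
        return "shrunk"  # a sharp drop worth a look before importing
    return "ok"


labels = [
    classify_delta(0, 1200),
    classify_delta(300, 1200),
    classify_delta(1180, 1200),
    classify_delta(0, 0),
]
```

A run labeled `empty` or `shrunk` exited cleanly, which is exactly why an exit code alone would not catch it.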