Previous Post
Rust visitor pattern and efficient DataFusion query federation
Jan 12, 2023 · By Patrick Skinner
READING TIME: 6 min

Open Data Monitor

How can we track open government datasets over time? Say hello to Open Data Monitor, a Socrata tracking tool powered by Seafowl and Splitgraph.

Tracking open government datasets

Say hello to Open Data Monitor, an open government dataset diff visualizer. Observe when Socrata datasets were added/deleted by day, week, and month, with upstream attribution and easy querying via the respective Splitgraph repository.

In a previous post we discussed the benefits of Seafowl's approach to HTTP caching; in particular its effective use of ETags to deliver results on the edge with excellent latency characteristics. Today, we continue on our journey to:

  • Ingest Splitgraph's Socrata catalog into Seafowl
  • Insert tables that expose added/deleted status ("observe" Socrata)
  • Swap Fly.io instances in a cost-effective way
  • Build a mini Next.js app to render it all out in the browser, making sure to balance the best of SSR SEO benefits with Seafowl's browser-friendly cache semantics.

Ingestion

In case you didn't already know, Splitgraph maintains a catalog of Socrata datasets, stored as a Splitgraph repository at splitgraph/socrata. Additionally, Splitgraph actually indexes all of the 40,000+ datasets and makes each one queryable at its own URL (like cdc-gov/20202021-nationwide-blood-donor-seroprevalence-mtc3-kq6r) and Splitgraph repository on the Data Delivery Network (DDN). It's possible to point your favorite Postgres client to the DDN and query arbitrary Socrata datasets in a familiar way. From traffic accidents in Fort Worth, TX 1, to taxi rides as a proxy for "how's NYC Covid bounce back going"2, we've built several integrations that present data in visual and interactive ways without custom plumbing. We also made Socrata roulette for generating random queries against all of the available datasets.

As useful as the DDN is, today we will demonstrate how Seafowl can make for an effective alternative that performs with low latency and generally has a smaller resource footprint. First, by starting with this catalog, we fetch all Socrata images courtesy of Splitgraph's GQL API. (Conveniently we can export directly in Parquet format!) Then, we insert the results into Seafowl tables. This whole job is powered by Github Actions, and occurs on a regular basis (currently nightly). After we've ingested the new state of all Socrata datasets, we pre-compute a diff across time of all datasets that were added or removed on a given day. We also downsample these diffs to week-over-week and month-over-month changes.

Because this involves a self-join of a table with several million rows, this currently requires a large amount of memory as you'll see later.

Dynamically scaling our Fly.io instance for ingestion

Similarly to the Seafowl tutorial, we again went with Fly.io for hosting. While the free tier generally offers enough performance for Open Data Monitor's read-heavy workload, when it comes to ingestion and building the day-over-day diff table, this time we ran into memory issues because of a self-join. It is perhaps unsurprising that as Socrata has grown over time, so have the number and size of datasets on offer. That's why we added a step in the GHA to temporarily scale up the Fly.io instance memory for ingestion purposes. We also include an insurance policy to help reduce risk of job errors causing unexpected cloud costs.

Rendering in the browser

With a loaded Seafowl instance ready to go, we can now build a Next.js app so we can fetch from Seafowl and render it in the browser.

By default, we render the current week, but the user can specify a certain day, week, or month.

Disappearing data examples

Because traversing over time is now trivial, we can surface curious moments.

For example, just a few clicks moving backwards from January 2023 yielded disappearing
detective smoking a pipe

SSR by default, but browser talks to Seafowl directly on nav

To strike a balance between SSR-conferred SEO advantages, and Seafowl's HTTP cache friendly defaults, we pre-fetch inside getInitialProps, which means the page renders properly even with JS disabled. But since we don't want the Vercel server to intermediate subsequent fetches, when the user clicks Previous/Next, their browser requests from Seafowl directly. Because we are using the GET API, we benefit from the nice cache behavior, so subsequent loads are nice and zippy, even on Fly.io's free tier, and even if the actual ingest was a memory hungry job.

Read the code

The code for all of this is available on GitHub, in two repositories:

Conclusion

We hope this post has illustrated a pathway to combine Seafowl's "analytics at the edge" benefits with Socrata's bevy of government data, in a cost-effective way. While today we focused on open government data, regardless of the data you're working with, consider checking out what Seafowl offers.

If you discover any curious dataset mysteries, or build an integration with Seafowl, we'd love to hear from you. Tweet us @splitgraph.

Happy sleuthing!

Image credits

mediamodifier, sammywilliams, ayosake