Previous Post
SELECT directly from the browser
Oct 9, 2022 · By Artjoms Iškovs
READING TIME: 6 min

Seafowl: a database for analytics at the edge

Our new project: a CDN-friendly analytical database that's up to 10x faster than PostgreSQL and up to 5x faster than Splitgraph.

For the past few months, we've been hard at work on something interesting and it's finally ready for people to start experimenting with.

Seafowl is our new standalone project, inspired by the design of modern cloud data warehouses like Snowflake and BigQuery, as well as the lessons we learned over the past four years of building Splitgraph and the Splitgraph DDN.

It's a CDN and HTTP-cache friendly database designed for analytics at the edge: dashboards, notebooks, visualizations and other data-driven read-intensive Web applications that need to be responsive while operating on large datasets.

curl https://demo.seafowl.io/q/30dcf5da00f94e3fbd28cf98242f7de3b164a440cadfaa6f7de74369ffc2177d.csv \
  -i -XGET -H "Content-Type: application/json" \
  -d@- <<EOF
{"query": "SELECT country_of_production, COUNT(*) AS count \
FROM supply_chains \
GROUP BY 1 ORDER BY 2 DESC"}
EOF

HTTP/2 200
...
cf-cache-status: HIT

{"country_of_production":"BRAZIL","count":2386600}
{"country_of_production":"ARGENTINA","count":260293}
{"country_of_production":"INDONESIA","count":155751}
{"country_of_production":"ECUADOR","count":96842}
{"country_of_production":"PARAGUAY","count":29848}
...

Fast analytics...

Seafowl is written in Rust and relies on Apache DataFusion to plan and execute queries. For analytical queries, it's:

  • up to 10x faster than vanilla PostgreSQL
  • ...or up to 4x faster than PostgreSQL with the parquet_fdw extension
  • ...or around the level of DuckDB in performance, sometimes beating it (apart from some pathological queries where it has memory issues, as well as queries that use a lot of subqueries where the optimizer might not be able to build an optimal query plan)

We have some benchmarks which aren't as in-depth as TPC-DS, but do show that Seafowl makes a lot of seemingly heavy SQL queries now feasible for powering interactive applications.

For SELECT queries, Seafowl supports a large subset of the PostgreSQL dialect. If there's something missing, you can write a user-defined function for Seafowl in anything that compiles to WebAssembly.

You can write data to Seafowl with:

You can also "bake" a dataset into a Docker image by bundling Seafowl together with its data. This is a nifty scaling strategy for smaller (well, up to 1GB, which can be as big as the 55M row subset of the New York Taxi dataset we use for our benchmarks) and read-only data.

...at the edge

"The edge" is a handwavy and marketing-y term, but it describes what Seafowl is all about well enough. You can deploy Seafowl to the edge or cache query results at the edge.

Seafowl ships as a single 50MB binary that starts in 10ms, making it easily deployable to modern serverless environments, anywhere in the world. Our own tutorial uses Fly.io as an example.

Seafowl's query execution API follows HTTP cache semantics. This means you can put Seafowl behind a CDN like Cloudflare or a cache like Varnish and have query results cached and delivered to your users in milliseconds.

Even without a dedicated cache, you can get the benefits of caching query results in your user's browser (to see it in action, go to our demo, press F12, go to the Network tab and refresh the page).

Try it out!

Seafowl is now available to download and experiment with, though, like with all early stage projects, it likely has plenty of bugs and missing features. As we continue its development, we might break APIs or change storage formats in a backwards-incompatible way.

Feel free to check out our Observable demo, go through our quickstart or through a longer tutorial in which you'll deploy Seafowl to Fly.io, put it behind Cloudflare or Varnish and reproduce our Observable demo notebook.

When something breaks, please do raise a GitHub issue. And, if it doesn't (or doesn't break badly enough), give us a GitHub star!

Future features

There are many features we're planning for Seafowl. Where appropriate, we'll also aim to upstream these changes into DataFusion itself.

JSON support

Seafowl currently supports a limited subset of PostgreSQL operators and functions. The most notable missing feature is the JSON data type for storage of arbitrary semi-structured objects, as well as helper functions like json_extract_path / json_agg.

See the GitHub issue for more details.

Live querying

You can currently query Parquet/CSV files over HTTP from Seafowl, but you can't query any other databases "live", like you could with Splitgraph. It's always been difficult for us to write PostgreSQL foreign data wrappers in C, since we'd have to do a lot of work to correctly push down various parts of the PostgreSQL query tree. With the extensibility of DataFusion and the memory safety of Rust, we expect manipulating the query plan to be much easier.

See the GitHub issue for more details.

Advanced PostgreSQL client support

Splitgraph has always been mostly PostgreSQL-first, with great support for PostgreSQL clients (including ones that run complex introspection queries, like Metabase).

With Seafowl, we flipped this upside down: it's HTTP-first, with basic PostgreSQL wire protocol support as an afterthought. Perhaps it would be interesting to polish that more, add more information_schema tables and make Seafowl queryable by existing PostgreSQL clients, dashboarding and BI tools.

This would also open up more data sources, for example, using Airbyte to load data into Seafowl.

...Seafowl Cloud?

No open-source project announcement nowadays is complete without teasing a managed version of it.

In our case, we already have Splitgraph and we're currently working on a better UI for it, one that moves us from obscure concepts of "data images" and "repositories" to more familiar ones like "tables" and "databases". Over time, we're also planning on migrating Splitgraph's query execution engine from PostgreSQL to Seafowl.

With that in mind, we could provide a management interface for your Seafowl instance or, later on, a true multi-tenant pay-for-what-you-use analytical query execution API, powered by Seafowl.

In the words of a certain company, "we're just getting started".

(Ab)using CDNs for SQL queries