Layered querying

Splitgraph tries to not redo work that the database can do more efficiently. You can check out a Splitgraph image into native PostgreSQL tables, so you can create necessary indexes to maximize performance for specific queries.

However, this workflow isn't suitable for exploratory queries on large datasets where checking out an image takes too long. This is why Splitgraph supports querying images directly without materializing them into PostgreSQL tables.

This is done by so-called layered querying. Instead of loading an image and its data into PostgreSQL tables, Splitgraph "checks out" an image into a schema consisting of foreign tables. They still present themselves as normal tables to any application interacting with the engine, but behind the scenes, the engine-side of the Splitgraph library interprets them. In particular, this lets Splitgraph inspect object metadata and determine which objects it needs to download to satisfy the query on the fly.

Layered querying takes its name from the way Docker works, where an application running inside of a container thinks that it's interacting with a root filesystem, but is actually interacting with an OverlayFS mount that directs reads to the relevant layer.

Splitgraph uses a fork of Multicorn, an extension for writing foreign data wrappers (FDWs) in Python. This fork includes changes so that no data actually passes through the Python runtime. Splitgraph downloads required objects and materializes required table regions. The FDW performs scans using native C code, which means layered querying competes with and sometimes beats PostgreSQL on query performance.

Layered querying is currently read-only. The Splitfile executor uses layered querying to satisfy IMPORT and SQL commands that read from other Splitgraph images.

Layered checkouts

To use layered querying, check out a pulled dataset with the --layered flag. For example:

$ sgr checkout --layered splitgraph/2016_election:latest

You do not need to change any application that queries this schema or download any objects: Splitgraph will do it lazily when a client queries the table.

When to use layered querying

Depending on the use case, the dataset and the query, layered querying can be faster or slower than querying a native PostgreSQL table.

Layered querying (sgr checkout --layered) should work better for your use case when:

  • Your queries are read-only. Layered querying doesn't support writes.
  • You are querying a dataset for which you only have cloned some metadata and your query is only going to need a small part of the dataset. For example, if you're filtering on a single value for the primary key or have an equality filter on a column that has bloom filtering enabled, Splitgraph will only download a few objects to satisfy the query. You can check the amount of data that a query will download by running it with EXPLAIN first.
  • You have limited disk space. cstore_fdw has a massive space advantage over PostgreSQL tables because it organizes data into columns, which lends itself well to compression. This also means that full table scans are going to be faster and have a smaller IO load in Splitgraph.

You should instead check an image out into a PostgreSQL schema (sgr checkout) if:

  • You want to modify it.
  • You want to query images that include a lot of upstream changes. Instead of forwarding queries to objects directly, layered querying materializes subsets of objects that overlap into a temporary table, the overhead of which might be bigger than savings from scanning through less data. You can also consider rechunking the image to make it easier to query.
  • You want to extract maximum read performance from the image by using PostgreSQL indexes. After you check out the image, it becomes a set of normal PostgreSQL tables that you can create any index on.
  • You are intending to run multiple small queries against the dataset. Splitgraph's query planning has a considerable latency overhead over PostgreSQL's and so this kind of workload will be more expensive than with native PostgreSQL tables.

You can find some Jupyter notebooks with benchmarks comparing layered querying and querying native PostgreSQL tables on GitHub.