Chunking data

Usually, sgr splits large tables into multiple objects. At commit time, it chunks new tables in an image into fragments of 10000 rows by default. You can change this by passing --chunk-size to sgr commit or by setting the SG_COMMIT_CHUNK_SIZE configuration flag/environment variable.
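
For example, assuming a hypothetical repository my/repo, either of these commits new tables in fragments of 50000 rows (the flag and the environment variable are the ones described above):

$ sgr commit my/repo --chunk-size 50000
$ SG_COMMIT_CHUNK_SIZE=50000 sgr commit my/repo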

By default, sgr chunks by primary key. You can also sort data inside a chunk by passing --chunk-sort-keys to sgr commit. This lets cstore_fdw, sgr's storage backend, use its skip lists to avoid reading parts of a chunk that are irrelevant to a query, decreasing disk access.
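
As a sketch, assuming --chunk-sort-keys accepts a JSON mapping of table names to sort columns (check sgr commit --help for the exact format) and a hypothetical events table with a timestamp column:

$ sgr commit my/repo --chunk-sort-keys '{"events": ["timestamp"]}'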

When you commit a change to a table, sgr represents that change as multiple objects that contain row-level additions/deletions and can partially "overlap" with old objects. sgr appends these new objects to the end of the table's object list.

You can override this behavior with --snap, so that sgr will repackage all changed tables as brand new objects.

You can retain existing chunk boundaries when committing new data by passing --split-changesets to sgr commit. In this case, sgr will split delta-compressed changes along the original table chunk boundaries. For example, if there's a change to the first and the 20000th row of a table that was originally committed with --chunk-size=10000, this will create two fragments: one based on the first chunk of the table and one based on the second.
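
For example, to commit pending changes to a hypothetical repository while preserving its original chunk boundaries:

$ sgr commit my/repo --split-changesets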

Optimal object sizes

There is a performance and space tradeoff between the number of rows and the physical size of each object (chunk), and hence the number of objects that a single table is split into.

  • Every extra object in a table carries several kinds of overhead:
    • Fixed storage overhead in sgr's metadata (the splitgraph_meta.objects table; see the query sketch after this list), including the hashes and the index for that object.
    • Overhead of scanning through the object index when planning a layered query.
    • A small overhead for PostgreSQL to initialize a scan through each object.
    • Fixed overhead when checking out the table (more deltas have to be applied to the staging area).
    • Latency overhead of downloading many small objects (each of which is a file) compared to a few large ones.
  • Datasets with lots of small objects can take better advantage of the object index: in layered querying, sgr might need to download or scan through a smaller fraction of the table to satisfy the query, since the table is more granular.
  • Tables that share a lot of data with other tables have a better chance of reusing the same fragments when stored as smaller objects.
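
To get a feel for how a table's storage is spread across objects, you can inspect sgr's metadata directly. This is a sketch that assumes the splitgraph_meta.objects table exposes a size column (consult the schema on your engine for the exact layout):

$ sgr sql "SELECT object_id, size FROM splitgraph_meta.objects ORDER BY size DESC LIMIT 10"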

In practice, we have found that targeting object sizes of a few megabytes delivers a good balance. It is not currently possible to get sgr to automatically target a certain object size, but you can approximate the required number of rows per object from the total image size and the number of rows in all of its tables.
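
As a worked example with made-up numbers: if an image stores 2 GB of data across 100 million rows, an average row takes about 20 bytes, so targeting roughly 5 MB per object works out to 5 MB / 20 B ≈ 250000 rows per chunk:

$ sgr commit my/repo --chunk-size 250000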

Rechunking images

An image that includes a lot of changes still holds all of its history, since it contains all the intermediate objects together with deltas. This lets sgr perform quicker commits (a change is simply appended to the end of a table's list of objects) at the expense of the image bloating up and accumulating dead rows. This is sometimes undesirable. If you have multiple images in a repository that show a dataset's history, you might want to optimize the latest image in that repository to decrease its size and speed up querying. In addition, you might want to delete sensitive data from an image so that it doesn't appear in the history at all.

To rechunk an image, splitting every table in it back into disjoint objects, check it out and recommit it as a snapshot, adding any necessary indexes. For example:

$ sgr checkout my/big_image:latest
$ sgr commit my/big_image --snap --chunk-size 100000

To clean up "loose" or "dangling" objects that are not linked to any tables, you can use the sgr cleanup command.
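
For example, a minimal invocation (see sgr cleanup --help for options) deletes any objects no longer linked to a table:

$ sgr cleanup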