Splitgraph has been acquired by EDB! Read the blog post.

Five minute demo

Introduction

This demo will show you how to use sgr to:

  • query a large remote public data image (dataset) with any of your favorite tools, without having to download it completely
  • write a Splitfile that can build a derivative image from it
  • publish your derivative image on Splitgraph.

Prerequisites

You need to register on Splitgraph to pull and publish data. If you followed the installation instructions, you already did this. You can also use sgr cloud register or sgr cloud login to register or log in from the sgr client.

If you do not wish to register on Splitgraph, you can still use sgr in a decentralized fashion, sharing data with other sgr instances. See the decentralized demo for an introduction that showcases importing data from CSV files and tracking changes to images, using sgr's version control functionality.

First, make sure that your sgr engine is initialized and test your connection to Splitgraph. This demo is written from the point of view of the splitgraph-demo user, so replace splitgraph-demo with your own username in command invocations.

$ sgr init
Initializing engine PostgresEngine LOCAL (sgr@localhost:5432/splitgraph)...
Waiting for connection......
Database splitgraph already exists, skipping
Ensuring the metadata schema at splitgraph_meta exists...
Running splitgraph_meta--0.0.1.sql
Running splitgraph_meta--0.0.1--0.0.2.sql
Running splitgraph_meta--0.0.2--0.0.3.sql
Installing Splitgraph API functions...
Installing CStore management functions...
Installing the audit trigger...
Engine PostgresEngine LOCAL (sgr@localhost:5432/splitgraph) initialized.


$ sgr sql -r data.splitgraph.com "SELECT splitgraph_api.get_current_username()"
splitgraph-demo

Pulling a data image

The 2016 US Presidential Election precinct-level returns dataset (source) has 2 million rows, takes up 500MB as a CSV file and only 25MB in Splitgraph. It's a perfect example to showcase sgr's features.

Let's inspect this image remotely without cloning it, using sgr show. The sgr client uses the same protocol to communicate with the registry as with the local sgr engine, so you can use the same commands for both local and remote operations.

$ sgr show -r data.splitgraph.com splitgraph/2016_election:latest
Image splitgraph/2016_election:3835145ada3f07cad99087d1b1071122d58c48783cbfe4694c101d35651fba90

Created at 2019-10-10T15:51:41.122370
Size: 26.81 MiB
No parent (root image)

Tables:
  precinct_results

Let's also inspect the precinct_results table in this image. sgr stores tables as a composition of multiple content-addressable, immutable objects.

$ sgr table -r data.splitgraph.com splitgraph/2016_election:latest precinct_results
Table splitgraph/2016_election:3835145ada3f07cad99087d1b1071122d58c48783cbfe4694c101d35651fba90/precinct_results

Size: 26.81 MiB
Rows: 1989234
Columns:
  year (integer)
  stage (character varying)
  special (boolean)
  state (character varying)
  state_postal (character varying)
  state_fips (bigint)
  state_icpsr (bigint)
  county_name (character varying)
  county_fips (bigint)
  county_ansi (bigint)
  county_lat (numeric)
  county_long (numeric)
  jurisdiction (character varying)
  precinct (character varying)
  candidate (character varying)
  candidate_normalized (character varying)
  office (character varying)
  district (character varying)
  writein (boolean)
  party (character varying)
  mode (character varying)
  votes (bigint)
  candidate_opensecrets (character varying)
  candidate_wikidata (character varying)
  candidate_party (character varying)
  candidate_last (character varying)
  candidate_first (character varying)
  candidate_middle (character varying)
  candidate_full (character varying)
  candidate_suffix (character varying)
  candidate_nickname (character varying)
  candidate_fec (character varying)
  candidate_fec_name (character varying)
  candidate_google (character varying)
  candidate_govtrack (character varying)
  candidate_icpsr (character varying)
  candidate_maplight (character varying)
  id (integer, PK)

Objects:
  o960395135f32c2da7b4b9371da06b0a9d00925472df2ec1ef687acf6f24894
  o07f95f9944d673d8cc05a5b5174253adc513d377ccd811928ae0fe6923e451
  o40e4e46594064e402ccdb914ffb8c9f59267a67b33420cf704073729a6671b
  o1ccf32547f73be8047c3de06760b32c1077c3bd0b0d8309575ebe9ab8eb86c
  oaf3cca3210e1903f06872d0041097dd284423958352d5cd7b5209eceec6cbf
  odc1546c25cac950eca4b551642305d548172b554643f046a16cd9da236c895
  oe60d81ef239980ceb5869fd787d159f227b59b66ab6ceef18aed45cd62ae8f
  o9e77e782f8bda76be6a3f4d29b82661e32d614f7d63230619e8635b9b2b512
  of06f6cfb7bd95047bb49bcd3ab6b98da63c2605aa9659c064cee3bd708b0e4
  o0671aabeb9d4ce362515624616a5a202503db99d7a83007bfdf63fd71b91af
  ...

Now let's actually clone the image. By default, this will only download the image's metadata. sgr will not transfer any actual data until it's required, unless you explicitly ask for it.

$ sgr clone splitgraph/2016_election
Gathering remote metadata...
Fetched metadata for 1 image, 1 table, 20 objects and 1 tag.

Querying the image

To interact with this image, you could check it out, which turns the image into a schema with PostgreSQL tables.

However, imagine if this dataset were a few hundred gigabytes large. Should you really need to download all of it to query a few rows?

Splitgraph objects contain metadata that can help it to discard objects that are irrelevant to a query. The US election dataset is stored ordered by county FIPS code and so queries that filter on this column only need to download a couple of objects instead of the whole dataset.

Let's check out the image into a so called "layered querying" schema. This sets up a Postgres foreign data wrapper for this image that acts as a shim and presents itself as normal Postgres table to clients.

$ sgr checkout --layered splitgraph/2016_election:latest
Checked out splitgraph/2016_election:3835145ada3f.

We're interested in precinct-level election results for District of Columbia (FIPS code 11001).

$ sgr sql -s splitgraph/2016_election "EXPLAIN SELECT \
  candidate_normalized, SUM(votes) FROM precinct_results \
  WHERE county_fips=11001 GROUP BY candidate_normalized"
GroupAggregate  (cost=5698135.18..5699554.45 rows=1 width=64)
Group Key: candidate_normalized
->  Sort  (cost=5698135.18..5698608.27 rows=189234 width=30)
Sort Key: candidate_normalized
->  Foreign Scan on precinct_results  (cost=20.00..5677020.00 rows=189234 width=30)
Filter: (county_fips = 11001)
Multicorn: Objects removed by filter: 18
Multicorn: Scan through 2 object(s) (2.56 MiB)
JIT:
Functions: 7
...

Here, we can see that this query will only download two objects. Let's actually run it: behind the scenes, this will lazily download required data.

$ sgr sql -s splitgraph/2016_election "SELECT \
  candidate_normalized, SUM(votes) FROM precinct_results \
  WHERE county_fips=11001 GROUP BY candidate_normalized"
clinton  282830
in         6551
johnson    4906
stein      4258
trump     12723

sgr sql is a shorthand to run an SQL query against the current engine. However, any PostgreSQL client will work with sgr, allowing you to work with sgr using your favorite tools, including DataGrip, pgAdmin or other clients like pgcli or DBeaver.

You can use sgr config -n to print out a libpq connection string to the current engine.

$ sgr config -n
postgresql://sgr:supersecure@localhost:5432/splitgraph

Using Splitfiles to build derivative images

Splitfiles are similar to Dockerfiles and are the easiest way to build Splitgraph images. They offer Dockerfile-like caching, provenance tracking, fast rebuilds, joins between datasets and full SQL support.

Let's create a Splitfile that summarizes the vote counts for every candidate in every state.

$ cat <<EOF > votes_by_state.splitfile
FROM splitgraph/2016_election IMPORT {
  SELECT candidate_normalized, state_postal,
    SUM(votes) AS total_votes
  FROM precinct_results
  GROUP BY candidate_normalized,
    state_postal
} AS votes_by_state
EOF

Build this image. Even though we have it pulled locally, this isn't necessary to build it: sgr downloads required image fragments automatically.

$ sgr build votes_by_state.splitfile
Executing Splitfile votes_by_state.splitfile with arguments {}

Step 1/1 : FROM splitgraph/2016_election IMPORT {  SELECT candidate...
Resolving repository splitgraph/2016_election
Gathering remote metadata...
No image/object metadata to pull.
Importing 1 table from splitgraph/2016_election:3835145ada3f into votes_by_state
Processing table sg_tmp_32eba453c2f1dd1a13ca58ae3b901a47
 ---> ee8dc4bb7c47
Successfully built votes_by_state:ee8dc4bb7c47.

Take a look at the new image and run a query against it.

$ sgr table votes_by_state:latest votes_by_state
Table votes_by_state:ee8dc4bb7c4766cc30593b77e0b6b3dbb2863c06a4ed2274127d558901abeb2f/votes_by_state

Size: 6.38 KiB
Rows: 793
Columns:
  candidate_normalized (character varying)
  state_postal (character varying)
  total_votes (numeric)

Objects:
  o91e4939e520a19c02ca876c7c5248dcf96f9c3fa55552c6ed8359caea842e9


$ sgr sql -s votes_by_state "SELECT * FROM votes_by_state LIMIT 10"
FL  72993
         IN    603
         MD  22131
         ND   6397
         PA  39004
         RI   9439
         WY   6904
above    NY      1
abraham  NY      0
adams    MD     44

sgr caches the outputs of Splitfiles, in the same way that Docker caches Dockerfile layers. It does not rebuild the image if the source data has not changed.

$ sgr build votes_by_state.splitfile
Executing Splitfile votes_by_state.splitfile with arguments {}

Step 1/1 : FROM splitgraph/2016_election IMPORT {  SELECT candidate...
Resolving repository splitgraph/2016_election
Gathering remote metadata...
No image/object metadata to pull.
 ---> Using cache
 ---> ee8dc4bb7c47
Successfully built votes_by_state:ee8dc4bb7c47.

Splitfiles also support joins between Splitgraph images. For an example, or to learn more about Splitfiles, check out the Splitfile reference or some sample Splitfiles.

Publishing a dataset on Splitgraph

Let's push your image to the Splitgraph registry, making it available publicly. By default, sgr pushes the image to a repository with the same name in your own namespace. In the case of this demo user, that means sgr pushes the freshly built dataset to splitgraph-demo/votes_by_state.

$ sgr push votes_by_state
Pushing votes_by_state to splitgraph-demo/votes_by_state on remote data.splitgraph.com
Gathering remote metadata...
No objects to upload.
Uploaded metadata for 2 images, 1 table, 0 objects and 0 tags.
Setting upstream for votes_by_state to splitgraph-demo/votes_by_state.


$ sgr show -r data.splitgraph.com splitgraph-demo/votes_by_state:latest
Image splitgraph-demo/votes_by_state:ee8dc4bb7c4766cc30593b77e0b6b3dbb2863c06a4ed2274127d558901abeb2f
Importing 1 table from splitgraph/2016_election
Created at 2020-05-22T18:07:44.828748
Size: 6.38 KiB
Parent: 0000000000000000000000000000000000000000000000000000000000000000

Tables:
  votes_by_state

sgr lets you upload a README and a description for your dataset directly from the CLI. Create a README and a metadata file.

$ cat <<EOF > dataset-readme.md
## votes_by_state

This dataset summarizes votes by state in the 2016 US Presidential Election.

This is also my first dataset (woohoo!)
EOF


$ cat <<EOF > splitgraph.yml
readme: dataset-readme.md
description: US 2016 Election votes summary by state.
EOF

Upload the metadata to Splitgraph (replace splitgraph-demo with your username).

$ sgr cloud metadata splitgraph-demo/votes_by_state
README updated for repository splitgraph-demo/votes_by_state.
Description updated for repository splitgraph-demo/votes_by_state.

The dataset is now available on https://www.splitgraph.com/YOUR_USERNAME/votes_by_state.

sgr also uploads the image's provenance to Splitgraph, allowing anyone to see how the image was built and sourced. In addition, the 2016 Election dataset dependents page will now list the new image as a dependent.

You also have an OpenAPI-compatible REST API generated for the dataset, provided by PostgREST.

Querying the dataset through the Splitgraph Data Delivery Network

Splitgraph Data Delivery Network is a public SQL endpoint that lets any client query any dataset that is hosted on or proxied by Splitgraph. You can run SQL against your data immediately with sgr cloud sql:

$ sgr cloud sql 'SELECT * FROM "YOUR_USERNAME/votes_by_state".votes_by_state'

You can also run sgr cloud sql with no parameters to get a libpq connection string that you can pass to other PostgreSQL clients. For example:

$ pgcli $(sgr cloud sql)

Further reading