Inspecting images

Inspect image metadata using the sgr show command:

$ sgr show namespace/repository:image_hash

Inspect tables in a given image using the sgr table command:

$ sgr table namespace/repository:image_hash table_name

Inspect object metadata with the sgr object command:

$ sgr object object_id

These commands require you to clone the repository (but not its actual data). It's also possible to inspect the metadata on a remote repository without cloning any images by passing the -r or --remote flag to these commands:

$ sgr show -r splitgraph/geonames:latest
Image splitgraph/geonames:0b77a102cbabe2de8597dcab71f1c333f16f9e1963c9c8982988e10e65989c62

Created at 2019-12-11T12:20:08.447974
Size: 539.95 MiB
Parent: 0000000000000000000000000000000000000000000000000000000000000000



Let's further inspect the object that the first example image, example/repo_1, used.

$ sgr object o26c6d8345cba276f807d7bcf906531568f309c2609a3420d98c01a6c99b166

Object ID: o26c6d8345cba276f807d7bcf906531568f309c2609a3420d98c01a6c99b166

Namespace: example
Format: FRAG
Size: 963.00 B
Created: 2020-04-06 10:17:15.650187
Rows inserted: 10
Insertion hash: 37be08d7c1aca256f5e64860afe7db21e6f7c47b00cdf23f65b84576bc209f41
Rows deleted: 0
Deletion hash: 0000000000000000000000000000000000000000000000000000000000000000
Column index:
  key: [0, 9]
  value: ['19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7', 'ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d']

There are a few interesting things in this object's metadata. Firstly, it's stored in the FRAG format. A Splitgraph table consists of multiple fragments that can partially overwrite each other.

Every fragment also has a column index: this is a metadata entry showing, for every column, the minimum and maximum values that this fragment affects (deletes or inserts). This is used when querying large remote Splitgraph tables to not download fragments that definitely don't match a given query. For qualifiers on sorted columns (like the primary key), this can mean only downloading one or two fragments to satisfy a query to a table composed of hundreds.

Finally, every fragment also has an insertion and a deletion hash: the insertion hash is the hash of all the rows that this fragment inserts and the deletion hash is the hash of all the rows that this fragment deletes. These hashes are homomorphic (every row is hashed individually and then the hashes are summed up) and are similar to Facebook's LtHash. They have a really important property: the sum of all hashes of individual fragments composing a table is equal to the content hash of a whole table. This is used in deduplicating data and optimizing storage.

Splitgraph generates the deterministic object ID by combining the content hash (insertion - deletion hash) and the hash of the schema of a given object. This means that Splitgraph fragments are content-addressable.

Now, let's look at the second object that the new version of the demo table in example/repo_2:new_data is linked to.

$ sgr object o4882c12cdc4cb5b4e89481ebdb71585998633f19a72652debbb80980ae0f0b

Object ID: o4882c12cdc4cb5b4e89481ebdb71585998633f19a72652debbb80980ae0f0b

Namespace: example
Format: FRAG
Size: 552.00 B
Created: 2020-04-06 10:18:11.466667
Rows inserted: 4
Insertion hash: aa15011c4af8e7afb877b5de68ab782b69d63720ba5787bec54fbcd6838fb377
Rows deleted: 4
Deletion hash: 7463a64e15187558e6e1096368033dfab422b34503c20e94a83b7a3e4650df68
Column index:
  key: [0, 11]
  value: ['4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5', 'd4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35_UPDATED']

Location: created locally

As you can see, this new object is based on the object that was used by the very first version of the data we generated (o4882c12cdc4cb5b4e89481ebdb71585998633f19a72652debbb80980ae0f0b). This is because the demo data is static and so the first version of the demo table in example/repo_1 and example/repo_2 is in fact the same.

Since this object updates and deletes some rows, their values are included in its deletion hash, which is now non-zero.

Finally, the column index now spans all the values of the key column that were replaced or added. Hence, when there's a query against this table that is inside that range, both fragments will be fetched to check if they match that query.

We can also inspect the actual object: every object in the Splitgraph cache is stored as a CStore columnar storage file.

$ sgr sql "SELECT * FROM splitgraph_meta.o4882c12cdc4cb5b4e89481ebdb71585998633f19a72652debbb80980ae0f0b"

2       d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35_UPDATED        True
3       4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce_UPDATED        True
10      4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5        True
11      4fc82b26aecb47d2868c4efbe3581732a3e7cbcc6c2efb32062c08170a05eeb8        True
0       None    False
1       None    False

Splitgraph stores the object with the same schema as the original table, plus a Boolean flag showing whether a row has been updated/inserted (upserted) or deleted. For deleted rows, Splitgraph only records the primary key, and pads the rest of the table with NULLs.

Splitgraph does not always store objects in the engine. When you push an image to an external location, like Splitgraph Cloud, Splitgraph usually uploads them to an S3-compatible storage location. Splitgraph only downloads the actual objects and loads them into Postgres when you check out an image containing the objects.