Tracking dataset provenance

There's one other advantage to using Splitfiles instead of building images manually: provenance tracking. When you create an image with a Splitfile, the metadata of the image includes the specifics of the actual command that created it.

This means that we can output a list of images and repositories that were used in an image's creation.


$ sgr provenance example/output:latest

example/output:4fd86d48715d2a3e250810fbdaa43d21ed0a894cd00de9bc118d9191e6294741 depends on:

Since the Splitfile in the previous section imported tables from the two generated repositories, Splitgraph considers these repositories to be dependencies of example/output. (Note that you can still check out and use the example/output repository without those two repositories present).

We can also run the dependency query backwards, to find the downstream dependents of a given image:

$ sgr dependents example/repo_1:latest

example/repo_1:103cb2da2da000f3dce9512e3cc67434d7a3c977c0df259411c73a2687244706 is depended on by:

We can also use this provenance data to reconstruct a Splitfile that will rebuild the newly created image, even without access to the original Splitfile itself:

$ sgr provenance example/output:latest --full

# Splitfile commands used to recreate example/output:4fd86d48715d2a3e250810fbdaa43d21ed0a894cd00de9bc118d9191e6294741
FROM example/repo_1:103cb2da2da000f3dce9512e3cc67434d7a3c977c0df259411c73a2687244706 IMPORT demo AS table_1
FROM example/repo_2:97900a8279627ec22b355d1211dd824f78f5840b6a0e30e1d285f4e5c6120016 IMPORT demo AS table_2
  AS SELECT table_1.key
          , table_1.value AS value_1
          , table_2.value AS value_2
     FROM table_1
          INNER JOIN table_2 ON table_1.key = table_2.key}

Note this is a more powerful feature than what Dockerfiles offer (in Docker, the final image is unaware of upstream build contexts). With Splitgraph, the provenance information encodes everything necessary to rebuild an image. This means that we can rebuild the example/output:latest image against a different version of any of its dependencies by simply substituting a different image hash into this regenerated Splitfile:

$ sgr rebuild example/output:latest --against example/repo_2:new_data

Rerunning example/output:4fd86d48715d2a3e250810fbdaa43d21ed0a894cd00de9bc118d9191e6294741 against:

Step 1/3 : FROM example/repo_1:103cb2da2da000f3dce9512e3cc67434d7a3c...
Resolving repository example/repo_1
 ---> Using cache
 ---> 27aeabd37f2c

Step 2/3 : FROM example/repo_2:new_data IMPORT demo AS table_2
Resolving repository example/repo_2
Importing 1 table from example/repo_2:48407a3cef8a into example/output
 ---> ca85929a90e7

Step 3/3 : SQL {CREATE TABLE result  AS SELECT table_1.key        ...
Executing SQL...
Committing example/output...
Processing table result
Computing table partitions
Indexing the partition key
Storing and indexing the table
100%|█████████████████| 1/1 [00:00<00:00,  6.33objs/s]
 ---> 46b6a361bebd

Since this example uses the same image from example/repo_1, Splitgraph is able to cache the first step of the rebuild. But since we altered the example/repo_2 image (by specifying --against example/repo_2:new_data), Splitgraph invalidates the subsequent layers and reruns their build steps. You can think of this invalidation in the same way as Dockerfile cache invalidation.

Let's examine the new result:

$ sgr sql --schema example/output "SELECT * FROM result ORDER BY key"

2       d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35        d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35_UPDATED
3       4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce        4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce_UPDATED
4       4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a        4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a
5       ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d        ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d
6       e7f6c011776e8db7cd330b54174fd76f7d0216b612387a5ffcfb81e6f0919683        e7f6c011776e8db7cd330b54174fd76f7d0216b612387a5ffcfb81e6f0919683
7       7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451        7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451
8       2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3        2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3
9       19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7        19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7

Since the first two rows are now missing from example/repo_2, the JOIN result doesn't have them either.

The Splitfile executor still caches images in this case, so if we do a rebuild against the original version of example/repo_1, Splitgraph will check out the first version of example/output from the cache:

$ sgr rebuild example/output:latest --against example/repo_2:original_data

Rerunning example/output:46b6a361bebd3c16ea23030f91645f683487690b8c695f4585c1b01238ec1d1d against:

Step 1/3 : FROM example/repo_1:103cb2da2da000f3dce9512e3cc67434d7a3c...
Resolving repository example/repo_1
 ---> Using cache
 ---> 27aeabd37f2c

Step 2/3 : FROM example/repo_2:original_data IMPORT demo AS table_2
Resolving repository example/repo_2
 ---> Using cache
 ---> ee4dbe8a8167

Step 3/3 : SQL {CREATE TABLE result  AS SELECT table_1.key        ...
 ---> Using cache
 ---> 4fd86d48715d