Metadata vs data

When you clone an image, or the Splitfile executor encounters one you referenced in a Splitfile, Splitgraph only downloads the metadata for the repository (or the referenced image within the repository). The metadata consists of all Splitgraph state tables in splitgraph_meta: images, objects, tags and tables. Compared to the actual data, it's very lightweight, most of its footprint being taken up by the object index.

When you check out an image, Splitgraph downloads the actual data, which is the set of objects required by that image. When you query a table from a layered checkout, Splitgraph only downloads objects that are required to satisfy that query, as determined by the object index.

Splitgraph considers objects downloaded from a remote engine to be "cached," and can evict them if the free cache space is limited. You can control the cache size by the SG_OBJECT_CACHE_SIZE variable in .sgconfig. The default value is 10240 (10GB).

With layered querying, you can see how much data a query will require before you execute it, by running EXPLAIN on the query. For example, in the 2016 US Presidential Election precinct-level returns dataset:

$ sgr sql -s splitgraph/2016_election "EXPLAIN SELECT candidate_normalized, SUM(votes) FROM precinct_results WHERE county_fips=11001 GROUP BY candidate_normalized"

GroupAggregate  (cost=71991481.18..71992900.45 rows=1 width=64)
  Group Key: candidate_normalized
  ->  Sort  (cost=71991481.18..71991954.27 rows=189234 width=380)
        Sort Key: candidate_normalized
        ->  Foreign Scan on precinct_results  (cost=20.00..71908920.00 rows=189234 width=380)
              Filter: (county_fips = 11001)
              Multicorn: Objects removed by filter: 18
              Multicorn: Scan through 2 object(s) (2.55 MiB)
  Functions: 7

This query will need to scan (and download, if it's not already locally cached) through 2.5 MiB of data.