Metadata vs data
When you clone an image, or the Splitfile executor encounters one you referenced
in a Splitfile, sgr only downloads the metadata for the repository (or the
referenced image within the repository). The metadata consists of state tables.
Compared to the actual data, it's very lightweight, with most of its footprint
taken up by the object index.

When you check out an image, sgr downloads the actual data, which is the set
of objects required by that image. When you query a table through layered
querying instead, sgr only downloads the objects that are required to satisfy
that query, as determined by the object index.
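
To illustrate how an index can rule out objects before anything is downloaded, here is a minimal Python sketch. It assumes the index records a minimum and maximum value of a column for each object. The ObjectSummary type, the object IDs and the value ranges are all hypothetical; this is not sgr's actual data structure or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectSummary:
    """Hypothetical index entry: the value range of one column in one object."""
    object_id: str
    min_value: int
    max_value: int


def objects_to_fetch(index, wanted):
    """Return IDs of objects whose [min, max] range could contain `wanted`."""
    return [o.object_id for o in index if o.min_value <= wanted <= o.max_value]


# Made-up index over a county_fips column (values are illustrative only).
index = [
    ObjectSummary("obj_a", 1001, 9999),
    ObjectSummary("obj_b", 10001, 12999),
    ObjectSummary("obj_c", 13001, 56045),
]

# Only objects whose range covers the filter value need to be downloaded.
needed = objects_to_fetch(index, 11001)
print(needed)
```

In this toy model, a filter on a single value skips every object whose range cannot contain it, so only the remaining objects would have to be fetched.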

sgr considers objects downloaded from a remote engine to be "cached" and can
evict them if free cache space is limited. You can control the cache size by
setting the SG_OBJECT_CACHE_SIZE variable in .sgconfig. The default value is
10240.
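
As a rough sketch of how this setting could be read from a script, the Python snippet below parses INI-style configuration text and falls back to the default of 10240 when the variable is absent. The [defaults] section name and the sample value 20480 are assumptions made for illustration, and object_cache_size is a hypothetical helper, not part of sgr.

```python
import configparser

DEFAULT_CACHE_SIZE = 10240  # default value stated in the text above


def object_cache_size(config_text):
    """Read SG_OBJECT_CACHE_SIZE from INI-style text, else return the default."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(config_text)
    # The "defaults" section name is an assumption about the file layout.
    return parser.getint(
        "defaults", "SG_OBJECT_CACHE_SIZE", fallback=DEFAULT_CACHE_SIZE
    )


# Sample configuration text (the value is made up for this example).
sample = "[defaults]\nSG_OBJECT_CACHE_SIZE=20480\n"

configured = object_cache_size(sample)
unset = object_cache_size("[defaults]\n")
print(configured, unset)
```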

With layered querying, you can see how much data a query will require before you
execute it by running EXPLAIN on the query. For example, in the
2016 US Presidential Election precinct-level returns dataset:

    $ sgr sql -s splitgraph/2016_election "EXPLAIN SELECT candidate_normalized, SUM(votes) FROM precinct_results WHERE county_fips=11001 GROUP BY candidate_normalized"

    GroupAggregate  (cost=71991481.18..71992900.45 rows=1 width=64)
      Group Key: candidate_normalized
      ->  Sort  (cost=71991481.18..71991954.27 rows=189234 width=380)
            Sort Key: candidate_normalized
            ->  Foreign Scan on precinct_results  (cost=20.00..71908920.00 rows=189234 width=380)
                  Filter: (county_fips = 11001)
                  Multicorn: Objects removed by filter: 18
                  Multicorn: Scan through 2 object(s) (2.55 MiB)
    JIT:
      Functions: 7
      ...

This query will need to scan through 2 objects totaling 2.55 MiB of data (and
download them, if they aren't already cached locally).
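
If you want to check these figures from a script, the two Multicorn lines in the plan can be pulled out with regular expressions. The sketch below assumes the plan text has the same shape as the output in this section; summarize_explain is a hypothetical helper, not part of sgr.

```python
import re

# Patterns matching the two Multicorn lines in the plan output.
SCAN_RE = re.compile(r"Scan through (\d+) object\(s\) \(([\d.]+) (\w+)\)")
REMOVED_RE = re.compile(r"Objects removed by filter: (\d+)")


def summarize_explain(plan_text):
    """Extract object counts and scan size from EXPLAIN text, if present."""
    scan = SCAN_RE.search(plan_text)
    removed = REMOVED_RE.search(plan_text)
    return {
        "objects_scanned": int(scan.group(1)) if scan else None,
        "scan_size": float(scan.group(2)) if scan else None,
        "scan_unit": scan.group(3) if scan else None,
        "objects_removed": int(removed.group(1)) if removed else None,
    }


# Excerpt of the plan from the example in this section.
plan = """\
Foreign Scan on precinct_results  (cost=20.00..71908920.00 rows=189234 width=380)
  Filter: (county_fips = 11001)
  Multicorn: Objects removed by filter: 18
  Multicorn: Scan through 2 object(s) (2.55 MiB)
"""

summary = summarize_explain(plan)
print(summary)
```

Fields come back as None when a line is missing, so the helper can be run on plans that do not involve a layered foreign scan.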