sgr commit

sgr commit [OPTIONS] REPOSITORY

Commit changes to a checked-out Splitgraph repository.

This packages up all changes into a new image. Where a table hasn't been created or had its schema changed, this will delta compress the changes. For all other tables (or if -s has been passed), this will store them as full table snapshots.

When a table is stored as a full snapshot, --chunk-size sets the maximum size, in rows, of the fragments that the table will be split into (default is no splitting). The splitting is done by the table's primary key.

If --split-changesets is passed, delta-compressed changes will also be split up according to the original table chunk boundaries. For example, if there's a change to the first and the 20000th row of a table that was originally committed with --chunk-size=10000, this will create 2 fragments: one based on the first chunk and one on the second chunk of the table.

If --chunk-sort-keys is passed, data inside the chunk is sorted by this key (or multiple keys). This helps speed up queries on those keys for storage layers than can leverage that (e.g. CStore). The expected format is JSON, e.g. {table_1: [col_1, col_2]}

--index-options expects a JSON-serialized dictionary of {table: index_type: column: index_specific_kwargs}. Indexes are used to narrow down the amount of chunks to scan through when running a query. By default, each column has a range index (minimum and maximum values) and it's possible to add bloom filtering to speed up queries that involve equalities.

Bloom filtering allows to trade off between the space overhead of the index and the probability of a false positive (claiming that an object contains a record when it actually doesn't, leading to extra scans).

An example index-options dictionary:



{
    "table": {
        "bloom": {
            "column_1": {
                "probability": 0.01,   # Only one of probability
                "size": 10000          # or size can be specified.
            }
        },
        # Only compute the range index on these columns. By default,
        # it's computed on all columns and is always computed on the
        # primary key no matter what.
        "range": ["column_2", "column_3"]
    }
}

Options

  • -s, --snap: Do not delta compress the changes and instead store the whole table again. This consumes more space, but makes checkouts faster.
  • -c, --chunk-size INTEGER: Split new tables into chunks of this many rows (by primary key). The default value is governed by the SG_COMMIT_CHUNK_SIZE configuration parameter.
  • -k, --chunk-sort-keys JSON: Sort the data inside each chunk by this/these key(s)
  • -t, --split-changesets: Split changesets for existing tables across original chunk boundaries.
  • -i, --index-options JSON: JSON dictionary of extra indexes to calculate on the new objects.
  • -m, --message TEXT: Optional commit message
  • -o, --overwrite: Overwrite physical objects that already exist