Introduction

Splitfiles are similar to Dockerfiles: each command produces a new commit with a deterministic hash that depends on the current hash and the particulars of a command that's being executed.

Preprocessing

Splitgraph does some quality-of-life preprocessing to the file before interpreting it, so that:

  • Newlines can be escaped to make a command multiline ("\n" gets replaced with "")
  • Parameters are supported (${PARAM} is replaced with the value of the parameter that's either passed to the execute_commands as a dict or to the commandline sgr build as a series of arguments (-a key1 val1 -a key2 val2...)).

Supported commands

The Splitfile executor currently supports the following commands:

  • IMPORT: import data from other Splitgraph images or Postgres schemata (including FDW mounts)
  • SQL: run SQL statements referencing data in the current image or other Splitgraph images.
  • FROM: derive images from other images or perform multistage builds.

It is also possible to add custom commands to the Splitfile executor. They also follow the same caching rules but are currently not supported by provenance tracking.

Note on Splitfile safety

The Splitfile executor validates SQL in IMPORT and SQL commands before passing it to PostgreSQL for execution. This filters out most PostgreSQL syntax constructs that Splitfiles cannot use or references to system tables. However, this shouldn't be relied on for security. In particular, Splitfile validation is currently not available on Windows systems.

Always check Splitfiles before running them and check provenance of datasets you pulled from the Internet (with sgr provenance --full) before running sgr rebuild, as rebuilding runs SQL from the image's metadata, which could be arbitrary and malicious.

Repository lookups

Currently, a repository name is resolved as follows:

  • See if it exists locally. If it does, try to pull it (to update) and use it for FROM/IMPORT commands.
  • If not, see if it's specified in the SG_REPO_LOOKUP_OVERRIDE parameter which has the format repo_1:user:pwd@host:port/db,repo_2:user:pwd@host:port/db.... Return the matching connection string directly without testing to see that the repository exists there.
  • If not, scan the SG_REPO_LOOKUP parameter which has the format user:pwd@host:port/db,user:pwd@host:port/db..., stopping at the first remote that has it.