splitgraph.yml reference

Mixins

You can optionally split your splitgraph.yml file into multiple files, similar to Docker Compose's override functionality. This lets you, for example, keep credentials separate from the repository definitions, so that you can keep them out of source control or inject them at runtime using your CI platform's secrets functionality.

To reference multiple files, pass several -f flags to sgr cloud commands that expect a splitgraph.yml file:

sgr cloud load -f splitgraph.yml -f splitgraph.credentials.yml

You can also output the full merged configuration by running sgr cloud validate.

Note that currently, each separate file has to be a self-contained valid project. This means that in some cases, you will need to repeat the same configuration for a repository in multiple files (for example, when overriding repository parameters).
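
As a rough sketch of such a split, the two files passed with -f above might look like the following. The namespace and repository names are hypothetical, the connection values are placeholders borrowed from the csv plugin example further down, and the two files are shown in one block (separated by ---) only for display. Because each file currently has to be a valid project on its own, treat the exact minimal contents of each file as an assumption and check the merged result with sgr cloud validate.

```yaml
# splitgraph.yml (checked into source control)
repositories:
  - namespace: my-namespace # hypothetical namespace
    repository: my-repository # hypothetical repository name
    external:
      plugin: csv
      credential: csv # name of the credential defined in the other file
      params:
        connection:
          connection_type: s3
          s3_endpoint: ""
          s3_bucket: ""
      tables: {} # let the plugin introspect the available tables
---
# splitgraph.credentials.yml (kept out of source control or generated in CI)
credentials:
  csv:
    plugin: csv
    data:
      s3_access_key: ""
      s3_secret_key: ""
```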

splitgraph.yml format reference

credentials

This section defines credentials that are referenced by specific data source plugins in the repositories section.

Example:

credentials:
  csv: # This is the name of this credential that "external" sections can reference.
    plugin: csv
    # Credential-specific data matching the plugin's credential schema
    data:
      s3_access_key: ""
      s3_secret_key: ""

.<credential_name>.plugin

ID of the plugin this credential is for. You can't reuse a credential from one plugin in another, but you can reuse credentials between different repositories that use the same plugin.

.<credential_name>.data

Credential-specific data. This must match the plugin's credentials JSONSchema. You can use sgr cloud stub to generate a value for this section that the plugin will accept.

repositories

This section defines a list of repositories to add/update in Splitgraph, as well as their metadata (README, topics, dataset license etc.) and data source settings (plugin, connection parameters, ingestion schedule).

repositories[*].namespace

Namespace name to set up this repository in.

repositories[*].repository

Name of the repository.

repositories[*].external

Defines configuration for an "external", that is, the external data source settings for a given repository. This section is optional.

.credential_id

UUID of the credential for this plugin to reference. Must be already set up on Splitgraph in a previous run. This field is output by sgr cloud dump and is usually not useful if you're writing a splitgraph.yml file from scratch.

.credential

Name of the credential for this plugin to reference, if it requires credentials. You must either:

  • define this credential in the credentials section (required for sgr cloud sync)
  • have this named credential already set up on Splitgraph in a previous run, using sgr cloud load or through the GUI

.is_live

Whether to enable live querying for plugins like postgres, snowflake, elasticsearch, csv that are based on foreign data wrappers and support it. If this is enabled, Splitgraph will create a "live" tag in this repository that you will be able to reference to query data at source without loading it.

.plugin

ID of the plugin used by this repository, for example, dbt or snowflake. To list all available plugins, run sgr cloud plugins.

.params

Plugin-specific parameters that apply to the whole repository. Must match the plugin's JSONSchema. As with the credentials section, sgr cloud stub generates a sample value for this field.

Example:

params:
  connection: # Choose one of:
    - connection_type: http # REQUIRED. Constant
      url: "" # REQUIRED. HTTP URL to the CSV file
    - connection_type: s3 # REQUIRED. Constant
      s3_endpoint: "" # REQUIRED. S3 endpoint (including port if required)
      s3_bucket: "" # REQUIRED. Bucket the object is in
      s3_region: "" # Region of the S3 bucket
      s3_secure: false # Whether to use HTTPS for S3 access
      s3_object: "" # Limit the import to a single object
      s3_object_prefix: "" # Prefix for object in S3 bucket
  autodetect_header: true # Detect whether the CSV file has a header automatically
  autodetect_dialect: true # Detect the CSV file's dialect (separator, quoting characters etc) automatically
  autodetect_encoding: true # Detect the CSV file's encoding automatically
  autodetect_sample_size: 65536 # Sample size, in bytes, for encoding/dialect/header detection
  schema_inference_rows: 100000 # Number of rows to use for schema inference
  encoding: utf-8 # Encoding of the CSV file
  ignore_decode_errors: false # Ignore errors when decoding the file
  header: true # First line of the CSV file is its header
  delimiter: "," # Character used to separate fields in the file
  quotechar: '"' # Character used to quote fields

If sgr cloud stub outputs a list of options with a "Choose one of" comment, you should fill out one of the items in the list. For example:

params:
  connection:
    connection_type: s3 # REQUIRED. Constant
    s3_endpoint: "" # REQUIRED. S3 endpoint (including port if required)
    s3_bucket: "" # REQUIRED. Bucket the object is in
    s3_region: "" # Region of the S3 bucket
    s3_secure: false # Whether to use HTTPS for S3 access
    s3_object: "" # Limit the import to a single object
    s3_object_prefix: "" # Prefix for object in S3 bucket
  autodetect_header: true
  # ...

.tables

Tables to be created in the repository by ingestion jobs, and in the "live" tag if is_live is enabled.

You can leave the tables unspecified by setting this to {} (an empty dictionary). The plugin will then introspect the available tables when you run sgr cloud load or sgr cloud sync. In addition, you can run sgr cloud dump to output the current settings, including inferred tables and their schemas.

.tables.<table_name>

Settings for a given table.

.tables.<table_name>.options

Plugin-specific parameters that apply to the table, matching the plugin's table JSONSchema. Depending on the plugin, they might be separate parameters that only apply to a table or an override of global repository parameters.

Example for the csv plugin:

options:
  url: "" # HTTP URL to the CSV file
  s3_object: "" # S3 object of the CSV file

.tables.<table_name>.schema

Schema of the table (description of columns and their types). Note that many plugins currently don't support overriding the column names and schemas.

.tables.<table_name>.schema[*].name

Column name, 63 characters or fewer. You can use any characters here. However, because Splitgraph uses PostgreSQL, lowercase ASCII names with underscores instead of spaces work best, as you don't have to quote them in queries.

.tables.<table_name>.schema[*].pg_type

Type of the column (see the [PostgreSQL documentation](https://www.postgresql.org/docs/current/datatype.html) for reference). This only works for plugins that support live querying, as they are backed by PostgreSQL foreign data wrappers, which lets PostgreSQL cast the values at runtime.

.tables.<table_name>.schema[*].comment

Comment on the column.
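
To illustrate how the name, pg_type and comment fields combine, here is a sketch of a tables section with a single table. The table name, column names and comments are hypothetical, and whether the declared names and types are honored depends on the plugin, as noted above.

```yaml
tables:
  my_table: # hypothetical table name
    options:
      url: "" # csv plugin table option, as in the earlier options example
    schema:
      - name: id # lowercase ASCII name, no quoting needed in SQL
        pg_type: integer
        comment: Unique row identifier
      - name: created_at
        pg_type: timestamp without time zone
        comment: Time the record was created
      - name: amount_usd
        pg_type: numeric
        comment: Order amount in US dollars
```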

.schedule

Run ingestion for this data source on a schedule. This creates a new "image" in the repository on every run. This is only required if you're using Splitgraph to schedule and orchestrate your ingestion jobs. As an alternative, you can run sgr cloud sync from GitHub Actions or GitLab CI to trigger Splitgraph jobs on a schedule and track their state.

Example:

schedule:
  schedule: "0 */6 * * *"
  enabled: true

.schedule.schedule

Schedule to run ingestion on, in cron format. Only one ingestion job for a given repository can run at a time. This means that if a job is still in progress when it's time to run the next one, the scheduler will wait until the first job finishes.
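
For reference, here are a few schedule values in the common five-field cron syntax (minute, hour, day of month, month, day of week). These are ordinary cron expressions, and it is an assumption that Splitgraph accepts all of them; only the first one comes from the example above. The alternatives are commented out because the section takes a single schedule value.

```yaml
schedule:
  # Minute 0 of every 6th hour: 00:00, 06:00, 12:00 and 18:00
  schedule: "0 */6 * * *"
  # Alternatives (use only one):
  # schedule: "30 2 * * *"   # every day at 02:30
  # schedule: "0 0 * * 1"    # every Monday at 00:00
  # schedule: "*/15 * * * *" # every 15 minutes
  enabled: true
```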

.schedule.enabled

Flag to enable/disable the ingestion job.

repositories[*].metadata

This section defines various catalog attributes of the repository that aren't relevant to ingestion but are useful for discoverability and organizing your dataset. Splitgraph will display these on the repository's overview page.

Example:

metadata:
  topics:
    - analytics
    - raw
    - postgres
    - normalization:none
  description: Raw analytics data
  sources:
    - anchor: Internal company wiki
      href: https://www.example.com/wiki/data-stack/postgres
  extra_metadata:
    data_source:
      source: Postgres
      normalization: none
  readme:
    text: |
      ## Raw data for analytics

      Sample README for a dataset

.readme

Main body of documentation for the dataset. You can use Markdown formatting. This can be a file path or a dictionary with a README string (see below).

Example:

readme:
  text: |
    ## Raw data for analytics

    Sample README for a dataset

.readme.file

Path to a file. This is the format produced by sgr cloud dump. sgr cloud commands prepend ./readmes to this path when dumping or loading files. To point this path to the README in the repository's root:

  • make a readmes directory with an empty .gitkeep file in it
  • set .readme.file to ../README.md.
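
Put together, the two steps above might look like this. The layout in the comments is an assumed project structure, with splitgraph.yml in the repository's root.

```yaml
# Assumed layout:
#   README.md
#   readmes/.gitkeep
#   splitgraph.yml
#
# ./readmes/../README.md resolves to ./README.md
metadata:
  readme:
    file: ../README.md
```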

.readme.text

Multiline string with the inline README.

.description

Short description of the repository.

.topics

List of arbitrary topics for this repository. Adding topics here lets you filter on them in the catalog and on the search page.

.sources

List of sources for this dataset. The records here will show up in a special section at the top of the overview page.

Example:

sources:
  - anchor: Name of the source
    href: https://www.example.com
    isCreator: false
    isSameAs: false

This section is also used to populate the schema.org metadata on the repository overview page. In particular, the isCreator and isSameAs flags populate the schema.org creator and sameAs properties, respectively.

.license

Freeform text for the license/restrictions on this dataset, rendered at the top of the overview page.
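
Since the field is freeform, any short text works. A minimal sketch, with hypothetical wording:

```yaml
metadata:
  # Hypothetical license text; any freeform string is accepted
  license: "Internal use only. Do not redistribute outside the company."
```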

.extra_metadata

Arbitrary key-value metadata for this repository. This must have two levels of nesting. Example:

extra_metadata:
  data_source:
    source: Postgres
    normalization: none
  internal:
    creator: Some Person
    department: Some Department