splitgraph.yml: Terraform for your data stack

Introducing the `splitgraph.yml` format, which lets you programmatically manage your datasets on Splitgraph, change their data source settings and define dbt transformations.
Splitgraph is building the Unified Data Stack – an integrated and modern solution for working with data without worrying about the underlying infrastructure.
There is a trend in data engineering to apply best practices from software engineering, for example, version control and code reviews. Splitgraph itself was originally inspired by Git and Docker, both famous developer tools.
Another DevOps tool we like is Terraform: it lets you use a declarative configuration language to define your cloud infrastructure. Terraform then takes care of interpreting this configuration and calling cloud vendor-specific APIs to create/destroy resources and align the real state of the world with the desired configuration.
We built something similar for the data stack: `splitgraph.yml`, or Splitgraph Cloud project files. You can think of `splitgraph.yml` as a Terraform file that lets you programmatically configure a dataset in Splitgraph, including its data source settings, catalog metadata (README, description, topics) and dbt transformations.
We use this format ourselves to declare data ingestion pipelines with Airbyte, build our own data warehouse with dbt and organize them in our Splitgraph data catalog (see a previous blog post for more details!).
In this post, we'll talk about how you can use `splitgraph.yml` to define your own data pipeline and show an end-to-end example that ingests some sample data from Socrata and transforms it using dbt.
Here is a basic example that builds two Splitgraph repositories:
```yaml
repositories:
  - namespace: splitgraph-demo
    repository: for-hire-vehicles # name assumed from the README link below
    # Catalog-specific metadata for the repository.
    metadata:
      readme:
        text: For more information, see https://www.splitgraph.com/cityofnewyork-us/for-hire-vehicles-fhv-active-8wbx-tsch
      description: New York Active For Hire Vehicles
      topics:
        - new york
    # Data source settings
    external:
      plugin: csv
      params:
        url: "" # Will be overridden by the individual table
      tables:
        for_hire_vehicles: # table name assumed
          options:
            url: "" # per-table Socrata CSV endpoint
          schema: [] # Schema of the table (set to [] to infer)
      # Enable live querying for the plugin (creates a "live" tag in the
      # repository proxying to the data source).
      is_live: true
  # dbt model that processes the previous repository
  - namespace: splitgraph-demo
    repository: fhv-dbt-model # name assumed
    external:
      plugin: dbt
      params:
        # Map dbt source names to Splitgraph repositories
        sources:
          - dbt_source_name: for_hire_vehicles
            namespace: splitgraph-demo
            repository: for-hire-vehicles
    metadata:
      description: Sample dbt model
      readme:
        text: "## Sample dbt model\n\nThis is a model referencing data\
          \ from:\n\n * [/splitgraph-demo/for-hire-vehicles](/splitgraph-demo/for-hire-vehicles)"
```
The full project that uses this file is located on our GitHub. You can also query the resultant dataset on Splitgraph:
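For example, you can run a query from the command line with `sgr cloud sql`. The table name `for_hire_vehicles` here is illustrative (it depends on how the ingestion names the table); Splitgraph addresses tables as `"namespace/repository"."table"`:

```
$ sgr cloud sql 'SELECT COUNT(*) FROM "splitgraph-demo/for-hire-vehicles"."for_hire_vehicles"'
```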
The simplest way to get started with `splitgraph.yml` is by using our new "project seeds" functionality. This will generate a GitHub repository for you with a `splitgraph.yml` file, a dbt model and a GitHub Action file that you can use as a starting point to define your own pipeline.
We've done some magic to make the process smooth: you select your data sources, copy a repository on GitHub and paste a seed value into a GitHub Action, and all of your boilerplate gets set up for you.
Click the button to get started:
Splitgraph supports over 100 sources. These include SaaS services (powered by the open Airbyte standard) as well as "live queries" to the most popular RDBMSs and data warehouses. The latter, powered by PostgreSQL Foreign Data Wrappers, lets you query data through Splitgraph without having to load it.
The wizard will guide you through using our Splitgraph Cloud template project to generate a project from a "seed" (a representation of the data sources you chose).
After you go through the process, you will end up with a repository with Splitgraph Cloud project files:

- a `splitgraph.yml` file that defines your chosen data sources
- a `build.moveme.yml` file in the root with a GitHub Action definition used to ingest the data on a schedule or on-demand
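The generated GitHub Action is roughly shaped like this. Note that this is a hedged sketch rather than the exact generated file: the step names, the schedule and the placeholder repository name are assumptions, and authentication against Splitgraph Cloud is omitted (see the README in the generated repository):

```yaml
name: Build Splitgraph datasets
on:
  workflow_dispatch: # allow on-demand runs
  schedule:
    - cron: "0 2 * * *" # example: run daily at 02:00 UTC
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install the sgr CLI
        run: pip install splitgraph
      # Authentication against Splitgraph Cloud is omitted here.
      - name: Upload catalog metadata and data source settings
        run: sgr cloud load -f splitgraph.yml
      - name: Run the ingestion job
        run: sgr cloud sync -f splitgraph.yml some-namespace/some-repository # placeholder
```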
We created the example project using this method.
You will need to perform some extra steps to finish the setup (see the README in the generated repository for more information):

- set up credentials in `splitgraph.credentials.yml` (not required for our sample project, as the Socrata endpoint we're using to get the data is public)
- edit `splitgraph.yml` to set up the data source parameters
- move `build.moveme.yml` into your repository's `.github/workflows` directory so that the GitHub Action can build `splitgraph.yml` with the `sgr` CLI
The example GitHub Action calls the `sgr` CLI to interact with Splitgraph Cloud and manage datasets and pipelines.
You can run the same commands from your development machine or from another workflow orchestrator, like GitLab CI. Let's look at these commands in detail.
sgr cloud stub: generate a `splitgraph.yml` file

This command will generate a basic `splitgraph.yml` file for a specific plugin. You can use this file as a starting point (documentation).
```
$ sgr cloud stub csv splitgraph-demo/csv
credentials:
  csv: # This is the name of this credential that "external" sections can reference.
    plugin: csv
    # Credential-specific data matching the plugin's credential schema
    data:
      s3_access_key: '' # AWS Access Key
      s3_secret_key: '' # AWS Secret Access Key
repositories:
  - namespace: splitgraph-demo
    repository: csv
    # Catalog-specific metadata for the repository. Optional.
    metadata:
      description: Description of the repository
    # Data source settings for the repository. Optional.
    external:
      plugin: csv
      # Name of the credential that the plugin uses. This can also be a credential_id if the
      # credential is already registered on Splitgraph.
      credential: csv
      # Plugin-specific parameters matching the plugin's parameters schema
      params:
        connection: # Connection. Choose one of:
          - connection_type: http # REQUIRED. Connection type. Constant
            url: '' # REQUIRED. URL. HTTP URL to the CSV file
          - connection_type: s3 # REQUIRED. Connection type. Constant
            s3_endpoint: '' # REQUIRED. S3 endpoint. S3 endpoint (including port if required)
            s3_bucket: '' # REQUIRED. Bucket name. Bucket the object is in
            s3_region: '' # S3 region. Region of the S3 bucket
            s3_secure: false # Secure. Whether to use HTTPS for S3 access
            s3_object: '' # S3 Object name. Limit the import to a single object
            s3_object_prefix: '' # S3 Object prefix. Prefix for object in S3 bucket
```
sgr cloud load: set up the catalog metadata
This command (documentation) uploads the README, tags and description, as well as external data source settings and credentials, to Splitgraph Cloud.
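A typical invocation passes both the project file and the credentials file via repeated `-f` flags (a sketch; check `sgr cloud load --help` for the exact options):

```
$ sgr cloud load -f splitgraph.yml -f splitgraph.credentials.yml
```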
sgr cloud sync: run a data load
This command (documentation) starts an ingestion job. If you previously ran `sgr cloud load`, you can use the existing data source settings uploaded to Splitgraph Cloud. You can also make this command use an existing `splitgraph.yml` file, which will run ingestion without Splitgraph Cloud storing your data source credentials.
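For example, to ingest the repository defined earlier directly from the local file (the `--use-file` flag name is an assumption; consult the `sgr cloud sync` documentation for the exact options):

```
$ sgr cloud sync -f splitgraph.yml --use-file splitgraph-demo/for-hire-vehicles
```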
Other sgr cloud commands

There are other useful `sgr cloud` commands. See our documentation for a full list:
- `sgr cloud upload`: upload a CSV file to Splitgraph Cloud
- `sgr cloud download`: export a SQL query result as a CSV file
- `sgr cloud plugins`: search for data source plugins
- `sgr cloud search`: search for repositories
- `sgr cloud sql`: run SQL queries
- `sgr cloud status`: check the status of ingestion jobs
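For instance, after kicking off an ingestion job you could poll it with `sgr cloud status` (the argument shape is a sketch; see the command's documentation):

```
$ sgr cloud status splitgraph-demo/for-hire-vehicles
```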
An in-depth reference for the `splitgraph.yml` format is available here.
Alternatively, you can configure datasets individually via the web GUI. You can try it here, but note that it's in alpha and some configurations are not yet fully supported:
You can still store your catalog in the `splitgraph.yml` format with this method: use the `sgr cloud dump` command to dump existing repositories to a `splitgraph.yml` file.
In this blog post, we showcased `splitgraph.yml`: a configuration format that lets you programmatically manage your datasets on Splitgraph. You can find an example project on our GitHub.
The example project also shows how to make the data repositories that you build private. You can then use our sharing settings and invites functionality to manage who has access to your datasets and invite collaborators.
Interested in tighter security guarantees or more team features? We also provide private versions of Splitgraph Cloud, deployed to a cloud vendor and region of your choice. Feel free to try it out!