Welcome to Splitgraph

Splitgraph consists of two sets of tools.

  • Splitgraph, or Splitgraph Core: a tool for building, versioning and querying reproducible datasets. It's inspired by Docker and Git, built with PostgreSQL and integrates with anything that uses PostgreSQL.
  • Splitgraph Cloud, or Splitgraph.com: an SQL proxy, data catalog and Splitgraph peer. It hosts Splitgraph images as well as proxies queries to upstream data vendors, letting you query over 40,000 datasets with any SQL client.

Splitgraph and Splitgraph Cloud can be used independently of each other. For example, you can use sgr to manage data images without using Splitgraph Cloud, or you can query data on the Splitgraph Data Delivery Network without installing Splitgraph locally. You can also use them both at the same time, for example, by pulling data from Splitgraph Cloud to use in Splitfiles.

This documentation is mostly for Splitgraph Core, so we refer to it throughout as Splitgraph. Splitgraph Cloud has its own dedicated section.

Splitgraph

Splitgraph allows the user to manipulate data images (snapshots of SQL tables at a given point in time) as if they were code repositories by versioning, pushing and pulling them.

It works on top of PostgreSQL and uses SQL for all versioning and internal operations. You can "check out" data into actual PostgreSQL tables, offering read/write performance and feature parity with PostgreSQL and allowing you to query it with any SQL client. The client application has no idea that it's talking to a Splitgraph table and you don't need to rewrite any of your tools to use Splitgraph. Anything that works with PostgreSQL will work with Splitgraph.

Splitgraph also defines the declarative Splitfile language with Dockerfile-like caching semantics that allows you to build Splitgraph data images in a composable, maintainable and reproducible way. When you build data with Splitfiles, you get provenance tracking. You can inspect an image's metadata to find the exact upstream images, tables and columns that went into it. With one command, Splitgraph can use this provenance data to rebuild an image against a newer version of its upstream dependencies. You can easily integrate Splitgraph into your existing CI pipelines, to keep your data up-to-date and stay on top of changes to its inputs.

You do not need to download the full Splitgraph image to query it. Instead, you can query Splitgraph images with layered querying, which will download only the regions of the table relevant to your query, using bloom filters and other metadata. This is useful when you're exploring large datasets from your laptop, or when you're only interested in a subset of data from an image. This is still completely transparent to the client application, which sees a PostgreSQL schema that it can talk to using the Postgres wire protocol.

Splitgraph does not limit your data sources to Postgres databases. It includes first-class support for importing and querying data from other databases using Postgres foreign data wrappers. You can create Splitgraph images or query data in MongoDB, MySQL, CSV files, other Postgres databases or even Elasticsearch clusters using the same interface.

Finally, Splitgraph is peer-to-peer. You can push and pull data images between other Splitgraph installations and use it as a standalone tool to supercharge your data workflows.

Splitgraph Cloud

Splitgraph.com, or Splitgraph Cloud is a public Splitgraph instance where you can share and discover data. It's a Splitgraph peer that extends on the Splitgraph Core with extra proprietary features.

Splitgraph Cloud acts as a catalog for external data, indexing over 40,000 open government datasets on the Socrata platform. Analyze coronavirus data with Jupyter and scikit-learn, plot nearby marijuana dispensaries with Metabase and PostGIS or just explore Chicago open data with DBeaver and do so from the comfort of a battle-tested RDBMS with a mature feature set and a rich ecosystem of integrations.

Splitgraph Cloud also gives you more ways to consume data. Every dataset pushed to Splitgraph Cloud gets a REST endpoint provided for it by PostgREST. Also, you can query any dataset on Splitgraph Cloud through our Data Delivery Network, allowing you to access data directly from your SQL client, application or BI dashboard without having to install anything else.

Read the Splitgraph Cloud guide for more information or try our one minute demo for a showcase of the Splitgraph Data Delivery Network.