Frequently Asked Questions

What is Splitgraph?

Splitgraph is a tool for building, versioning, querying and sharing datasets. It works on top of PostgreSQL and integrates seamlessly with anything that uses PostgreSQL.

What's the difference between Splitgraph and Splitgraph Cloud?

In this documentation, when we say Splitgraph, we mostly mean Splitgraph Core. You can use it stand-alone to augment your data workflows or to share data with any other Splitgraph instance in a peer-to-peer fashion.

Splitgraph.com, or Splitgraph Cloud, is a public Splitgraph instance where you can share and discover data. We sometimes also call it the Splitgraph registry, analogous to the Docker registry.

It's a Splitgraph peer powered by the Splitgraph Core code, adding proprietary features like:

  • data catalog to discover and organize data
  • multitenancy
  • caching SQL proxy that lets you query Splitgraph datasets and other databases with any PostgreSQL client
  • OpenAPI-compatible REST endpoint for any version of any Splitgraph dataset

We have a dedicated documentation section for Splitgraph Cloud.

Do I have to use Splitgraph Cloud to use Splitgraph?

No. While we use the Splitgraph code to power Splitgraph Cloud, Splitgraph is a self-contained stand-alone tool. You can use it in a decentralized way, sharing data between two Splitgraph engines like you would with Git.

Here's an example of getting two Splitgraph instances to synchronize with each other. It is also possible to push data to S3-compatible storage (like Minio).

You can use Splitgraph Cloud if you wish to get or share public data, query data through the Data Delivery Network or have a REST API generated for your dataset.

Most Splitgraph Cloud functionality is a polished and scalable multi-tenant version of what you can build with Splitgraph.

For example, you can use Splitgraph's mounting capability to run federated queries across multiple databases. We use mounting to power our Data Delivery Network, giving users the experience of a public PostgreSQL database with tens of thousands of datasets at their fingertips, queryable with any SQL client.

You can also set up PostgREST with Splitgraph to generate a REST API for your datasets. We use a fork of PostgREST in Splitgraph Cloud to power our REST API.

Do I have to download Splitgraph to use Splitgraph Cloud?

No. While Splitgraph Cloud is a Splitgraph remote, letting you push and pull data between it and your local Splitgraph instance, a lot of its functionality doesn't require you to download Splitgraph.

In particular, you can connect to Splitgraph's Data Delivery Network at postgresql://data.splitgraph.com:5432/ddn with any PostgreSQL client without having to install anything else. You can get credentials in less than 30 seconds at splitgraph.com/connect. We support login via GitHub, GitLab and Google OAuth.
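For illustration, here's a minimal Python sketch of querying the DDN with psycopg2 (the API key and secret are placeholders for the credentials you get at splitgraph.com/connect):

    # Query the Splitgraph DDN like any other PostgreSQL database.
    # <api-key> / <api-secret> are placeholders for your Splitgraph credentials.
    import psycopg2

    conn = psycopg2.connect(
        host="data.splitgraph.com",
        port=5432,
        dbname="ddn",
        user="<api-key>",
        password="<api-secret>",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT 1")  # any SQL a normal PostgreSQL client can send
        print(cur.fetchone())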

You can also query the REST API that we provide for any dataset on Splitgraph at https://data.splitgraph.com/[namespace]/[repository]/latest/-/rest.
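As a rough sketch (with hypothetical namespace, repository and table names, and assuming the usual PostgREST conventions for addressing tables and paging), a request could look like this:

    # Hypothetical example: the namespace, repository and table below are
    # placeholders, not real datasets.
    import requests

    resp = requests.get(
        "https://data.splitgraph.com/some-namespace/some-repo/latest/-/rest/some_table",
        params={"limit": 5},  # standard PostgREST paging parameter
    )
    print(resp.json())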

What is the Splitgraph Data Delivery Network?

The Splitgraph Data Delivery Network is a Splitgraph Cloud feature that lets you connect to Splitgraph with any PostgreSQL client without downloading anything else. It allows you to query any of the over 40,000 datasets hosted on or proxied by Splitgraph with plain SQL. You can even run JOIN queries between distinct datasets.
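On the DDN, each repository is addressable as a schema (roughly "namespace/repository"), so a cross-dataset join is an ordinary SQL join. Here's a hedged sketch with placeholder dataset, table and column names:

    # Cross-dataset JOIN on the DDN; all dataset, table and column names
    # below are placeholders.
    import psycopg2

    conn = psycopg2.connect(
        "postgresql://<api-key>:<api-secret>@data.splitgraph.com:5432/ddn"
    )
    with conn.cursor() as cur:
        cur.execute("""
            SELECT a.some_key, a.some_value, b.other_value
            FROM "some-namespace/first-repo".some_table a
            JOIN "other-namespace/second-repo".other_table b
              ON a.some_key = b.some_key
            LIMIT 10
        """)
        print(cur.fetchall())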

On a technical level, you can think of the Splitgraph DDN as a PostgreSQL wire-level proxy and firewall. We use open-source technology like PostgreSQL foreign data wrappers and pgBouncer as well as Splitgraph Core's features like mounting and layered querying to route and execute federated queries across databases, third-party data providers, and Splitgraph images.

We even implement the ANSI information schema standard: when your client connects to the Splitgraph Data Delivery Network, it will show you a list of featured repositories as well as data that you previously uploaded to Splitgraph.

All of this gives you a seamless experience of a single public database with thousands of datasets at your fingertips, compatible with most existing clients and BI tools.

Is Splitgraph a PostgreSQL extension?

Not quite. The Splitgraph engine ships as a Docker image and is a customized version of PostgreSQL that is fully compatible with existing clients. In the future, we might repackage Splitgraph as a PostgreSQL extension.

Can I add Splitgraph to my existing PostgreSQL deployment?

While it is possible to add Splitgraph to existing PostgreSQL deployments, there isn't currently a simple installation method. If you're interested in doing so, you can follow the instructions in the Dockerfile used to build the engine or contact us.

You can also add the Splitgraph engine as a PostgreSQL logical replication client, which will let you ingest data from existing databases without installing Splitgraph on them.
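A rough sketch of what that can look like, using plain PostgreSQL replication commands (hostnames, credentials and table names are placeholders, and the source database needs wal_level = logical):

    # Publish tables on the existing (source) database and subscribe to them
    # from the Splitgraph engine. All connection strings and names are placeholders.
    import psycopg2

    # On the source database: publish the tables you want to ingest.
    src = psycopg2.connect("postgresql://user:pass@source-db:5432/appdb")
    src.autocommit = True
    with src.cursor() as cur:
        cur.execute("CREATE PUBLICATION splitgraph_pub FOR TABLE my_table;")

    # On the Splitgraph engine: subscribe to that publication.
    # CREATE SUBSCRIPTION can't run inside a transaction block, hence autocommit.
    engine = psycopg2.connect("postgresql://sgr:password@localhost:5432/splitgraph")
    engine.autocommit = True
    with engine.cursor() as cur:
        cur.execute(
            "CREATE SUBSCRIPTION splitgraph_sub "
            "CONNECTION 'host=source-db dbname=appdb user=user password=pass' "
            "PUBLICATION splitgraph_pub;"
        )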

Does my data have to be in PostgreSQL to use Splitgraph?

No. With mounting, you can query data in other databases (including MongoDB, MySQL, PostgreSQL and Elasticsearch) directly through Splitgraph with any PostgreSQL client. You do not need to copy your data into PostgreSQL to use Splitgraph.
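Under the hood, mounting uses PostgreSQL foreign data wrappers. As an illustration of the mechanism (not necessarily the exact commands Splitgraph's `sgr mount` runs), here's a sketch that mounts another PostgreSQL database with the stock postgres_fdw; all hostnames, credentials and names are placeholders:

    # Illustrative FDW-based "mount" of another PostgreSQL database into the
    # Splitgraph engine; connection details and names are placeholders.
    import psycopg2

    conn = psycopg2.connect("postgresql://sgr:password@localhost:5432/splitgraph")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw;")
        cur.execute(
            "CREATE SERVER other_db FOREIGN DATA WRAPPER postgres_fdw "
            "OPTIONS (host 'other-host', port '5432', dbname 'appdb');"
        )
        cur.execute(
            "CREATE USER MAPPING FOR CURRENT_USER SERVER other_db "
            "OPTIONS (user 'user', password 'pass');"
        )
        cur.execute("CREATE SCHEMA other_data;")
        cur.execute("IMPORT FOREIGN SCHEMA public FROM SERVER other_db INTO other_data;")
        # Tables in other_data can now be queried with any PostgreSQL client.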

Why PostgreSQL? Why not write your own database?

PostgreSQL is a battle-tested RDBMS with a mature feature set (ACID transactions, authentication, indexing, crash safety), a long list of commercial and enthusiast users and a rich ecosystem of extensions and applications. We don't think that features added by Splitgraph warrant building a brand new database.

Can I use Splitgraph with my existing tools?

Yes. Any PostgreSQL client can query Splitgraph datasets (directly or through layered querying), including DataGrip, pgAdmin, pgcli and DBeaver.

If you're a user of Splitgraph Cloud, you can use most SQL clients to query data from the Splitgraph Data Delivery Network without downloading Splitgraph.

In addition, Splitgraph can enhance a lot of existing applications and extensions that work with PostgreSQL. We have examples of using Splitgraph with Jupyter notebooks, PostGIS, PostgREST, dbt or Metabase.

Philosophically, we believe that great tools enhance existing abstractions without breaking them:

  • Git adds versioning to the filesystem, enhancing it for any tools that use the filesystem. Compilers, IDEs or editors don't need to be aware of Git to reap its benefits. You don’t need to adopt a new filesystem or use a specialized IDE to version your code with Git.
  • Docker allows you to build images without forcing you to change your code to work in Docker or recompile it to use special Docker system calls. The application does not need to know it's running inside a container, so you don't need to rewrite it to run in Docker.

Git and Docker are good tools because they stay out of the way while still adding value. We were guided by the same principles when building Splitgraph.

What's the performance like? Do you have any benchmarks?

We maintain a couple of Jupyter notebooks with benchmarks on our GitHub.

It's difficult to define what a benchmark for Splitgraph should measure, since for a lot of operations one would effectively be benchmarking PostgreSQL itself. This is why we haven't run benchmarks like TPC-DS against Splitgraph (for maximum performance, you can simply check out a Splitgraph image into a PostgreSQL schema) and have instead measured the overhead of various Splitgraph workloads over plain PostgreSQL.

In short:

  • Committing and checking out Splitgraph images takes slightly less time than writing the same data to PostgreSQL tables (Splitgraph moves data directly between PostgreSQL tables without query parsing overhead).
  • Writing to PostgreSQL tables that are change-tracked by Splitgraph is almost 2x slower than writing to untracked tables (Splitgraph uses audit triggers to record changes rather than diffing the table at commit time).
  • Splitgraph images take up much less space (5x-10x less) than equivalent PostgreSQL tables, since Splitgraph uses cstore_fdw for storage.
  • Querying Splitgraph images directly without checkout (layered querying) can sometimes be faster and use less I/O than querying PostgreSQL tables.

Can Splitgraph be used for big datasets?

Yes. Splitgraph has a few optimizations that make it suitable for working with large datasets:

  • Datasets are partitioned into fragments stored in a columnar format, which is better suited to OLAP workloads than row-oriented storage.
  • You can query Splitgraph images without checking them out or even downloading them completely. With layered querying, Splitgraph can lazily download a small fraction of the table needed for the query. This is still completely seamless to the client application.

Since Splitgraph is built on top of PostgreSQL, you can use the same methods for horizontally scaling a PostgreSQL deployment to scale a Splitgraph engine.

What are the general ideas behind Splitgraph?

Splitgraph tries to provide a framework for dealing with common data science problems. Rather than taking an existing tool and asking "what would it look like if we applied it to data", we asked ourselves "what do current data management workflows look like, where are they lacking and what can we usefully borrow from other disciplines?"

We drew inspiration from tools and best practices in software engineering: versioning, self-contained artifacts, continuous integration, reproducible builds and the idea of tools that improve existing workflows without breaking them or forcing users to change them.

Why not just use...

dbt

dbt is a tool for transforming data inside the data warehouse that lets users build up transformations from reusable, versionable SQL snippets.

Splitgraph enhances dbt: since a Splitgraph engine is also a PostgreSQL instance, dbt can work against it, bringing benefits like version control, packaging and sharing to the datasets it uses and builds.

We have an example of running dbt in such a way, swapping between different versions of the source dataset and looking at their effect on the built dbt model.

In addition, we have written a Splitgraph dbt adapter that allows you to reference Splitgraph images directly from your dbt model.

Splitgraph also offers its own method of building datasets: Splitfiles. Splitfiles offer Dockerfile-like caching, provenance tracking, fast dataset rebuilds, joins between datasets and full SQL support.

We envision Splitfiles as a replacement for ETL pipelines: instead of a series of processes that transform data between tables in a data warehouse, Splitgraph treats transformations as pure functions between isolated self-contained datasets, allowing you to replay any part of your pipeline at any point in time.

Databricks

Databricks offers a full take-it-or-leave-it platform for doing data science, including a managed Spark cluster, whereas Splitgraph allows you to pick and choose parts of your workflow that you wish to swap out with Splitgraph.

Rather than ad hoc notebooks, Splitgraph encourages users to transform data using reproducible Splitfiles that allow hermetic dataset builds.

You can use the same Splitgraph workflows in development (against the user's local engine) and in production (against a remote engine). This helps decrease iteration time and lets you maintain a single codebase for your data transformations.

Pachyderm

Pachyderm is used mostly for managing and running distributed data pipelines on flat files (images, genomics data, etc). In contrast, Splitgraph specializes in datasets represented as tables in a database, providing benefits like delta compression on changed data and faster querying speeds.

Similarly to Pachyderm, Splitgraph supports data lineage (or provenance) tracking where the metadata of a downstream dataset includes a record of the upstream commands and sources used to build it. This way, you can inspect the provenance of a dataset and rebuild it when its upstream data changes.

You can integrate Splitgraph with Pachyderm using the same methods you would use for PostgreSQL. Then you can run a Splitfile to build a dataset as a Pachyderm stage.

dvc, DataLad, ...

Some tools use git-annex to version code and data together. Splitgraph's versioning improves on that by delta compressing changes (so bringing a dataset up to date only requires downloading the changes rather than the whole new version) and by putting the data inside an actual database, making querying more efficient.

FoundationDB, Dolt, ...

A lot of tools that do data versioning also require users to learn to use a brand new database and rewrite their existing applications to use it. We don't think the benefits from writing a new database system from scratch outweigh the costs of bringing it up to feature and performance parity with existing databases and bootstrapping an ecosystem of extensions and applications.