Frequently Asked Questions

What is Splitgraph?

Splitgraph is a tool for building, versioning, querying and sharing datasets that works on top of PostgreSQL and integrates seamlessly with anything that uses PostgreSQL.

Is Splitgraph a PostgreSQL extension?

Not quite. The Splitgraph engine ships as a Docker image and is a customized version of PostgreSQL that is fully compatible with existing clients. In the future, we might repackage Splitgraph as a PostgreSQL extension.

Can I add Splitgraph to my existing PostgreSQL deployment?

While it is possible to add Splitgraph to existing PostgreSQL deployments, there isn't currently a simple installation method. If you're interested in doing so, you can follow the instructions in the Dockerfile used to build the engine or contact us.

You can also add the Splitgraph engine as a PostgreSQL logical replication client, which will let you ingest data from existing databases without installing Splitgraph on them.
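
As a rough sketch, this setup uses stock PostgreSQL logical replication; the table, host and credential names below are hypothetical:

    -- On the existing (source) database: publish the tables to replicate.
    -- (Requires wal_level = logical on the source.)
    CREATE PUBLICATION splitgraph_pub FOR TABLE production.orders;

    -- On the Splitgraph engine (an ordinary PostgreSQL instance): create a
    -- matching table, then subscribe to the publication to start ingesting.
    CREATE TABLE production.orders (id integer PRIMARY KEY, total numeric);
    CREATE SUBSCRIPTION splitgraph_sub
        CONNECTION 'host=source-db dbname=mydb user=replicator'
        PUBLICATION splitgraph_pub;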

Does my data have to be in PostgreSQL to use Splitgraph?

With foreign data wrappers, you can query data in other databases (including MongoDB, MySQL or other PostgreSQL instances) directly through Splitgraph with any PostgreSQL client. You do not need to copy your data into PostgreSQL to use Splitgraph.
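
For illustration, here is what wrapping a table that lives in another PostgreSQL database looks like with the stock postgres_fdw extension (the server, schema and table names are made up; wrappers like mongo_fdw or mysql_fdw follow the same pattern):

    -- Run on the Splitgraph engine: wrap a remote database so its tables
    -- can be queried in place, without copying the data.
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER legacy_db FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'legacy-db.internal', dbname 'sales');

    CREATE USER MAPPING FOR CURRENT_USER SERVER legacy_db
        OPTIONS (user 'reader', password 'secret');

    -- Expose the remote tables under a local schema.
    CREATE SCHEMA remote_sales;
    IMPORT FOREIGN SCHEMA public FROM SERVER legacy_db INTO remote_sales;

    -- This query executes against the remote database.
    SELECT count(*) FROM remote_sales.orders;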

Why PostgreSQL? Why not write your own database?

PostgreSQL is a battle-tested RDBMS with a mature feature set (ACID transactions, authentication, indexing, crash safety), a long list of commercial and enthusiast users and a rich ecosystem of extensions and applications. We don't think the features added by Splitgraph warrant building a brand-new database.

Can I use Splitgraph with my existing tools?

Yes. Any PostgreSQL client can query Splitgraph datasets (directly or through layered querying), including DataGrip, pgAdmin, pgcli and DBeaver.

In addition, Splitgraph can enhance many existing applications and extensions that work with PostgreSQL. We have examples of using Splitgraph with Jupyter notebooks, PostGIS, PostgREST, dbt and Metabase.
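
Concretely, a checked-out Splitgraph image is just a PostgreSQL schema, so any client sees ordinary tables. In this hypothetical session, the repository and column names are illustrative:

    -- From psql, DataGrip, pgAdmin or any other client connected to the engine:
    SELECT origin_city, count(*) AS num_flights
    FROM "splitgraph/flights".flights
    GROUP BY origin_city
    ORDER BY num_flights DESC
    LIMIT 10;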

Philosophically, we believe that great tools enhance existing abstractions without breaking them:

  • Git adds versioning to the filesystem, enhancing it for any tools that use the filesystem. Compilers, IDEs or editors don't need to be aware of Git to reap its benefits. You don't need to adopt a new filesystem or use a specialized IDE to version your code with Git.
  • Docker allows you to build images without forcing you to change your code to work in Docker or recompile it to use special Docker system calls. The application does not need to know it's running inside a container, so you don't need to rewrite it to run in Docker.

Git and Docker are good tools because they stay out of the way while still adding value. We were guided by the same principles when building Splitgraph.

What's the performance like? Do you have any benchmarks?

We maintain a couple of Jupyter notebooks with benchmarks on our GitHub.

It's difficult to define a representative benchmark for Splitgraph, since for many operations one would really be benchmarking PostgreSQL itself. This is why we haven't run benchmarks like TPC-DS on Splitgraph (for maximum performance, you can simply check a Splitgraph image out into a PostgreSQL schema and query it like any other PostgreSQL data), but we have measured the overhead of various Splitgraph workloads over vanilla PostgreSQL.

In short: a checked-out image is a set of ordinary PostgreSQL tables, so querying it performs the same as querying PostgreSQL directly; layered querying adds some per-query overhead in exchange for not having to download or check out the whole dataset.

Can Splitgraph be used for big datasets?

Yes. Splitgraph has a few optimizations that make it suitable for working with large datasets:

  • Datasets are partitioned into fragments stored in a columnar format, which outperforms row-oriented storage for OLAP workloads.
  • You can query Splitgraph images without checking them out or even downloading them completely. With layered querying, Splitgraph lazily downloads only the small fraction of each table needed by the query, while remaining completely seamless to the client application (see the sketch after this list).
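
Here is a sketch of the second point, assuming a hypothetical some/weather repository that has been checked out in layered mode:

    -- The image's tables are exposed as foreign tables; this query makes
    -- Splitgraph fetch only the fragments it needs, not the whole table.
    SELECT region, avg(temperature) AS avg_temp
    FROM "some/weather".readings
    WHERE year = 2019
    GROUP BY region;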

Since Splitgraph is built on top of PostgreSQL, you can use the same methods for horizontally scaling a PostgreSQL deployment to scale a Splitgraph engine.

Do I have to register anywhere to use Splitgraph?

No. You can use Splitgraph in a decentralized way, sharing data between two engines like you would with Git. Here's an example of getting two Splitgraph instances to synchronize with each other.

It is also possible to push data to S3-compatible storage (like Minio).

You can use Splitgraph Cloud if you wish to get or share public data or have a REST API generated for your dataset.

What are the general ideas behind Splitgraph?

Splitgraph tries to provide a framework for dealing with common data science problems. Rather than taking an existing tool and asking "what would it look like if we applied it to data", we asked ourselves "what do current data management workflows look like, how are they lacking and what can we borrow from other disciplines that can be useful?"

We drew our inspiration from multiple tools and best practices in software engineering: versioning, self-contained artifacts, continuous integration, reproducible builds and the idea of tools that improve existing workflows without breaking them or forcing users to change them.

Why not just use...

dbt

dbt is a tool for transforming data inside the data warehouse that lets users build up transformations from reusable, versionable SQL snippets.

Splitgraph enhances dbt: since a Splitgraph engine is also a PostgreSQL instance, dbt can work against it, bringing benefits like version control, packaging and sharing to the datasets that it uses and builds.

We have an example of running dbt in such a way, swapping between different versions of the source dataset and looking at their effect on the built dbt model.

In addition, we have written a Splitgraph dbt adapter that allows you to reference Splitgraph images directly from your dbt model.
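
As a sketch of the first approach, a dbt model can simply select from the schema of a checked-out image, since the engine is plain PostgreSQL (the noaa/rainfall repository and its columns here are hypothetical):

    -- models/rainfall_monthly.sql: an ordinary dbt model run against the engine;
    -- "noaa/rainfall" is the schema of a checked-out Splitgraph image.
    {{ config(materialized='table') }}

    SELECT date_trunc('month', measured_at) AS month,
           avg(rainfall_mm) AS avg_rainfall
    FROM "noaa/rainfall".rainfall
    GROUP BY 1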

Splitgraph also offers its own method of building datasets: Splitfiles. Splitfiles offer Dockerfile-like caching, provenance tracking, fast dataset rebuilds, joins between datasets and full SQL support.

We envision Splitfiles as a replacement for ETL pipelines: instead of a series of processes that transform data between tables in a data warehouse, Splitgraph treats transformations as pure functions between isolated self-contained datasets, allowing you to replay any part of your pipeline at any point in time.
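
For example, a minimal Splitfile joining two upstream images might look like the sketch below (the repositories and columns are hypothetical). FROM imports tables from upstream images and SQL runs a transformation against them; each command's output is cached Dockerfile-style, and the resulting image records both source images in its provenance:

    FROM noaa/rainfall:latest IMPORT rainfall AS rainfall
    FROM census/population:latest IMPORT population AS population

    SQL {
        CREATE TABLE rainfall_per_capita AS
        SELECT r.region, r.total_rainfall / p.population AS rainfall_per_capita
        FROM rainfall r JOIN population p ON r.region = p.region
    }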

Databricks

Databricks offers a full take-it-or-leave-it platform for doing data science, including a managed Spark cluster, whereas Splitgraph allows you to pick and choose parts of your workflow that you wish to swap out with Splitgraph.

Rather than ad hoc notebooks, Splitgraph encourages users to transform data using reproducible Splitfiles that allow hermetic dataset builds.

You can use the same Splitgraph workflows in development, with the user's local engine, and production, with a remote engine. This helps decrease iteration time and lets you maintain a single codebase for your data transformations.

Pachyderm

Pachyderm is mostly used for managing and running distributed data pipelines on flat files (images, genomics data, etc.). In contrast, Splitgraph specializes in datasets represented as tables in a database, providing benefits like delta compression of changed data and faster querying.

Similarly to Pachyderm, Splitgraph supports data lineage (or provenance) tracking where the metadata of a downstream dataset includes a record of the upstream commands and sources used to build it. This way, you can inspect the provenance of a dataset and rebuild it when its upstream data changes.

You can integrate Splitgraph with Pachyderm using the same methods you would use for PostgreSQL. Then you can run a Splitfile to build a dataset as a Pachyderm stage.

dvc, DataLad, ...

Some tools use git-annex to version code and data together. Splitgraph's versioning improves on that by delta compressing changes (bringing a dataset up to date only requires downloading the changes, rather than the whole new version) and by putting the data inside an actual database, making querying more efficient.

FoundationDB, Dolt, ...

A lot of tools that do data versioning also require users to learn to use a brand new database and rewrite their existing applications to use it. We don't think the benefits from writing a new database system from scratch outweigh the costs of bringing it up to feature and performance parity with existing databases and bootstrapping an ecosystem of extensions and applications.