Frequently Asked Questions

Do I have to use Splitgraph to use sgr?

No. While we use some parts of sgr to power Splitgraph, sgr is a standalone tool. You can use it in a decentralized way, sharing data between two sgr engines like you would with Git.

Here's an example of getting two sgr instances to synchronize with each other. It is also possible to push data to S3-compatible storage (like Minio).

Do I have to download sgr to use Splitgraph?

No. While Splitgraph is a sgr peer, letting you push and pull data between it and your local sgr instance, a lot of its functionality doesn't require you to download sgr.

Is sgr a PostgreSQL extension?

Not quite. The sgr engine ships as a Docker image and is a customized version of PostgreSQL that is fully compatible with existing clients. In the future, we might repackage sgr as a PostgreSQL extension.

Can I add sgr to my existing PostgreSQL deployment?

While it is possible to add sgr to existing PostgreSQL deployments, there isn't currently a simple installation method. If you're interested in doing so, you can follow the instructions in the Dockerfile used to build the engine or contact us.

You can also add the sgr engine as a PostgreSQL logical replication client, which will let you ingest data from existing databases without installing sgr on them.

Does my data have to be in PostgreSQL to use sgr?

No. With mounting, you can query data in other databases (including MongoDB, MySQL, PostgreSQL or Elasticsearch) directly through Splitgraph with any PostgreSQL client, so you do not need to copy your data into PostgreSQL to use sgr.

Can I use sgr with my existing tools?

Yes. Any PostgreSQL client can query Splitgraph repositories (directly or through layered querying), including DataGrip, pgAdmin, pgcli and DBeaver.

In addition, sgr can enhance a lot of existing applications and extensions that work with PostgreSQL. We have examples of using sgr with Jupyter notebooks, PostGIS, PostgREST, dbt or Metabase.
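As a rough sketch of what "any PostgreSQL client" means in practice: since the engine speaks the normal PostgreSQL protocol, a client only needs an ordinary connection URI. The helper below is hypothetical, and its host, port, user, password and database values are placeholders rather than documented sgr defaults; substitute whatever your engine is configured with.

```python
from urllib.parse import quote


def engine_uri(user, password, host="localhost", port=5432, dbname="splitgraph"):
    """Build a libpq-style connection URI for a PostgreSQL-compatible server.

    All default values here are illustrative assumptions, not sgr defaults.
    """
    # Percent-encode the parts that may contain reserved characters.
    user_enc = quote(user, safe="")
    password_enc = quote(password, safe="")
    dbname_enc = quote(dbname, safe="")
    return f"postgresql://{user_enc}:{password_enc}@{host}:{port}/{dbname_enc}"


uri = engine_uri("sgr", "s3cret/pw")
print(uri)  # postgresql://sgr:s3cret%2Fpw@localhost:5432/splitgraph
```

A URI of this shape can be handed to most PostgreSQL clients and drivers, so no sgr-specific client library is involved on the querying side.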

What's the performance like? Do you have any benchmarks?

We maintain a couple of Jupyter notebooks with benchmarks on our GitHub.

It's difficult to define a meaningful benchmark for sgr, since for many operations one would really be benchmarking PostgreSQL itself. This is why we haven't run benchmarks like TPC-DS on sgr: for maximum performance, you can simply check out a Splitgraph image into a PostgreSQL schema. Instead, we have tested the overhead that various sgr workloads add over plain PostgreSQL.

In short:

  • Committing and checking out Splitgraph images takes slightly less time than writing the same data to PostgreSQL tables (sgr moves data directly between PostgreSQL tables without query parsing overhead).
  • Writing to PostgreSQL tables that are change-tracked by sgr is almost 2x slower than writing to untracked tables (sgr uses audit triggers to record changes rather than diffing the table at commit time).
  • Splitgraph images take up much less space (5x-10x less) than equivalent PostgreSQL tables because sgr uses cstore_fdw for storage.
  • Querying Splitgraph images directly without checkout (layered querying) can sometimes be faster and use less IO than querying PostgreSQL tables.
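
The write overhead in the second point comes from recording every change as it happens instead of comparing whole tables at commit time. The following is a toy, in-memory model of that trade-off, not sgr's implementation: `TrackedTable`, its audit log layout and `diff_snapshots` are all hypothetical names invented for this illustration.

```python
class TrackedTable:
    """Toy table keyed by primary key that logs each write, trigger-style."""

    def __init__(self, rows=None):
        self.rows = dict(rows or {})
        self.audit_log = []  # entries of (pk, old_row, new_row)

    def upsert(self, pk, row):
        # Every write pays the extra cost of appending to the audit log.
        self.audit_log.append((pk, self.rows.get(pk), row))
        self.rows[pk] = row

    def delete(self, pk):
        self.audit_log.append((pk, self.rows.pop(pk), None))

    def commit(self):
        """Collapse the log into net changes per key, then reset the log."""
        first_old, last_new = {}, {}
        for pk, old, new in self.audit_log:
            first_old.setdefault(pk, old)  # keep the earliest old value
            last_new[pk] = new             # keep the latest new value
        self.audit_log = []
        return {
            pk: (first_old[pk], last_new[pk])
            for pk in first_old
            if first_old[pk] != last_new[pk]  # drop changes that cancel out
        }


def diff_snapshots(before, after):
    """The alternative: scan and compare two full snapshots at commit time."""
    keys = before.keys() | after.keys()
    return {
        pk: (before.get(pk), after.get(pk))
        for pk in keys
        if before.get(pk) != after.get(pk)
    }


table = TrackedTable({1: "apple", 2: "banana"})
snapshot = dict(table.rows)

table.upsert(2, "blueberry")
table.upsert(3, "cherry")
table.delete(1)
table.upsert(4, "temporary")
table.delete(4)  # inserted and deleted before commit: no net change

changes = table.commit()
assert changes == diff_snapshots(snapshot, table.rows)
```

Both approaches arrive at the same set of changes. In this model, the audit log makes commit cost proportional to the number of writes rather than to table size, at the price of extra work on every write.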

Can sgr be used for big datasets?

Yes. sgr has a few optimizations that make it suitable for working with large datasets:

  • Datasets are partitioned into fragments stored in a columnar format, which is generally better suited to OLAP workloads than row-oriented storage.
  • You can query Splitgraph images without checking them out or even downloading them completely. With layered querying, sgr can lazily download a small fraction of the table needed for the query. This is still completely seamless to the client application.
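
To make the lazy download idea concrete, here is a minimal conceptual sketch. It assumes each fragment carries the minimum and maximum key it contains, so a range query can skip fragments that cannot match; that metadata layout, along with the `Fragment` and `LazyTable` names, is an assumption made for this example and not a description of sgr's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    object_id: str
    min_key: int  # smallest key stored in the fragment (assumed metadata)
    max_key: int  # largest key stored in the fragment (assumed metadata)


class LazyTable:
    """Toy table that fetches only the fragments a range query can touch."""

    def __init__(self, fragments, remote):
        self.fragments = fragments
        self.remote = remote  # object_id -> list of (key, value) rows
        self.local = {}       # fragments downloaded so far

    def _fetch(self, fragment):
        # Stand-in for a download: copy from "remote" on first use only.
        if fragment.object_id not in self.local:
            self.local[fragment.object_id] = self.remote[fragment.object_id]
        return self.local[fragment.object_id]

    def query(self, low, high):
        """Return rows with low <= key <= high."""
        rows = []
        for fragment in self.fragments:
            if fragment.max_key < low or fragment.min_key > high:
                continue  # cannot contain matching rows, so never downloaded
            rows.extend(r for r in self._fetch(fragment) if low <= r[0] <= high)
        return rows


# Three remote fragments of 100 rows each.
remote = {
    "frag_a": [(k, f"row {k}") for k in range(0, 100)],
    "frag_b": [(k, f"row {k}") for k in range(100, 200)],
    "frag_c": [(k, f"row {k}") for k in range(200, 300)],
}
fragments = [
    Fragment("frag_a", 0, 99),
    Fragment("frag_b", 100, 199),
    Fragment("frag_c", 200, 299),
]

lazy_table = LazyTable(fragments, remote)
result = lazy_table.query(120, 130)
downloaded = sorted(lazy_table.local)
```

After this query, only `frag_b` has been fetched, and the caller just sees the matching rows, which is the sense in which the process stays seamless to the client.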

Since sgr is built on top of PostgreSQL, you can use the same methods for horizontally scaling a PostgreSQL deployment to scale a sgr engine.