Aug 9, 2023 · By Grzegorz Rozdzialik

READING TIME: 11 min

Parsing pgSQL with tree-sitter in WASM

We used tree-sitter to show the tables referenced in the query inside the Splitgraph Console.

The Splitgraph Console now analyzes the SQL query on the fly and displays the referenced tables in the sidebar for easier query comprehension. The feature is powered by tree-sitter running in WASM.

Interested in how we got it working? Read on!

The Splitgraph Console offers an IDE-like experience for interacting with Splitgraph and the data it exposes. One fundamental feature is the query editor. It resembles a code editor and is used for editing SQL (PostgreSQL, in fact) queries, rather than files.

The query editor has smart features like:

autocomplete: we use pgcli on the backend to retrieve the completions,
syntax highlighting: we use the SQL Monaco language configuration, which uses the Monarch library under the hood.

One interesting feature we wanted to have was to introspect the query and display the tables it references. This would let the user quickly understand and interact with a subset of repositories and tables, rather than all repositories and tables within some namespace.

We implemented this feature using tree-sitter compiled to WASM (web-tree-sitter) so the parsing happens within the user's browser. The rest of this article outlines the problems faced and the decisions we took to get the feature working.

Parsing in the browser or on the server

The first question we had to answer was whether the parsing process should happen in the browser, like our syntax highlighting, or on the server, like our autocomplete.

There are pros and cons to both approaches:

parsing in the browser has a higher initial cost (fetching and initializing the parser), but offers faster feedback during editing, and transfers less data during long editing sessions
the browser can only execute JavaScript or WASM. The parser would have to be compiled to either of these languages. This limits our choices, as we cannot run arbitrary existing parsers.
either the browser or the server has to do more work to parse the query. In-browser parsing could be slow on low-end devices. With server-side parsing, many users requesting parses at the same time could effectively DDoS our servers.

At this point, either approach would work. We had a slight preference for doing in-browser parsing for its lower latency and being able to use the parsed AST for more features in the future.

Existing SQL parsers

Parsing SQL is a popular problem which has many solutions in various languages.

libpg_query and its bindings

libpg_query is a C library for parsing PostgreSQL queries. It uses the same parser code as the PostgreSQL server, which makes its parsing the most accurate and spec-compliant.

There are bindings for using libpg_query in other languages:

pgsql-parser for Node.js
pg-query-emscripten for WASM (available in the browser)
pg_query.rs for Rust (trivia: this is what Supabase's postgres_lsp uses to parse pgSQL).

The importance of syntax error recovery

One major downside of any library based on libpg_query is the lack of syntax error recovery. If the parser encounters an error, it does not attempt to parse the rest of the code. It all comes down to libpg_query having a single error field per parse tree.

This makes it unsuitable for use in code editors. While editing, the code contains a syntax error most of the time, because the user is still typing a word or is in the middle of editing a more complex statement.

For example, given the following query:

SELECT
    "gid",
    "state"
FROM
    "splitgraph/election-geodata:79724cbf5dcd2b260ac0d60e58135123d527ff4d2e1dc709f6da47fa8f4ee71f"."nation"
LIMIT 100;

When the user wants to select one more column in a SELECT statement:

SELECT
    "gid",
    ,
    -- look, a syntax error. There cannot be a trailing comma above
    "state"
FROM
    "splitgraph/election-geodata:79724cbf5dcd2b260ac0d60e58135123d527ff4d2e1dc709f6da47fa8f4ee71f"."nation"
LIMIT 100;

we should still be able to parse the query and know it references splitgraph/election-geodata despite the query having a syntax error.

This makes syntax error recovery a must-have for our parser.

sql-parser from sql-language-server

sql-parser is a JavaScript parser implemented using Peggy (a JavaScript parser generator). It is part of the sql-language-server.

Lack of syntax error recovery disqualifies this library. It also did not support using schemas in table names (e.g. "splitgraph/election-geodata"."nation" was considered a syntax error, but "splitgraph/election-geodata" was not).

tree-sitter

tree-sitter is a parser generator. The generated parsers include syntax error recovery. When there is a syntax error, only a few AST nodes get marked as ERROR, and the rest of the code is parsed as normal. It also supports incremental parsing. This means re-parsing the code after an edit is more efficient than parsing from scratch, provided we give tree-sitter the previous AST and a description of the modified range of code.
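To make "a description of the modified range" concrete: tree-sitter's JavaScript bindings take an edit object with `startIndex`, `oldEndIndex`, `newEndIndex` and the matching row/column positions, which is passed to `tree.edit(...)` before re-parsing with the old tree. The sketch below derives such an object by comparing the old and new text. `describeEdit` and `positionAt` are hypothetical helper names, not part of tree-sitter, and a real editor would more likely take the range from its change events than diff the whole text. Offsets are plain JavaScript string indices, which is an assumption that holds for ASCII queries.

```javascript
// Convert a string offset into the { row, column } shape used in edit objects.
function positionAt(text, index) {
  let row = 0;
  let lineStart = 0;
  for (let i = 0; i < index; i++) {
    if (text[i] === "\n") {
      row += 1;
      lineStart = i + 1;
    }
  }
  return { row, column: index - lineStart };
}

// Hypothetical helper: find the smallest changed range between two versions
// of the text by trimming their common prefix and common suffix.
function describeEdit(oldText, newText) {
  const maxPrefix = Math.min(oldText.length, newText.length);
  let start = 0;
  while (start < maxPrefix && oldText[start] === newText[start]) {
    start++;
  }

  let oldEnd = oldText.length;
  let newEnd = newText.length;
  while (
    oldEnd > start &&
    newEnd > start &&
    oldText[oldEnd - 1] === newText[newEnd - 1]
  ) {
    oldEnd--;
    newEnd--;
  }

  return {
    startIndex: start,
    oldEndIndex: oldEnd,
    newEndIndex: newEnd,
    startPosition: positionAt(oldText, start),
    oldEndPosition: positionAt(oldText, oldEnd),
    newEndPosition: positionAt(newText, newEnd),
  };
}

const before = 'SELECT "gid"\nFROM "t"';
const after = 'SELECT "gid", "state"\nFROM "t"';
const edit = describeEdit(before, after);
console.log(edit);

// With a real parser and a previously parsed tree, the next steps would be:
//   oldTree.edit(edit);
//   const newTree = parser.parse(after, oldTree);
```

Only the inserted `, "state"` is reported as changed, so the parser can reuse the subtrees for everything else.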

Both of these qualities make tree-sitter well-suited for code editors. In fact, it is used in Neovim, Atom, Emacs, and Helix, to name a few. Tree-sitter also powers basic IDE operations in web versions of VS Code (like https://vscode.dev/) thanks to the vscode-anycode extension. It also powers parsing of bash in the bash-language-server.

See Introduction to tree-sitter for a more detailed description of what tree-sitter can do.

Tree-sitter has bindings in many languages. The ones relevant to us are:

web-tree-sitter for WASM,
node-tree-sitter for Node.js.

As for the grammar, tree-sitter-sql is a well-maintained SQL grammar used in Neovim. In our testing, it successfully parsed the pgSQL we threw at it, and the parts it stumbled on were fixed in just a handful of days.

All of the above led us to decide tree-sitter is the right choice for parsing pgSQL in the Console.

Extracting referenced tables from the query

Tree-sitter parses the query into an AST. The corpus tests of the grammar show the stringified AST node types. Now the challenge becomes extracting the referenced tables from that AST.

Neovim's :InspectTree command is extremely useful for understanding the AST while interactively editing code.

Neovim with the source query in the left window and the parsed AST in the
right window. The `relation` AST node is selected in the AST window and the
table reference is highlighted in the query.

Here we can also see the AST parsed from a query that has a syntax error.

Neovim with the source query in the left window and the parsed AST in the
right window. There is a syntax error in the query, but the parsed AST still contains a `relation`.

Tree-sitter also supports declaratively querying the AST using queries written as S-expressions. It takes away the burden of walking the AST and looking for subtrees that match a certain shape. We ended up with the following query to capture the referenced tables:

(
  relation (
    (
      object_reference
      schema: (identifier)
      name: (identifier)
    ) @reference
  )
)

Neovim with the source query in the left window, the parsed AST in the
right-top window, and the tree-sitter query playground in the right-bottom. The matched `relation` AST node and source code are highlighted.

This query ended up capturing the table reference we care about.
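For comparison, here is roughly what the query saves us from writing by hand. This is only a sketch over a simplified, hypothetical node shape (`type`, `text`, `field`, `children`), not the web-tree-sitter node API, and the sample tree is hand-built rather than produced by a parser. It collects every `object_reference` that sits directly under a `relation` and has both a `schema` and a `name` identifier.

```javascript
// Hypothetical simplified node shape: { type, text, field, children }.
function hasIdentifierField(node, fieldName) {
  return node.children.some(
    (child) => child.field === fieldName && child.type === "identifier"
  );
}

// Walk the whole tree and collect the text of matching table references.
function findTableReferences(node, found = []) {
  if (node.type === "relation") {
    for (const child of node.children) {
      if (
        child.type === "object_reference" &&
        hasIdentifierField(child, "schema") &&
        hasIdentifierField(child, "name")
      ) {
        found.push(child.text);
      }
    }
  }
  for (const child of node.children) {
    findTableReferences(child, found);
  }
  return found;
}

// Hand-built stand-in for the tree of: SELECT "gid" FROM "a/b"."nation"
const leaf = (type, text, field) => ({ type, text, field, children: [] });
const sampleTree = {
  type: "program",
  text: "",
  children: [
    {
      type: "from",
      text: "",
      children: [
        {
          type: "relation",
          text: "",
          children: [
            {
              type: "object_reference",
              text: '"a/b"."nation"',
              children: [
                leaf("identifier", '"a/b"', "schema"),
                leaf("identifier", '"nation"', "name"),
              ],
            },
          ],
        },
      ],
    },
  ],
};

console.log(findTableReferences(sampleTree));
```

The declarative query expresses the same shape in a few lines and leaves the traversal to tree-sitter.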

Extracting parts of a table reference

Tree-sitter gives us back the table reference as a string like "splitgraph/election-geodata:79724cbf5dcd2b260ac0d60e58135123d527ff4d2e1dc709f6da47fa8f4ee71f"."nation". Now it is our job to extract the namespace, repository name, image tag, and the table name from this reference.

This felt like a good job for a regular expression. We ended up with the following one and wrote a bunch of test cases to make sure it also tolerates errors (e.g. a missing table name at the end when the user is still typing it):

const tableReferenceRegExp = /^"((?<namespace>[^/]+)?)(\/((?<repositoryName>[^":]+)?)(:((?<imageTag>[^"]+)?))?"?(\."?((?<tableName>[^"]+)?)"?)?)?$/;

It uses named capturing groups to make extracting the reference parts easier.
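As a quick illustration of how the named groups can be read, here is a minimal sketch that wraps the regular expression above. `parseTableReference` is a hypothetical helper name, and the `latest` image tag is a made-up stand-in for the long image hash used earlier.

```javascript
const tableReferenceRegExp =
  /^"((?<namespace>[^/]+)?)(\/((?<repositoryName>[^":]+)?)(:((?<imageTag>[^"]+)?))?"?(\."?((?<tableName>[^"]+)?)"?)?)?$/;

// Hypothetical helper: returns the named parts, or null when nothing matches.
// Parts the user has not typed yet come back as undefined.
function parseTableReference(reference) {
  const match = tableReferenceRegExp.exec(reference);
  if (match === null) {
    return null;
  }
  const { namespace, repositoryName, imageTag, tableName } = match.groups;
  return { namespace, repositoryName, imageTag, tableName };
}

// A complete reference yields all four parts.
const complete = parseTableReference(
  '"splitgraph/election-geodata:latest"."nation"'
);
console.log(complete);

// A reference that is still being typed: the table name is missing.
const partial = parseTableReference('"splitgraph/election-geodata".');
console.log(partial);
```

In the second call the namespace and repository name are still extracted, while `imageTag` and `tableName` are `undefined`, which is the tolerance to incomplete input described above.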

Result

Tying these pieces together, the Splitgraph Console query editor now supports displaying referenced tables in the sidebar as a tree. It updates nearly instantly. We added a debounce to avoid updating the UI too often.

Splitgraph Console query editor with referenced tables shown in the sidebar.
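The debounce mentioned above can be sketched in a few lines. This is a generic version, not the Console's actual code: the `debounce` helper, the injectable `timers` parameter (there only so the behavior can be checked without waiting), and the 200 ms delay are all assumptions made for illustration.

```javascript
// Default timers simply forward to the global timer functions.
const realTimers = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (id) => clearTimeout(id),
};

// Hypothetical helper: returns a function that postpones `callback` until
// `waitMs` have passed without another call, using the latest arguments.
function debounce(callback, waitMs, timers = realTimers) {
  let pending = null;
  return (...args) => {
    if (pending !== null) {
      timers.clear(pending);
    }
    pending = timers.set(() => {
      pending = null;
      callback(...args);
    }, waitMs);
  };
}

// Assumed usage: call updateSidebar(queryText) on every editor change.
// The sidebar work then runs once, shortly after the user pauses typing.
const updateSidebar = debounce((queryText) => {
  console.log("re-parse and render referenced tables for:", queryText);
}, 200);
```

Since parsing itself is cheap, the debounce mainly limits how often the sidebar re-renders.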

We found that the tree-sitter parsing process usually takes less than 1ms on a laptop. Using 6x CPU throttling in Chrome still leads to a usable UI.

Check it out

Boost your SQL writing experience with live query introspection.

Bonus: supabase/postgres_lsp discussion

As we were working on integrating tree-sitter in the Splitgraph Console, Supabase was working on the postgres_lsp, a language server for PostgreSQL.

Creating a language server involves much more work than just the parser. It can provide diagnostics, autocomplete, semantic highlighting, code actions, "Go to definition", and many other features.

The parser is still at the core of a language server. Omitting error recovery from the parser used in a language server can lead to a frustrating experience for the users.

For example, with semantic highlighting, the user will have a jarring experience if the query stops being highlighted every other keystroke due to intermittent parsing errors.

Autocomplete is also hard to get working if suggestions are only available when the query is valid. Most of the time the user will request suggestions for a half-written query, which, most likely, is not a valid query yet. If the server does not implement syntax error recovery during parsing, it will keep yielding empty or wrong suggestions.

The TypeScript server (tsserver), which is used in the TypeScript Language Server, has advanced syntax error recovery, which offers a stellar TypeScript authoring experience. Syntax error recovery is heavily integrated into its parser.

At the time of writing this article (2023-08-09), postgres_lsp employs some measures to handle syntax errors returned by pg_query.rs. Their LSP is still at a very early stage: it only supports semantic highlighting and crashes quite often, which makes it hard to assess. We are very interested to see whether a parser with no syntax error recovery will be a viable base.

That said, we are extremely excited that there is development in this space.
