Splitfiles

Splitfiles are inspired by Dockerfiles and are the simplest way to build Splitgraph images.

Here's an example Splitfile:

FROM splitgraph/geonames:latest IMPORT {
    SELECT
        -- Note we need to escape the closing curly brace here for the
        -- Splitfile executor.
        substring(alternatenames FROM ',([A-Z]{3\}),') AS iata_code,
        name,
        latitude,
        longitude,
        geonameid
    FROM all_countries
    WHERE feature_code = 'AIRP'
    AND country_code = 'US'
    AND substring(alternatenames FROM ',([A-Z]{3\}),') IS NOT NULL
} AS airports

SQL {
    CREATE TABLE us_flights_2009 AS
    SELECT
        origin_airport AS origin_iata,
        ao.name AS origin_name,
        ao.latitude AS origin_lat,
        ao.longitude AS origin_lon,

        destination_airport AS destination_iata,
        ad.name AS destination_name,
        ad.latitude AS destination_lat,
        ad.longitude AS destination_lon,

        SUM(passengers) AS total_passengers,
        SUM(seats) AS total_seats,
        SUM(flights) AS flights
    FROM "splitgraph/domestic_us_flights:latest".flights f
        JOIN airports ao ON f.origin_airport = ao.iata_code
        JOIN airports ad ON f.destination_airport = ad.iata_code
    WHERE
        EXTRACT('year' FROM fly_month) = 2009
        GROUP BY origin_airport, destination_airport,
            ao.name, ao.latitude, ao.longitude,
            ad.name, ad.latitude, ad.longitude
}

They offer multiple advantages over using ad hoc queries or other tools:

  • Dockerfile-like caching: every Splitfile line defines a layer. If data sources and the actual command haven't changed, Splitgraph does not need to rebuild the image.
  • Referencing other datasets: Splitfiles can import data from other Splitgraph images (using the FROM ... IMPORT ... command) or perform joins on other datasets (by referencing them in the SQL command as schemata).
  • Provenance tracking: when you build images with Splitfiles, their metadata includes the sources and commands used to build them. You can inspect images to know exactly where its data came from, and quickly rebuild it when the upstream data changes.
  • Full SQL support: Splitgraph only does some minor rewriting and validation on SQL in Splitfiles and defers to PostgreSQL for running actual queries, which means that all SQL constructs are supported.
  • Efficient data imports: The Splitfile executor uses layered querying to import data. This allows it to satisfy queries to huge remote datasets by only downloading a few fragments.

See the Splitfile introduction or the Splitfile reference for more information on using Splitfiles to build datasets.