
Part 1: Running Seafowl locally

Introduction

In this part of the tutorial, you'll:

  • install Seafowl locally
  • write some toy data to it
  • load and query a real dataset in the Parquet format

Download a Seafowl binary for your system

Seafowl ships as a single binary for Linux, macOS and Windows. You can get release binaries from our GitHub nightly builds page.

Make a directory for Seafowl and its data:

mkdir seafowl
cd seafowl

Download the binary and unzip it:

# Pick one
TRIPLE="unknown-linux-gnu"
# TRIPLE="x86_64-apple-darwin"
# TRIPLE="pc-windows-msvc"
wget "https://nightly.link/splitgraph/seafowl/workflows/nightly/main/seafowl-nightly-$TRIPLE.zip"
unzip seafowl-nightly-x86_64-unknown-linux-gnu.zip
chmod +x ./seafowl   # no need to do this if you're on Windows

Start Seafowl

You don't need a configuration file or any other services to start Seafowl. Just run:

./seafowl   # seafowl.exe if you're on Windows

You will see Seafowl print out a startup log:

INFO  seafowl > Starting Seafowl
INFO  seafowl::config::schema > Writing to Seafowl will require a password. Randomly generated password: BDCANn8w7ZnOzAS1HkFyBpmU3RF2Q25M
INFO  seafowl::config::schema > The SHA-256 hash will be stored in the config as follows:
INFO  seafowl::config::schema > [frontend.http]
INFO  seafowl::config::schema > write_access = "a548f4bfe3c1d696f5f651f7036dd5e9f5e8642281f859d53109a4f27273b49e"
INFO  seafowl                 > Writing a default configuration file to seafowl.toml
INFO  seafowl                 > Starting the HTTP frontend on 127.0.0.1:8080
INFO  seafowl                 > HTTP access settings: read any, write password
INFO  warp::server            > Server::run; addr=127.0.0.1:8080
INFO  warp::server            > listening on http://127.0.0.1:8080

Seafowl is going to store the data (the catalog and the object store [3]) in your current working directory. It also wrote out a configuration file, seafowl.toml.
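
If you're curious, have a look at what ended up in the working directory. The exact file and directory names can vary between Seafowl versions, so treat this as an illustrative check rather than a guarantee:

# List what Seafowl created alongside the binary
ls
# The generated configuration, including the write_access hash mentioned in the log
cat seafowl.toml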

Run some simple queries

Now that we're up and running, let's run a basic SQL query against Seafowl's HTTP interface:

$ curl -H "Content-Type: application/json" localhost:8080/q -d'{"query": "SELECT 1"}'

{"Int64(1)":1}

It works! Let's try some more complex expressions:

$ curl -H "Content-Type: application/json" localhost:8080/q -d'{"query": "SELECT 2*2"}'

{"Int64(2) * Int64(2)":4}

$ curl -H "Content-Type: application/json" localhost:8080/q -d'{"query": "SELECT 2*2 AS result"}'
{"result":4}

Let's list all tables in our database:

$ curl -H "Content-Type: application/json" localhost:8080/q -d'{"query": "SHOW TABLES"}'
{"table_catalog":"default","table_schema":"information_schema","table_name":"tables","table_type":"VIEW"}
{"table_catalog":"default","table_schema":"information_schema","table_name":"columns","table_type":"VIEW"}

There's nothing here apart from information_schema tables, used by database clients to programmatically inspect a database's structure. Let's fix that.

Write some data

Let's create a table and add some data to it.

$ curl -i -H "Content-Type: application/json" localhost:8080/q -d@-<<EOF
{"query": "CREATE TABLE first_table (
    name VARCHAR,
    value DOUBLE,
    other_value INT
)"}
EOF
HTTP/1.1 403 Forbidden
content-length: 15
date: Mon, 22 Aug 2022 14:40:11 GMT

WRITE_FORBIDDEN

What happened? By default, Seafowl doesn't let anonymous users write data to your database. Instead, it requires a password to perform writes. You may have seen it flash in the logs when you first started Seafowl [1]:

INFO  seafowl::config::schema > Writing to Seafowl will require a password. Randomly generated password: BDCANn8w7ZnOzAS1HkFyBpmU3RF2Q25M
INFO  seafowl::config::schema > The SHA-256 hash will be stored in the config as follows:
INFO  seafowl::config::schema > [frontend.http]
INFO  seafowl::config::schema > write_access = "a548f4bfe3c1d696f5f651f7036dd5e9f5e8642281f859d53109a4f27273b49e"
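
As the log says, Seafowl only stores the SHA-256 hash of the password, as the write_access setting in the [frontend.http] section of seafowl.toml. So if you'd rather choose your own password than use the generated one, you can hash it yourself and put the digest there, then restart Seafowl:

# Hash a password of your choosing and paste the hex digest into
# write_access under [frontend.http] in seafowl.toml
echo -n "my-own-password" | sha256sum
# On macOS: echo -n "my-own-password" | shasum -a 256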

Let's try again, now with a password (use your own one here):

$ export YOUR_PASSWORD="BDCANn8w7ZnOzAS1HkFyBpmU3RF2Q25M"
$ curl -i \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $YOUR_PASSWORD" \
localhost:8080/q -d@-<<EOF
{"query": "CREATE TABLE first_table (
    name VARCHAR,
    value DOUBLE,
    other_value INT
)"}
EOF

HTTP/1.1 200 OK
content-type: application/octet-stream
content-length: 0
date: Mon, 22 Aug 2022 14:45:43 GMT

That looks promising. Let's now write some test data into our table:

$ curl -i \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $YOUR_PASSWORD" \
localhost:8080/q -d@-<<EOF
{"query": "INSERT INTO first_table
  VALUES
    ('Alpha', 17, 2),
    ('Alpha', 18, 2),
    ('Beta', 42, 9),
    ('Beta', 18, 4),
    ('Gamma', 2, 6),
    ('Gamma', 8, 1)
"}
EOF

HTTP/1.1 200 OK
content-type: application/octet-stream
content-length: 0
date: Mon, 22 Aug 2022 14:53:50 GMT
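
Since every write needs the same two headers, you may also want an authenticated sibling of the sq helper from earlier. Again, this is just a local shell convenience:

# sqw: like sq, but sends the Authorization header required for writes
sqw() {
  curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $YOUR_PASSWORD" \
    localhost:8080/q -d "{\"query\": \"$1\"}"
}

# Usage, e.g.:
# sqw "CREATE TABLE some_other_table (name VARCHAR)"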

And run some queries on it:

$ curl -H "Content-Type: application/json" localhost:8080/q \
  -d'{"query": "SELECT COUNT(*) FROM first_table"}'

{"COUNT(UInt8(1))":6}

$ curl -H "Content-Type: application/json" localhost:8080/q \
  -d'{"query": "SELECT name, SUM(value) FROM first_table GROUP BY 1"}'
{"name":"Gamma","SUM(first_table.value)":10.0}
{"name":"Alpha","SUM(first_table.value)":35.0}
{"name":"Beta","SUM(first_table.value)":60.0}

$ curl -H "Content-Type: application/json" localhost:8080/q \
 -d'{"query": "SELECT name, other_value, AVG(value) FROM first_table GROUP BY 1,
2 ORDER BY 3 DESC"}'

{"name":"Beta","other_value":9,"AVG(first_table.value)":42.0}
{"name":"Beta","other_value":4,"AVG(first_table.value)":18.0}
{"name":"Alpha","other_value":2,"AVG(first_table.value)":17.5}
{"name":"Gamma","other_value":1,"AVG(first_table.value)":8.0}
{"name":"Gamma","other_value":6,"AVG(first_table.value)":2.0}

Query some real data

We will now experiment with a real dataset. Trase's Supply Chains is a dataset of environmental risks related to commodity supply chains. The dashboard at https://explore.trase.earth/ is powered by Splitgraph, Seafowl's parent project, but we have mirrored the data to a single Parquet file that you'll now import into Seafowl.

Download the file:

wget https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet

Create an "external table" in Seafowl that points to this file:

$ curl \
  -H "Content-Type: application/json" localhost:8080/q \
  -H "Authorization: Bearer $YOUR_PASSWORD" \
  -d@- <<EOF
{"query": "CREATE EXTERNAL TABLE supply_chains \
STORED AS PARQUET \
LOCATION '$(realpath trase-supply-chains.parquet)'"}
EOF

(Here, a relative path would be resolved against the Seafowl server's working directory, so we use realpath to make it absolute.)

External tables are temporary and are created in a special staging schema [2].
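
The LOCATION doesn't have to be a local file, either. Assuming your Seafowl version supports remote (HTTP) locations for external tables, you could point one straight at the public copy of this file instead of downloading it first; the supply_chains_remote name here is just an example:

$ curl \
  -H "Content-Type: application/json" localhost:8080/q \
  -H "Authorization: Bearer $YOUR_PASSWORD" \
  -d@- <<EOF
{"query": "CREATE EXTERNAL TABLE supply_chains_remote \
STORED AS PARQUET \
LOCATION 'https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet'"}
EOF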

We can now query this Parquet file:

$ curl -H "Content-Type: application/json" localhost:8080/q \
 -d'{"query": "SELECT COUNT(*) FROM staging.supply_chains"}'

{"COUNT(UInt8(1))":2989191}

$ curl \
  -H "Content-Type: application/json" localhost:8080/q \
  -d@- <<EOF
{"query": "SELECT country_of_production, COUNT(*) AS count \
FROM staging.supply_chains \
GROUP BY 1 ORDER BY 2 DESC"}
EOF

{"country_of_production":"BRAZIL","count":2386600}
{"country_of_production":"ARGENTINA","count":260293}
{"country_of_production":"INDONESIA","count":155751}
{"country_of_production":"ECUADOR","count":96842}
{"country_of_production":"PARAGUAY","count":29848}
{"country_of_production":"COTE D'IVOIRE","count":27636}
{"country_of_production":"COLOMBIA","count":20820}
{"country_of_production":"PERU","count":8484}
{"country_of_production":"GHANA","count":2111}
{"country_of_production":"BOLIVIA","count":806}

Finally, let's actually ingest the table into Seafowl. Internally, Seafowl also stores tables as Parquet files, but it splits them into partitions and adds extra indexing information to accelerate queries [3]:

$ curl \
  -H "Content-Type: application/json" localhost:8080/q \
  -H "Authorization: Bearer BDCANn8w7ZnOzAS1HkFyBpmU3RF2Q25M" \
  -d@- <<EOF
{"query": "CREATE TABLE supply_chains \
AS SELECT * FROM staging.supply_chains"}
EOF

Now you can query the real Seafowl table:

$ curl \
  -H "Content-Type: application/json" localhost:8080/q \
  -d@- <<EOF
{"query": "SELECT country_of_production, COUNT(*) AS count \
FROM supply_chains \
GROUP BY 1 ORDER BY 2 DESC"}
EOF
{"country_of_production":"BRAZIL","count":2386600}
{"country_of_production":"ARGENTINA","count":260293}
{"country_of_production":"INDONESIA","count":155751}
{"country_of_production":"ECUADOR","count":96842}
{"country_of_production":"PARAGUAY","count":29848}
{"country_of_production":"COTE D'IVOIRE","count":27636}
{"country_of_production":"COLOMBIA","count":20820}
{"country_of_production":"PERU","count":8484}
{"country_of_production":"GHANA","count":2111}
{"country_of_production":"BOLIVIA","count":806}

What's next?

Next, we'll deploy a Seafowl instance to Fly.io, so that your application can run these queries from the user's browser and from anywhere in the world.


  1. Exposing the password in the logs is a bad idea, since logs often get shipped to analysis/alerting tools and stored forever. We do this as a tradeoff between a fresh Seafowl instance being at least somewhat secure and not requiring the user to go through a setup step before starting it. In other cases, when you're configuring Seafowl's HTTP frontend, not explicitly enabling writes disables them.
  2. More on external tables in the dedicated guide.
  3. Find out more about how Seafowl stores data in the learning section.