Splitgraph has been acquired by EDB! Read the blog post.

Part 2: Deploying to Fly.io

Introduction

In this part of the tutorial, you'll:

  • deploy Seafowl to Fly.io
  • add a persistent volume to it to support writes
  • add the dataset from the previous part to it

Set up Fly.io

Fly.io is a platform for deploying applications around the world. Their free tier allows us to deploy up to 2 shared-cpu VMs with 256MB of RAM each, which is good enough for our purposes.

Follow the Fly docs to:

Normally, you shouldn't need to add a payment method to Fly in order to utilize the free tier, but in some cases you might be asked for one to prevent abuse. If you stay below the free limit, your card won't be charged.

Deploy Seafowl

Make a new directory for our Fly app. It doesn't have to be inside of the directory you created in the previous step.

mkdir seafowl-fly

To deploy Seafowl to Fly we need a Docker image. Luckily, we provide pre-built Docker images with Seafowl1.

Run this:

fly launch --image splitgraph/seafowl:nightly --now

(Tested on flyctl v0.1.63)

Follow the fly CLI instructions to create and deploy the app:

  • For the name, use anything you want (we used seafowl)
  • Pick any region you want (preferably the one closest to you)

This is what the output should look like:

$ fly launch
Creating app in seafowl-fly
? App Name (leave blank to use an auto-generated name): seafowl
? Select organization: Seafowl
? Select region: lhr (London, United Kingdom)
Created app seafowl in organization personal
Wrote config file fly.toml

Deploying seafowl
==> Validating app configuration
--> Validating app configuration done
Services
TCP 80/443 ⇢ 8080
Searching for image 'splitgraph/seafowl:nightly' remotely...
image found: img_n37qpoxm8ynpmz69
Image: registry-1.docker.io/splitgraph/seafowl:nightly
Image size: 49 MB
==> Creating release
...
==> Monitoring deployment

 1 desired, 1 placed, 1 healthy, 0 unhealthy [health checks: 1 total, 1 passing]
--> v1 deployed successfully

Query the instance

You can now find out the domain your Seafowl instance was deployed to:

$ fly status

App
  Name     = seafowl
  Owner    = personal
  Version  = 2
  Status   = running
  Hostname = seafowl.fly.dev

Run a query against it (replace seafowl.fly.dev with your domain):

$ curl -i \
  -H "Content-Type: application/json" \
  https://seafowl.fly.dev/q \
  -d'{"query": "SELECT 2*2"}'

{"Int64(2) * Int64(2)":4}

Make the instance writeable

This Seafowl instance is read-only, but we do want it to have our dataset from the previous step.

First, we need to make your instance writeable2. Seafowl supports being able to source configuration from environment variables, so let's use that to enable writes.

Generate a new password and set it as a Fly secret:

pw=$(< /dev/urandom LC_ALL=C tr -dc A-Za-z0-9 | head -c${1:-32};echo -n)
pw_hash=$(echo -n $pw | sha256sum - | head -c 64)
echo -e "Password: $pw"
export YOUR_PASSWORD=$pw

fly secrets set "SEAFOWL__FRONTEND__HTTP__WRITE_ACCESS=$pw_hash"

This will also redeploy the application to pick up the password and enable writes.

Note: You set the secret to the sha256 hash of the plaintext $YOUR_PASSWORD. Make sure to save the plaintext value of $YOUR_PASSWORD (rather than its hash), because that's what you'll need to make authenticated requests to your Seafowl instance.

Add persistent storage

You can now write to Seafowl! However, since there's no persistent storage, your data will be lost as soon as the application is restarted or redeployed.

Let's add a volume to Seafowl so that it can persist writes to it. You get 1GB for free with Fly, which is also the minimum volume size.

Use the same region as where you deployed Seafowl initially (in our case, it's lhr - if you aren't sure of your region, you can see it with fly status):

fly volumes create seafowl_data --region lhr --size 1

Edit the fly.toml file and add this section to link the volume to Seafowl:

[mounts]
  source="seafowl_data"
  destination="/seafowl-data"

And, redeploy your application!

fly deploy

The data that you now write to Seafowl will be persisted in your Fly.io volume.

Add the dataset

Do you remember the Supply Chains dataset from the previous part? Let's add it to Seafowl. Instead of downloading and re-uploading it, we can tell Seafowl to query it directly from HTTP:

$ curl -i \
  -H "Content-Type: application/json" https://seafowl.fly.dev/q \
  -H "Authorization: Bearer $YOUR_PASSWORD" \
  -d@- <<EOF
{"query": "CREATE EXTERNAL TABLE supply_chains \
STORED AS PARQUET \
LOCATION 'https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet'"}
EOF

You could now query it as-is (Seafowl will cache the parts of the Parquet file that you query), but it'll be faster if we also ingest it into Seafowl:

$ curl -v \
  -H "Content-Type: application/json" https://seafowl.fly.dev/q \
  -H "Authorization: Bearer $YOUR_PASSWORD" \
  -d@- <<EOF
{"query": "CREATE TABLE supply_chains \
AS SELECT * FROM staging.supply_chains"}
EOF

This should take about a minute3.

Query it

Now, you can run exactly the same queries as in the previous part, except this time, this dataset is queryable from anywhere in the world:

$ curl \
  -H "Content-Type: application/json" https://seafowl.fly.dev/q \
  -d@- <<EOF
{"query": "SELECT country_of_production, COUNT(*) AS count \
FROM supply_chains \
GROUP BY 1 ORDER BY 2 DESC"}
EOF

{"country_of_production":"BRAZIL","count":2386600}
{"country_of_production":"ARGENTINA","count":260293}
{"country_of_production":"INDONESIA","count":155751}
{"country_of_production":"ECUADOR","count":96842}
{"country_of_production":"PARAGUAY","count":29848}
{"country_of_production":"COTE D'IVOIRE","count":27636}
{"country_of_production":"COLOMBIA","count":20820}
{"country_of_production":"PERU","count":8484}
{"country_of_production":"GHANA","count":2111}
{"country_of_production":"BOLIVIA","count":806}

Congratulations! You now have a single-node Seafowl deployment running on Fly.io, serving up a SQL API to a real dataset!

What's next?

If you were to now publish a dashboard that uses this API, every visit to it would mean Seafowl would have to rerun all queries used by the dashboard.

But, even if it's an interactive dashboard, the same queries often get run multiple times. In this case, we can vastly improve performance with caching.

Seafowl relies on HTTP caches, the browser's cache and CDNs and doesn't cache query results itself.

In the next part of the guide, we'll query Seafowl's cached GET API. This is the main building block that we need to get Seafowl working with HTTP caches.


  1. You can of course bring your own Dockerfile. See some examples on our GitHub for inspiration.
  2. There are other ways you can add a dataset to Seafowl. For example, you can bake the dataset directly into the Docker image. This limits the size of your dataset but lets you scale Seafowl indefinitely. You can also upload a CSV or a Parquet file to Seafowl without using CREATE EXTERNAL TABLE.
  3. Note that due to the strict free tier RAM limit on Fly (256MB), this upload can very occasionally fail when Seafowl indexes the data. In that case, try again or let us know in our GitHub issues