Splitgraph has been acquired by EDB! Read the blog post.

Part 4 (option 2): Delivering query results globally with Cloudflare

Introduction

In this part of the tutorial, we'll put your Seafowl deployment behind Cloudflare CDN. This will let you use Cloudflare's edge network to cache and deliver query results globally.

You need to have a domain that's managed by Cloudflare for this to work. As an alternative, you can also set up Varnish in front of Seafowl.

Add a certificate on Fly.io

Go to the Fly.io dashboard1 for your application, for example, https://fly.io/apps/seafowl.

Click on "Certificates", then "Add certificate":

Enter a hostname for your application. For example, if you have a domain yourdomain.io, you can use seafowl.yourdomain.io. Then, click "Create".

We now need to verify your domain ownership with Fly.io and direct visitors to Seafowl. Click on "View" next to your certificate. You should see verification instructions.

Set up the subdomain on Cloudflare

Go to your Cloudflare dashboard:

Add the required CNAME (ownership verification) and A/AAAA records. Make sure Cloudflare proxying is disabled for both the CNAME record, and the A/AAAA records (grey cloud)!

Finish the setup

Go back to Fly.io and click "Check again" on the certificate dialog. You should see green bullet points next to "Confirm domain ownership" and "Direct visitors to application". This normally takes effect instantly, but could take several minutes:

Finally, enable proxying on Cloudflare again:

Troubleshooting: Redirect Loop

Try sending a simple query to your domain which is now proxied through Cloudflare to Fly:

curl -L -i \
  -H "Content-Type: application/json" \
  https://seafowl.yourdomain.io/q \
  -d'{"query": "SELECT 2*3"}'

Do you get stuck in an infinite redirect loop? This can happen when Cloudflare is trying to connect to http of your Fly app, and Fly is redirecting it to https, causing Cloudflare to redirect you to the same URL you requested.

To fix it, edit fly.toml to set force_https = false:

# ...

  [[services.ports]]
-     force_https = true
+     force_https = false
    handlers = ["http"]

Then redeploy your app:

flyctl deploy

Query Seafowl through Cloudflare

Use the query.sh script from the previous part and pass SEAFOWL_HOST=https://seafowl.yourdomain.io to query Seafowl:

export SEAFOWL_HOST=https://demo.seafowl.io
./query.sh "SELECT name, SUM(value) AS sum FROM test_cache GROUP BY 1"

HTTP/2 200
...
cf-cache-status: DYNAMIC
server: cloudflare
cf-ray: 74248dc99aca54e1-MAN

{"name":"Alpha","sum":35.0}
{"name":"Beta","sum":60.0}
{"name":"Gamma","sum":10.0}
{"name":"Delta","sum":83.0}
{"name":"Kappa","sum":16.0}

That's weird. The cf-cache-status: DYNAMIC header means Cloudflare doesn't think this resource is cacheable.

This is because Cloudflare caches responses based on their extension by default. Files without an extension aren't cached at all.

We could use Cloudflare's Page Rules to get it to cache everything on our domain. Or, we could add one of these extensions to our URL. Let's do that instead.

Edit query.sh

Change the last line in query.sh to request a query result with a .csv extension. Seafowl ignores the extension and doesn't use it to determine the response format, but it will get Cloudflare to cache Seafowl responses.

- curl -i "${headers[@]}" "$SEAFOWL_HOST/q/$hash"
+ curl -i "${headers[@]}" "$SEAFOWL_HOST/q/$hash.csv"

Run the query again

./query.sh "SELECT name, SUM(value) AS sum FROM test_cache GROUP BY 1"
...
cache-control: max-age=14400
cf-cache-status: MISS
server: cloudflare
cf-ray: 7424955fa85f35fb-MAN

{"name":"Alpha","sum":35.0}
{"name":"Beta","sum":60.0}
{"name":"Gamma","sum":10.0}
{"name":"Delta","sum":83.0}
{"name":"Kappa","sum":16.0}

That's better. You'll notice Cloudflare injected a Cache-Control header of its own into the response. This is for the downstream clients, like your user's browser, to know how long to keep the result in the cache for before contacting Cloudflare again.

The second time you run the query, its results will be cached by Cloudflare2:

cf-cache-status: HIT
age: 58

{"name":"Alpha","sum":35.0}
{"name":"Beta","sum":60.0}
{"name":"Gamma","sum":10.0}
{"name":"Delta","sum":83.0}
{"name":"Kappa","sum":16.0}

Test cache invalidation

Let's add more rows to the dataset. Note that we can't use the cached GET endpoint to perform writes:

curl -i \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $YOUR_PASSWORD" \
https://demo.seafowl.io/q -d@-<<EOF
{"query": "INSERT INTO test_cache
  VALUES
    ('Epsilon', 7.4, 49)
"}
EOF

Sadly, the new data won't propagate to Cloudflare immediately, as Seafowl currently doesn't send Cache-Control headers that can force Cloudflare to always revalidate query results. If we were to wait for a few hours and rerun the query, we'd see this3:

cf-cache-status: INVALIDATED
age: 0

{"name":"Epsilon","sum":49.0}
{"name":"Alpha","sum":35.0}
{"name":"Beta","sum":60.0}
{"name":"Gamma","sum":10.0}
{"name":"Delta","sum":83.0}
{"name":"Kappa","sum":16.0}

Troubleshooting: Error 520 from Cloudflare

This is a mysterious error which Cloudflare returns "when the origin server returns an empty, unknown, or unexpected response to Cloudflare.".

This has been known to happen when hosting Seafowl as a Fly app with a shared IPv4 address. The exact trigger of the bug appears to be sending an HTTP request with more than 4 request headers through Cloudflare to Fly, for example:

# this works fine and returns a 200
curl -i 'https://seafowl.yourdomain.io/q/7e323062f1a298c84cbfa790e4157c710b95bfe5e3f6573302881f0a049c485e' -X 'OPTIONS' \
-H 'access-control-request-headers: content-type,x-seafowl-query' \
-H 'access-control-request-method: GET' \
-H 'origin: http://localhost:8000' \
-H 'sec-fetch-mode: cors'

# this (one more header) returns a 520 and never reaches the app (no logs)
curl -i 'https://seafowl.yourdomain.io/q/7e323062f1a298c84cbfa790e4157c710b95bfe5e3f6573302881f0a049c485e' -X 'OPTIONS' \
-H 'access-control-request-headers: content-type,x-seafowl-query' \
-H 'access-control-request-method: GET' \
-H 'origin: http://localhost:8000' \
-H 'sec-fetch-mode: cors' \
-H 'another-header: yes-please'

The error does not happen when sending the request directly to your Fly app (e.g. seafowl.fly.dev).

You are most likely to notice this when querying from the browser, since it adds more request headers than when you're just querying through curl.

Until this bug is fixed, the best solution is to add a dedicated IPv4 address to your Fly app, which costs $2 per month. Once you've signed up for the paid plan, you can add a dedicated IPv4 address to your Fly app with the fly CLI:

fly ips allocate-v4

Once you have a dedicated IPv4, make sure you edit the DNS settings in Cloudflare to use that address for your app.

Next steps

It's time to upgrade from the command line to the browser. In the next part, we'll put everything we learned together to query Seafowl directly from the user's browser and build a beautiful dynamic visualization with Observable.


  1. Thanks to this blog post for the step-by-step guide.
  2. This is not reliable: you might hit a different Cloudflare server and get another MISS.
  3. See the Cloudflare reference for all possible values of CF-Cache-Status.