Splitgraph has been acquired by EDB! Read the blog post.

Socrata

Introduction

The Socrata data platform hosts tens of thousands of government datasets. Governments large and small publish data on crime, permits, finance, healthcare, research, performance, and more for citizens to use. They publish the data to a white-labeled "data portal" provided by Socrata. There are over 200 of these data portals, and Splitgraph indexes the data from all of them.

sgr has first-class support for querying Socrata datasets through SQL and using them in Splitfiles. Support for Socrata is implemented as a foreign data wrapper. This allows you to use sgr as a PostgreSQL-to-Socrata connector. Any PostgreSQL application, client or dashboarding tool can query Socrata datasets through sgr and even run joins on datasets from different Socrata data portals or between Socrata datasets and Splitgraph images.

Usage

Each dataset has a unique "four-by-four" (a Socrata dataset ID, for example, 28km-gtjn) and is hosted on a certain domain ("data portal"), for example, data.cityofchicago.org. Use the sgr client to mount the dataset:

sgr mount socrata chicago_data -o @- <<EOF
{
    "domain": "data.cityofchicago.org",
    "tables": {"fire_stations": "28km-gtjn"},
    "app_token": "YOUR_APP_TOKEN"
}
EOF

This will create a table fire_stations in chicago_data schema on your engine that, when queried, will rewrite requests into SoQL queries and forward them to the Socrata server, querying the relevant dataset. In this case, it will query the Chicago Fire Stations dataset.

sgr will automatically discover the table schema and other metadata from the Socrata API.

The app token is optional but requests without it are anonymous and Socrata may throttle them. See the Socrata API reference on how to get an app token.

Finally, the tables field is optional as well. Without it, sgr will mount all datasets provided by that domain as foreign tables, giving the tables human-readable names consisting of the Socrata dataset name and the Socrata dataset ID, for example, building_violations_22u3_xenr. This will not download any actual data from Socrata but will let you explore all data on that domain in any PostgreSQL client.

Full mounting takes a few seconds on a domain that serves up 500 datasets.

Note that sgr currently only supports Socrata datasets (not data lenses, calendars, external links etc).

Discovering Socrata datasets

Splitgraph catalog

Splitgraph maintains a metadataset of available Socrata datasets at splitgraph/socrata. The easiest way to search through this catalog from the command line is to clone the data image and query it:

$ sgr clone splitgraph/socrata:latest --download-all
$ sgr sql -i splitgraph/socrata:latest \\
    "SELECT domain, id, name FROM datasets WHERE name ILIKE '%covid%'"

data.austintexas.gov  4p54-9544  Austin Code COVID-19 Complaint Cases
data.calgary.ca       uq24-jkwv  COVID-19 Requests
data.cambridgema.gov  4nyp-vuze  OLD - Confirmed Cambridge COVID-19 Cases - OLD
data.cambridgema.gov  inw8-ircw  Confirmed COVID-19 Cases in Cambridge
data.cambridgema.gov  tdt9-vq5y  COVID-19 Cumulative Cases by Date
data.cdc.gov          9bhg-hcku  Provisional COVID-19 Death Counts by Sex, Age, and State
data.cdc.gov          b58h-s9zx  Provider Relief Fund COVID-19 High-Impact Payments
data.cdc.gov          hc4f-j6nb  Provisional Death Counts for Coronavirus Disease (COVID-19)
data.cdc.gov          hk9y-quqm  Conditions contributing to deaths involving coronavirus disease 2019 (COVID-19), by age group, United States.
data.cdc.gov          kn79-hsxy  Provisional COVID-19 Death Counts in the United States by County
...

This is not an exhaustive list: it is populated from the Socrata Discovery API at https://api.us.socrata.com/api/catalog/v1. See the Socrata Discovery API documentation for reference.

On the web, you can also use the dataset search feature of Splitgraph, explore featured data, or explore data topics.

Other Socrata domains

You can query any Socrata dataset through sgr, not just those that are indexed in the catalog. If you have found a dataset you wish to query (on your internal Socrata instance or through a search engine such as the Open Data Network), just note its domain and dataset ID and use the sgr mount command as normal. As long as the domain is a Socrata data portal with API access, this will work.

Examples

Splitfile

FROM MOUNT socrata
  '{"domain": "data.cityofchicago.org",
    "tables": {"business_licenses": "r5kz-chrr",
               "business_owners": "ezma-pppn"}}'
IMPORT {
  SELECT
    bo.owner_first_name,
    bo.owner_last_name,
    bl.legal_name,
    bl.address,
    bl.city,
    bl.license_id,
    bl.expiration_date
  FROM business_licenses bl
  JOIN business_owners bo
  ON bl.account_number = bo.account_number
  WHERE bl.license_description = 'Tavern'
  AND bl.expiration_date > '2020-05-01'
} AS chicago_bars

This Splitfile mounts two Socrata datasets: Chicago Business Licenses and Chicago Business Owners and runs an IMPORT to join between them and snapshot the data.

The result is a list of all bar owners in Chicago with current liquor licenses.

Exploring data with DBeaver

The Socrata Discover API in conjunction with sgr's mounting capabilities allows you to explore Socrata data from any SQL client.

First, mount Socrata open data endpoints for Chicago, IL and Cambridge, MA:

$ sgr mount socrata chicago -o '{"domain": "data.cityofchicago.org"}'
Connecting to remote server...
Mounting Socrata domain...
Getting Socrata metadata
warning: Requests made without an app_token will be subject to strict throttling limits.
Loaded metadata for 504 Socrata tables

$ sgr mount socrata cambridge -o '{"domain": "data.cambridgema.gov"}'
Connecting to remote server...
Mounting Socrata domain...
Getting Socrata metadata
warning: Requests made without an app_token will be subject to strict throttling limits.
Loaded metadata for 137 Socrata tables

This doesn't actually load any data and only loads metadata for all available datasets, creating PostgreSQL foreign tables for each one of them. Such tables act as lightweight shims and most clients will treat them like normal tables (sometimes introspection may be different, or they may show up in a list of "foreign tables" rather than "tables").

Open a PostgreSQL client (instructions are available for DataGrip, pgAdmin or other clients like pgcli or DBeaver) and connect to the sgr engine. In these screenshots, DBeaver is used.

DBeaver schema overview

Chicago data is now mounted in the chicago schema and Cambridge in cambridge. Every table maps to a Socrata dataset, with the name being automatically generated from the dataset name and the Socrata ID.

DBeaver overview

sgr also grabs descriptions for each dataset and every column from Socrata metadata, if it exists. These descriptions are added as PostgreSQL COMMENT entries on tables and columns.

sgr loads mounted data on demand, similar to layered querying. It only sends queries to Socrata when the client introspects or queries a table.

DBeaver querying overview

In addition, you can use filters, ORDER, LIMIT and OFFSET clauses in your queries, which sgr will rewrite into SoQL queries when sending them to Socrata, so your local sgr instance doesn't need to load the whole dataset.

DBeaver filtering overview

Using Metabase to join and plot data from multiple data portals

In this section, we'll use Metabase, a powerful visualization and dashboarding tool, to plot a table that joins data from two distinct Socrata data portals. Normally, this would involve some ETL jobs and a Jupyter notebook, but with sgr, we'll be able to do this directly in one simple SQL query.

Follow the Metabase integration guide to set up Metabase with sgr and make sure the Chicago and Cambridge data endpoints are mounted (see previous section).

First, let's try plotting the COVID-19 Daily Cases and Deaths Socrata dataset (Socrata ID naz8-j4nc):

SELECT
    lab_report_date, cases_total
FROM
    chicago.covid19_daily_cases_deaths_and_hospitalizations_naz8_j4nc chicago_cases
ORDER BY lab_report_date ASC;

Metabase plot of one dataset

Note that this is run against the chicago schema that has all available tables from the data.cityofchicago.org Socrata domain mounted. You can also mount just the table you're interested in and give it a more readable name:

$ sgr mount socrata -o '{"domain": "data.cityofchicago.org", \\
                         "tables": {"covid_cases": "naz8-j4nc"}}'

Now, let's compare new daily cases in Chicago and Cambridge. You can do this with a simple JOIN. Of course referential integrity is not guaranteed to exist, but assuming the data is properly cleaned, we just have to make sure both tables are in terms of the date, rather than the timestamp, so that they can be matched up:

(Note: If you do need to clean data, you could always load it into a Splitgraph image, perform the necessary modifications on it, and then use that image as the source instead of the mounted data).

SELECT
    cambridge_cases.date AS date,
    chicago_cases.cases_total AS chicago_daily_cases,
    cambridge_cases.new_positive_cases AS cambridge_daily_cases
FROM
    chicago.covid19_daily_cases_deaths_and_hospitalizations_naz8_j4nc chicago_cases
FULL OUTER JOIN
    cambridge.covid19_case_count_by_date_axxk_jvk8 cambridge_cases
ON
    date_trunc('day', chicago_cases.lab_report_date) = cambridge_cases.date::timestamp
ORDER BY date ASC;

Metabase plot of two datasets

Behind the scenes, this queries the Chicago COVID-19 Daily Cases and Deaths and the Cambridge COVID-19 Cumulative Cases by Date Socrata datasets using SoQL and the actual JOIN is performed in PostgreSQL. However, Metabase and any other PostgreSQL client doesn't need to know about that; to them, these look like normal PostgreSQL tables.

Licensing and contact

Socrata datasets are not hosted by Splitgraph (although many are indexed by Splitgraph) and the sgr engine queries them through the public Socrata API. While most datasets hosted by Socrata are in the public domain, you should always check the dataset's metadata for support, licensing information and contact information.