Socrata
Introduction
The Socrata data platform hosts tens of thousands of government datasets. Governments large and small publish data on crime, permits, finance, healthcare, research, performance, and more for citizens to use. They publish the data to a white-labeled "data portal" provided by Socrata. There are over 200 of these data portals, and Splitgraph indexes the data from all of them.
Splitgraph has first-class support for querying Socrata datasets through SQL and using them in Splitfiles. Support for Socrata is implemented as a foreign data wrapper. This allows you to use Splitgraph as a PostgreSQL-to-Socrata connector. Any PostgreSQL application, client or dashboarding tool can query Socrata datasets through Splitgraph and even run joins on datasets from different Socrata data portals or between Socrata datasets and Splitgraph images.
Usage
Each dataset has a unique "four-by-four" (a Socrata dataset ID, for example, 28km-gtjn
) and is hosted on a certain domain ("data portal"), for example, data.cityofchicago.org
. Use the sgr
client to mount the dataset:
sgr mount socrata chicago_data -o @- <<EOF
{
"domain": "data.cityofchicago.org",
"tables": {"fire_stations": "28km-gtjn"},
"app_token": "YOUR_APP_TOKEN"
}
EOF
This will create a table fire_stations
in chicago_data
schema on your engine that, when queried,
will rewrite requests into SoQL queries and forward them to the Socrata server,
querying the relevant dataset. In this case, it will query the Chicago Fire Stations dataset.
Splitgraph will automatically discover the table schema and other metadata from the Socrata API.
The app token is optional but requests without it are anonymous and Socrata may throttle them. See the Socrata API reference on how to get an app token.
Finally, the tables
field is optional as well. Without it, Splitgraph will mount all datasets provided
by that domain as foreign tables, giving the tables human-readable names consisting of the Socrata dataset name and the Socrata dataset ID, for example, building_violations_22u3_xenr
. This will not download any actual data from Socrata but will let you explore all data on that domain in any PostgreSQL client.
Full mounting takes a few seconds on a domain that serves up 500 datasets.
Note that Splitgraph currently only supports Socrata datasets (not data lenses, calendars, external links etc).
Discovering Socrata datasets
Splitgraph catalog
Splitgraph maintains a metadataset of available Socrata datasets on Splitgraph Cloud at splitgraph/socrata
. The easiest way to search through this catalog from the command line is to clone the data image and query it:
$ sgr clone splitgraph/socrata:latest --download-all
$ sgr sql -i splitgraph/socrata:latest \
"SELECT domain, id, name FROM datasets WHERE name ILIKE '%covid%'"
data.austintexas.gov 4p54-9544 Austin Code COVID-19 Complaint Cases
data.calgary.ca uq24-jkwv COVID-19 Requests
data.cambridgema.gov 4nyp-vuze OLD - Confirmed Cambridge COVID-19 Cases - OLD
data.cambridgema.gov inw8-ircw Confirmed COVID-19 Cases in Cambridge
data.cambridgema.gov tdt9-vq5y COVID-19 Cumulative Cases by Date
data.cdc.gov 9bhg-hcku Provisional COVID-19 Death Counts by Sex, Age, and State
data.cdc.gov b58h-s9zx Provider Relief Fund COVID-19 High-Impact Payments
data.cdc.gov hc4f-j6nb Provisional Death Counts for Coronavirus Disease (COVID-19)
data.cdc.gov hk9y-quqm Conditions contributing to deaths involving coronavirus disease 2019 (COVID-19), by age group, United States.
data.cdc.gov kn79-hsxy Provisional COVID-19 Death Counts in the United States by County
...
This is not an exhaustive list: it is populated from the Socrata Discovery API at https://api.us.socrata.com/api/catalog/v1. See the Socrata Discovery API documentation for reference.
On the web, you can also use the dataset search feature of Splitgraph Cloud, explore featured data, or explore data topics.
Other Socrata domains
You can query any Socrata dataset through Splitgraph, not just those that are indexed in the catalog.
If you have found a dataset you wish to query (on your internal Socrata instance or through a search
engine such as the Open Data Network), just note its domain and
dataset ID and use the sgr mount
command as normal. As long as the domain is a Socrata data portal
with API access, this will work.
Examples
Splitfile
FROM MOUNT socrata
'{"domain": "data.cityofchicago.org",
"tables": {"business_licenses": "r5kz-chrr",
"business_owners": "ezma-pppn"}}'
IMPORT {
SELECT
bo.owner_first_name,
bo.owner_last_name,
bl.legal_name,
bl.address,
bl.city,
bl.license_id,
bl.expiration_date
FROM business_licenses bl
JOIN business_owners bo
ON bl.account_number = bo.account_number
WHERE bl.license_description = 'Tavern'
AND bl.expiration_date > '2020-05-01'
} AS chicago_bars
This Splitfile mounts two Socrata datasets: Chicago Business Licenses and Chicago Business Owners and runs an IMPORT
to join between them and snapshot the data.
The result is a list of all bar owners in Chicago with current liquor licenses.
Exploring data with DBeaver
The Socrata Discover API in conjunction with Splitgraph's mounting capabilities allows you to explore Socrata data from any SQL client.
First, mount Socrata open data endpoints for Chicago, IL and Cambridge, MA:
$ sgr mount socrata chicago -o '{"domain": "data.cityofchicago.org"}'
Connecting to remote server...
Mounting Socrata domain...
Getting Socrata metadata
warning: Requests made without an app_token will be subject to strict throttling limits.
Loaded metadata for 504 Socrata tables
$ sgr mount socrata cambridge -o '{"domain": "data.cambridgema.gov"}'
Connecting to remote server...
Mounting Socrata domain...
Getting Socrata metadata
warning: Requests made without an app_token will be subject to strict throttling limits.
Loaded metadata for 137 Socrata tables
This doesn't actually load any data and only loads metadata for all available datasets, creating PostgreSQL foreign tables for each one of them. Such tables act as lightweight shims and most clients will treat them like normal tables (sometimes introspection may be different, or they may show up in a list of "foreign tables" rather than "tables").
Open a PostgreSQL client (instructions are available for DataGrip, pgAdmin or other clients like pgcli or DBeaver) and connect to the Splitgraph engine. In these screenshots, DBeaver is used.
Chicago data is now mounted in the chicago
schema and Cambridge in cambridge
. Every table maps
to a Socrata dataset, with the name being automatically generated from the dataset name and the
Socrata ID.
Splitgraph also grabs descriptions for each dataset and every column from Socrata metadata,
if it exists. These descriptions are added as PostgreSQL COMMENT
entries on tables and columns.
Splitgraph loads mounted data on demand, similar to layered querying. It only sends queries to Socrata when the client introspects or queries a table.
In addition, you can use filters, ORDER
, LIMIT
and OFFSET
clauses in your queries, which Splitgraph will rewrite into SoQL queries when sending them to Socrata, so your local Splitgraph instance
doesn't need to load the whole dataset.
Using Metabase to join and plot data from multiple data portals
In this section, we'll use Metabase, a powerful visualization and dashboarding tool, to plot a table that joins data from two distinct Socrata data portals. Normally, this would involve some ETL jobs and a Jupyter notebook, but with Splitgraph, we'll be able to do this directly in one simple SQL query.
Follow the Metabase integration guide to set up Metabase with Splitgraph and make sure the Chicago and Cambridge data endpoints are mounted (see previous section).
First, let's try plotting the COVID-19 Daily Cases and Deaths Socrata dataset (Socrata ID naz8-j4nc
):
SELECT
lab_report_date, cases_total
FROM
chicago.covid19_daily_cases_deaths_and_hospitalizations_naz8_j4nc chicago_cases
ORDER BY lab_report_date ASC;
Note that this is run against the chicago
schema that has all available tables from the
data.cityofchicago.org
Socrata domain mounted. You can also
mount just the table you're interested in and give it a more readable name:
$ sgr mount socrata -o '{"domain": "data.cityofchicago.org", \
"tables": {"covid_cases": "naz8-j4nc"}}'
Now, let's compare new daily cases in Chicago and Cambridge. You can do this with a simple JOIN
. Of course referential integrity is not guaranteed to exist, but assuming the data is properly cleaned, we just have to make sure both tables are in terms of the date, rather than the timestamp, so that they can be matched up:
(Note: If you do need to clean data, you could always load it into a Splitgraph image, perform the necessary modifications on it, and then use that image as the source instead of the mounted data).
SELECT
cambridge_cases.date AS date,
chicago_cases.cases_total AS chicago_daily_cases,
cambridge_cases.new_positive_cases AS cambridge_daily_cases
FROM
chicago.covid19_daily_cases_deaths_and_hospitalizations_naz8_j4nc chicago_cases
FULL OUTER JOIN
cambridge.covid19_case_count_by_date_axxk_jvk8 cambridge_cases
ON
date_trunc('day', chicago_cases.lab_report_date) = cambridge_cases.date::timestamp
ORDER BY date ASC;
Behind the scenes, this queries the Chicago COVID-19 Daily Cases and Deaths and the Cambridge COVID-19 Cumulative Cases by Date Socrata datasets using SoQL and the actual JOIN is performed in PostgreSQL. However, Metabase and any other PostgreSQL client doesn't need to know about that; to them, these look like normal PostgreSQL tables.
Licensing and contact
Socrata datasets are not hosted by Splitgraph (although many are indexed by Splitgraph) and the Splitgraph engine queries them through the public Socrata API. While most datasets hosted by Socrata are in the public domain, you should always check the dataset's metadata for support, licensing information and contact information.