We're not replacing SQL as our favorite query language anytime soon 😄, but we rely on GraphQL to implement many of Splitgraph's internal APIs. Thanks to GraphQL schema stitching, we were able to add new backend services that extend the set of fields published by older services. Clients don't need to specify which service provides which field; the schema stitching logic determines this automatically.
How we use GraphQL at Splitgraph
Projects like Apollo Client, graphql-playground and PostGraphile have made working with GraphQL a smooth experience, but the declarative query language has proven to be even more important than the tooling. It's given us the flexibility to develop services which can be used independently, but also as components implementing the following unified API:
type Query {
  """Returns a single `Repository` identified
  by its `namespace` and `repositoryName`."""
  metadataRepository(
    namespace: String!,
    repositoryName: String!): Repository
  authorizationRepository(
    namespace: String!,
    repositoryName: String!): Repository
  """Returns a single `User` based on their username."""
  authorizationUser(username: String!): User
  metadataUser(username: String!): User
}
type Repository {
  namespace: String!,
  repositoryName: String!,
  description: String,
  url: String,
  """Lists all users who have permission to read this repository."""
  usersWithReadAccess: [User!]!
}
type User {
  username: String!,
  fullName: String,
  """Lists all repositories this user has permission to read."""
  repositories: [Repository!]!
}
The "dashboard" page which greets users upon login is one of the many clients of this API. It needs to display the current user's name as well as a link to each repository they have access to. A single GraphQL query can fetch the required fields:
query Dashboard($username: String!) {
  metadataUser(username: $username) {
    fullName,
    repositories {
      description,
      url
    }
  }
}
Schema stitching in the gateway service
Two separate services store the data requested by the Dashboard query. The client sends the query to the gateway service, which uses schema stitching to combine the APIs of the backing metadata and authorization services into the unified schema shown above. As the gateway service executes the Dashboard query, it queries the backing services.
Metadata service
The metadata service is powered by PostGraphile and predates the authorization service. Before we developed the latter, Splitgraph's internal GraphQL API schema looked similar to the following:
type Repository {
  namespace: String!
  repositoryName: String!
  description: String
  url: String
}
type User {
  username: String!
  fullName: String
}
type Query {
  metadataRepository(
    namespace: String!,
    repositoryName: String!): Repository
  metadataUser(username: String!): User
}
Authorization service
We built the authorization service on Ory Keto to keep track of user-repository permissions.
The authorization service's schema extends the types introduced by the metadata service:
"""This Repository type is merged with the Repository
type declared in the metadata service schema."""
type Repository {
  """namespace and repositoryName are the key fields
  by which the repository types in the two
  subschemas can be merged."""
  namespace: String!
  repositoryName: String!
  usersWithReadAccess: [User!]!
}
"""Similar to Repository, User will be merged
with the metadata service's User type."""
type User {
  """username is the key field of User."""
  username: String!
  repositories: [Repository!]!
}
type Query {
  """authorization service specific fields for retrieving
  User and Repository objects."""
  authorizationRepository(
    namespace: String!,
    repositoryName: String!): Repository
  authorizationUser(username: String!): User
}
What is schema stitching?
The main idea is simple: the different fields of a type may be distributed among several services as long as the corresponding subschemas all use the same type name. By default, the "stitched" schema just combines all the fields of any given type from all subschemas, but accidental type merging can be avoided. Single-service GraphQL schemas operate under the closed world assumption that all fields belonging to a type are declared in the schema. Stitching leads to an open world model where any type may be extended with additional fields declared in a new subschema.
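The default merging rule can be illustrated without any GraphQL library. The sketch below models each subschema as a plain map from type name to field declarations and takes the union of fields for every shared type name. The names TypeMap and stitchTypeMaps are hypothetical and exist only for this illustration; a real stitching library also handles arguments, conflicts and resolvers, which this sketch ignores.

```typescript
// Each subschema is modelled as: type name -> (field name -> field type).
type FieldMap = Record<string, string>;
type TypeMap = Record<string, FieldMap>;

// Hypothetical helper: union the fields of same-named types across subschemas.
function stitchTypeMaps(subschemas: TypeMap[]): TypeMap {
  const stitched: TypeMap = {};
  for (const subschema of subschemas) {
    for (const [typeName, fields] of Object.entries(subschema)) {
      // A later subschema extends the fields collected so far for this type.
      stitched[typeName] = { ...(stitched[typeName] ?? {}), ...fields };
    }
  }
  return stitched;
}

// Field declarations copied from the two subschemas shown earlier.
const metadataTypes: TypeMap = {
  User: { username: "String!", fullName: "String" },
  Repository: {
    namespace: "String!",
    repositoryName: "String!",
    description: "String",
    url: "String",
  },
};
const authorizationTypes: TypeMap = {
  User: { username: "String!", repositories: "[Repository!]!" },
  Repository: {
    namespace: "String!",
    repositoryName: "String!",
    usersWithReadAccess: "[User!]!",
  },
};

const stitchedTypes = stitchTypeMaps([metadataTypes, authorizationTypes]);
// Lists username, fullName and repositories: the fields of both User declarations.
console.log(Object.keys(stitchedTypes["User"] ?? {}));
```

Because the union is keyed on the type name alone, two unrelated types that happen to share a name would be merged as well, which is the accidental merging mentioned above.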
At first, the doubled fields in the Query type may seem superfluous. After all, they have the same return types: User and Repository.
Why they're necessary becomes apparent as soon as we examine how the gateway service processes the Dashboard query.
Querying stitched schemas
The gateway service uses the merging flow to execute queries and delegate to the other services.
With the proper subschema configuration, the schema stitching code can determine which service should be queried for a particular field.
Consider the steps required to compute results for the Dashboard query for user mrDorp.
1. The client submits the Dashboard query to the gateway service. The query selects the fullName and repositories fields on the result of metadataUser. Since metadataUser comes from the metadata service's subschema, the gateway queries it for the fullName field.

metadataUser(username: "mrDorp") {
  __typename
  fullName
  username
}

The username field is added implicitly since it's the key field used to join the User types in the two services. __typename is always added by the schema stitching code. The response to the query is:

{
  "data": {
    "metadataUser": {
      "__typename": "User",
      "username": "mrDorp",
      "fullName": "Ralph Dorp"
    }
  }
}

2. The repositories field of the User type belongs to the authorization service, so the gateway queries it next. The subschema configuration specifies that the top-level authorizationUser field may be queried to get the User fields declared in the service's subschema.

authorizationUser(username: "mrDorp") {
  __typename
  username
  repositories {
    __typename
    namespace
    repositoryName
  }
}

Note that the Dashboard query selected the description and url fields of each Repository object, but these fields are declared by the metadata service schema. In the request headed for the authorization service, all that can be queried are the key fields. Just as in the previous query, the gateway service implicitly adds the username key field to the selection set. The response is the following:

{
  "data": {
    "authorizationUser": {
      "__typename": "User",
      "username": "mrDorp",
      "repositories": [
        {
          "__typename": "Repository",
          "namespace": "austintexas-gov",
          "repositoryName": "austin-high-school-graduation-rates-xeb7-q8v3"
        },
        {
          "__typename": "Repository",
          "namespace": "bts-gov",
          "repositoryName": "county-transportation-profiles-qdmf-cxm3"
        }
      ]
    }
  }
}

3. Having obtained the repositories' key fields, the gateway may query additional Repository fields from the metadata service. It consults the stitching configuration for the top-level field to query, in this case metadataRepository.

metadataRepository(
  namespace: "austintexas-gov",
  repositoryName: "austin-high-school-graduation-rates-xeb7-q8v3") {
  __typename
  namespace
  repositoryName
  description
  url
}

metadataRepository(
  namespace: "bts-gov",
  repositoryName: "county-transportation-profiles-qdmf-cxm3") {
  __typename
  namespace
  repositoryName
  description
  url
}

Just as with the query for the authorizationUser field, the key fields (namespace and repositoryName) and __typename are implicitly added to the query to allow merging of result objects. We created a PostGraphile plugin using GraphQL's DataLoader to combine the two queries above into a single GraphQL query (drop us a line if you're interested in using it). Unfortunately, the current schema does not allow for such optimization, so each repository's metadata fields are queried separately. The metadata service responds with:

{
  "data": {
    "metadataRepository": {
      "__typename": "Repository",
      "namespace": "austintexas-gov",
      "repositoryName": "austin-high-school-graduation-rates-xeb7-q8v3",
      "description": "Graduation rates for Austin high schools for years 2012 to 2016 provided by the Texas Education Agency.",
      "url": "https://www.splitgraph.com/austintexas-gov/austin-high-school-graduation-rates-xeb7-q8v3"
    }
  }
}

{
  "data": {
    "metadataRepository": {
      "__typename": "Repository",
      "namespace": "bts-gov",
      "repositoryName": "county-transportation-profiles-qdmf-cxm3",
      "description": "Profiles of transportation features of U.S. counties",
      "url": "https://www.splitgraph.com/bts-gov/county-transportation-profiles-qdmf-cxm3"
    }
  }
}

4. The gateway now has all the fields required to respond to the Dashboard query. Key fields which weren't selected in the Dashboard query are discarded once the objects have been merged.

{
  "data": {
    "metadataUser": {
      "__typename": "User",
      "fullName": "Ralph Dorp",
      "repositories": [
        {
          "__typename": "Repository",
          "description": "Graduation rates for Austin high schools for years 2012 to 2016 provided by the Texas Education Agency.",
          "url": "https://www.splitgraph.com/austintexas-gov/austin-high-school-graduation-rates-xeb7-q8v3"
        },
        {
          "__typename": "Repository",
          "description": "Profiles of transportation features of U.S. counties",
          "url": "https://www.splitgraph.com/bts-gov/county-transportation-profiles-qdmf-cxm3"
        }
      ]
    }
  }
}
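The walkthrough above can be condensed into a small simulation. The sketch below is not the gateway's code: it replaces both backing services with in-memory functions (an assumption made so the example runs without a network or a GraphQL library), keeps everything synchronous for brevity, and uses the hypothetical helpers pick and resolveDashboard to show the order of calls and the merge on key fields.

```typescript
type Obj = Record<string, unknown>;

// Descriptions taken from the example responses, keyed by "namespace/repositoryName".
const descriptions: Record<string, string> = {
  "austintexas-gov/austin-high-school-graduation-rates-xeb7-q8v3":
    "Graduation rates for Austin high schools for years 2012 to 2016 provided by the Texas Education Agency.",
  "bts-gov/county-transportation-profiles-qdmf-cxm3":
    "Profiles of transportation features of U.S. counties",
};

// In-memory stand-in for the metadata service (assumption: real calls are HTTP requests).
const metadataService = {
  metadataUser: (username: string): Obj => ({
    __typename: "User",
    username,
    fullName: "Ralph Dorp",
  }),
  metadataRepository: (namespace: string, repositoryName: string): Obj => ({
    __typename: "Repository",
    namespace,
    repositoryName,
    description: descriptions[`${namespace}/${repositoryName}`],
    url: `https://www.splitgraph.com/${namespace}/${repositoryName}`,
  }),
};

// In-memory stand-in for the authorization service: it only knows key fields.
const authorizationService = {
  authorizationUser: (username: string) => ({
    __typename: "User",
    username,
    repositories: [
      { __typename: "Repository", namespace: "austintexas-gov", repositoryName: "austin-high-school-graduation-rates-xeb7-q8v3" },
      { __typename: "Repository", namespace: "bts-gov", repositoryName: "county-transportation-profiles-qdmf-cxm3" },
    ],
  }),
};

// Keep __typename plus the fields the client actually selected.
function pick(obj: Obj, fields: string[]): Obj {
  const out: Obj = {};
  for (const field of ["__typename", ...fields]) {
    if (field in obj) out[field] = obj[field];
  }
  return out;
}

function resolveDashboard(username: string): Obj {
  // Step 1: the entry field belongs to the metadata service.
  const user = metadataService.metadataUser(username);
  // Step 2: fetch User.repositories from the authorization service via the key field.
  const authUser = authorizationService.authorizationUser(String(user["username"]));
  // Step 3: one metadata lookup per repository, merged on the repository key fields.
  const repositories = authUser.repositories.map((repo) => ({
    ...repo,
    ...metadataService.metadataRepository(repo.namespace, repo.repositoryName),
  }));
  // Step 4: discard key fields that the Dashboard query did not select.
  return {
    metadataUser: {
      ...pick(user, ["fullName"]),
      repositories: repositories.map((repo) => pick(repo, ["description", "url"])),
    },
  };
}

console.log(JSON.stringify({ data: resolveDashboard("mrDorp") }, null, 2));
```

The printed object has the same shape as the final response in step 4: username, namespace and repositoryName were needed for merging but do not appear in the result.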
Configuring subschemas to stitch
While processing the Dashboard query, the gateway had to decide which service to query for each requested field. This is determined by the subschema configuration.
Below is the simplified code for the gateway service, which is based on the TypeScript examples for remote schemas and schema stitching:
import { stitchSchemas } from "@graphql-tools/stitch";
import { fetch } from "cross-fetch";
import { print, GraphQLSchema } from "graphql";
import { introspectSchema, wrapSchema } from "@graphql-tools/wrap";
import type { AsyncExecutor } from "@graphql-tools/utils/executor";
import { SubschemaConfig } from "@graphql-tools/delegate";

const authorizationServiceUrl = "http://api.splitgrph.com/authorzation/graphql";
const metadataServiceUrl = "http://api.splitgrph.com/metadata/graphql";

// Builds an executor which forwards a GraphQL document to a remote service over HTTP.
const makeRemoteExecutor: (serviceUrl: string) => AsyncExecutor = (
  serviceUrl: string
) => async ({ document, variables }) => {
  const query = print(document);
  const result = await fetch(serviceUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ query, variables }),
  });
  return result.json();
};

// Introspects the remote schema and pairs it with its merge configuration.
const defineSubSchema = async (
  serviceUrl: string,
  merge?: SubschemaConfig["merge"]
): Promise<SubschemaConfig> => {
  const executor = makeRemoteExecutor(serviceUrl);
  const schema = wrapSchema({
    schema: await introspectSchema(executor),
    executor,
  });
  return { executor, schema, merge };
};

// Stitches the authorization and metadata subschemas into the unified schema.
export const makePublicSchema = async (): Promise<GraphQLSchema> =>
  stitchSchemas({
    subschemas: [
      await defineSubSchema(authorizationServiceUrl, {
        User: {
          fieldName: "authorizationUser",
          selectionSet: "{ username }",
          args: ({ username }) => ({ username }),
        },
        Repository: {
          fieldName: "authorizationRepository",
          selectionSet: "{ namespace repositoryName }",
          args: ({ namespace, repositoryName }) => ({
            namespace,
            repositoryName,
          }),
        },
      }),
      await defineSubSchema(metadataServiceUrl, {
        User: {
          fieldName: "metadataUser",
          selectionSet: "{ username }",
          args: ({ username }) => ({ username }),
        },
        Repository: {
          fieldName: "metadataRepository",
          selectionSet: "{ namespace repositoryName }",
          args: ({ namespace, repositoryName }) => ({
            namespace,
            repositoryName,
          }),
        },
      }),
    ],
  });
The GraphQLSchema object returned by makePublicSchema can be used to create an HTTP GraphQL endpoint.
In order to query a subschema, its URL must be known. Conveniently, a GraphQL query can be used to obtain an endpoint's schema. This is what defineSubSchema() does when it calls introspectSchema(). The object passed to defineSubSchema() is the merge configuration. Consider the fragment:
await defineSubSchema(authorizationServiceUrl, {
  User: {
    fieldName: "authorizationUser",
    selectionSet: "{ username }",
    args: ({ username }) => ({ username }),
  }
It declares that when a query requests User fields declared by the authorization service schema, they can be obtained by passing the username field of the pre-existing User object as the argument to the top-level authorizationUser field.
This was employed in step 2 of the query process described earlier, when the User.repositories field was merged with the User object obtained from the metadata service in the previous step.
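To make the role of a merge configuration entry concrete, the sketch below renders the delegated query that such an entry implies for an already fetched User object. The MergeEntry interface and the renderDelegatedQuery helper are hypothetical names introduced for this illustration; the stitching library works on parsed GraphQL documents rather than strings, so this is only a rough approximation of what it does.

```typescript
// A local, simplified stand-in for one entry of the merge configuration.
interface MergeEntry {
  fieldName: string;
  selectionSet: string;
  args: (existing: Record<string, unknown>) => Record<string, unknown>;
}

// Same values as the User entry configured for the authorization subschema.
const authorizationUserMerge: MergeEntry = {
  fieldName: "authorizationUser",
  selectionSet: "{ username }",
  args: ({ username }) => ({ username }),
};

// Hypothetical helper: build the query text sent to the subschema.
function renderDelegatedQuery(
  entry: MergeEntry,
  existing: Record<string, unknown>,
  requestedFields: string[]
): string {
  // The args function turns key fields of the existing object into field arguments.
  const renderedArgs = Object.entries(entry.args(existing))
    .map(([name, value]) => `${name}: ${JSON.stringify(value)}`)
    .join(", ");
  return `{ ${entry.fieldName}(${renderedArgs}) { ${requestedFields.join(" ")} } }`;
}

// The partial User object returned by the metadata service in step 1.
const partialUser = { __typename: "User", username: "mrDorp", fullName: "Ralph Dorp" };

console.log(
  renderDelegatedQuery(authorizationUserMerge, partialUser, [
    "username",
    "repositories { __typename namespace repositoryName }",
  ])
);
```

The rendered text corresponds to the authorizationUser request of step 2: the username key of the existing object becomes the argument, and the requested selection names only fields the authorization service declares.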
fieldName specifies the field of the top-level Query type to consult. selectionSet defines the set of fields to be selected from the existing User object to get authorizationUser's arguments, typically the key field or fields. selectionSet could contain fields not yet available on the current User instance. In such a case, an additional subschema query would be made before the field referred to by fieldName is queried.
The args function transforms the existing fields on the User object to be used as arguments to authorizationUser. In most cases, this is the identity function, but it can be useful for things like string to number conversion, especially when stitching schemas for APIs outside of our control. For example, one could extend the types in the official Shopify GraphQL Schema with custom fields routed to an internal service.
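As an illustration of a non-identity args function, the sketch below converts a string key into a number before it is passed on. Everything in it is hypothetical: the Product type, its legacyId field, and the inventoryProduct top-level field are invented for this example and do not come from Shopify's or Splitgraph's schema. The MergeEntry interface is a simplified local stand-in for the library's merge configuration type.

```typescript
// Simplified local stand-in for one entry of the merge configuration.
interface MergeEntry {
  fieldName: string;
  selectionSet: string;
  args: (existing: Record<string, unknown>) => Record<string, unknown>;
}

// Hypothetical: the external API exposes Product.legacyId as a string,
// while an internal service expects inventoryProduct(id: Int!).
const productMerge: MergeEntry = {
  fieldName: "inventoryProduct",
  selectionSet: "{ legacyId }",
  args: ({ legacyId }) => {
    const id = Number.parseInt(String(legacyId), 10);
    if (Number.isNaN(id)) {
      throw new Error(`legacyId is not numeric: ${String(legacyId)}`);
    }
    return { id };
  },
};

// The existing object as it might arrive from the external subschema.
console.log(productMerge.args({ __typename: "Product", legacyId: "42" }));
```

Failing loudly on a non-numeric key is one possible design choice; returning null for the merged fields instead would be another.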
It would also be possible to start by querying the authorization service:
query Dashboard2($username: String!) {
  authorizationUser(username: $username) {
    fullName,
    repositories {
      description,
      url
    }
  }
}
The steps to execute the Dashboard2 query would be:
- Query the authorization service for the username key field and the repositories field.
- Merge the User.fullName field into the existing User object instance by querying metadataUser.
- Merge the url and description fields into the existing Repository objects using metadataRepository.
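The order of service calls for both entry points can be derived from two facts: which service declares each field, and which top-level field each service exposes per type. The sketch below encodes those facts from the schemas above and computes the call order. The planCalls function and its tables are hypothetical simplifications written for this post's two types; the stitching library plans delegation in a more general way.

```typescript
type Service = "metadata" | "authorization";

// Which service declares each non-key field (taken from the two subschemas).
const fieldOwner: Record<string, Service> = {
  "User.fullName": "metadata",
  "User.repositories": "authorization",
  "Repository.description": "metadata",
  "Repository.url": "metadata",
  "Repository.usersWithReadAccess": "authorization",
};

// Top-level field per service and type (the fieldName values of the merge configuration).
const entryPoint: Record<Service, Record<string, string>> = {
  metadata: { User: "metadataUser", Repository: "metadataRepository" },
  authorization: { User: "authorizationUser", Repository: "authorizationRepository" },
};

function serviceOfRootField(rootField: string): Service {
  for (const service of ["metadata", "authorization"] as const) {
    if (entryPoint[service]["User"] === rootField) return service;
  }
  throw new Error(`Unknown root field: ${rootField}`);
}

// Hypothetical planner: list the top-level fields the gateway would call, in order.
function planCalls(
  rootField: string,
  userFields: string[],
  repositoryFields: string[]
): string[] {
  const calls = [rootField];
  const add = (call: string | undefined): void => {
    if (call !== undefined && !calls.includes(call)) calls.push(call);
  };
  const startService = serviceOfRootField(rootField);
  // User fields owned by another service require a merge via that service's User field.
  for (const field of userFields) {
    const owner = fieldOwner[`User.${field}`];
    if (owner !== undefined && owner !== startService) add(entryPoint[owner]["User"]);
  }
  if (userFields.includes("repositories")) {
    // Repository objects originate from the service that declares User.repositories.
    const repositorySource = fieldOwner["User.repositories"];
    for (const field of repositoryFields) {
      const owner = fieldOwner[`Repository.${field}`];
      if (owner !== undefined && owner !== repositorySource) {
        add(entryPoint[owner]["Repository"]);
      }
    }
  }
  return calls;
}

// Dashboard: metadataUser, then authorizationUser, then metadataRepository.
console.log(planCalls("metadataUser", ["fullName", "repositories"], ["description", "url"]));
// Dashboard2: authorizationUser, then metadataUser, then metadataRepository.
console.log(planCalls("authorizationUser", ["fullName", "repositories"], ["description", "url"]));
```

Both plans end with metadataRepository, since the description and url fields are declared only by the metadata service regardless of where the query starts.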
Conclusion
Schema stitching has enabled us to extend an existing service's GraphQL API with new fields served by a new service. The "stitching" is seamless in the sense that clients don't need to know which field belongs to which service's subschema. We evolved our API without affecting earlier queries.