splitgraph.core package

Submodules

splitgraph.core.engine module

Routines for managing Splitgraph engines, including looking up repositories and managing objects.

splitgraph.core.engine.get_current_repositories(engine)

Lists all repositories currently in the engine.

Parameters:engine – Engine
Returns:List of (Repository object, current HEAD image)
splitgraph.core.engine.init_engine()

Initializes the engine by:

  • performing any required engine-custom initialization
  • creating the metadata tables
splitgraph.core.engine.lookup_repository(name, include_local=False)

Queries the SG drivers on the lookup path to locate one hosting the given repository.

Parameters:
  • name – Repository name
  • include_local – If True, also queries the local engine
Returns:

Local or remote Repository object

splitgraph.core.engine.repository_exists(repository)

Checks if a repository exists on the engine.

Parameters:repository – Repository object

splitgraph.core.image module

Image representation and provenance

class splitgraph.core.image.Image

Bases: splitgraph.core.image.Image

Represents a Splitgraph image. Shouldn’t be created directly; use the Image-loading methods in the splitgraph.core.repository.Repository class instead.

checkout(keep_downloaded_objects=True, force=False)

Checks the image out, changing the current HEAD pointer. Raises an error if there are pending changes to its checkout.

Parameters:
  • keep_downloaded_objects – If False, deletes externally downloaded objects after they’ve been used.
  • force – Discards all pending changes to the schema.
delete_tag(tag)

Deletes a tag from an image.

Parameters:tag – Tag to delete.
get_log()

Repeatedly gets the parent of a given image until it reaches the bottom.
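
The parent walk can be illustrated in plain Python. This is a minimal sketch, not Splitgraph's implementation: the parents dictionary and the get_log_sketch helper are hypothetical, and images are represented by hash strings rather than Image objects.

```python
def get_log_sketch(parents, image_hash):
    """Follow parent links from image_hash until an image with no parent is reached."""
    log = []
    current = image_hash
    while current is not None:
        log.append(current)
        # A missing key or a None value both mean "no parent".
        current = parents.get(current)
    return log


# Hypothetical three-image history: c -> b -> a, where a has no parent.
parents = {"c": "b", "b": "a", "a": None}
log = get_log_sketch(parents, "c")
```

The result lists the starting image first and the root image last.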

get_parent_children()

Gets the parent and a list of children of a given image.

get_table(table_name)

Returns a Table object representing a version of a given table. Contains a list of objects that the table is linked to: a DIFF object (beginning a chain of DIFFs that describe a table), a SNAP object (a full table copy), or both.

Parameters:table_name – Name of the table
Returns:Table object or None
get_tables()

Gets the names of all tables inside of an image.

get_tags()

Lists all tags that this image has.

provenance()

Inspects the image’s parent chain to come up with a set of repositories and their hashes that it was created from.

Returns:List of (repository, image_hash)
set_provenance(provenance_type, **kwargs)

Sets the image’s provenance. Internal function called by the Splitfile interpreter, shouldn’t be called directly as it changes the image after it’s been created.

Parameters:
  • provenance_type – One of “SQL”, “MOUNT”, “IMPORT” or “FROM”
  • kwargs – Extra provenance-specific arguments
tag(tag, force=False)

Tags a given image. All tags are unique inside of a repository.

Parameters:
  • tag – Tag to set. ‘latest’ and ‘HEAD’ are reserved tags.
  • force – Whether to remove the old tag if an image with this tag already exists.
to_splitfile(err_on_end=True, source_replacement=None)

Crawls the image’s parent chain to recreate a Splitfile that can be used to reconstruct it.

Parameters:
  • err_on_end – If False, when an image with no provenance is reached and it still has a parent, then instead of raising an exception, it will base the Splitfile (using the FROM command) on that image.
  • source_replacement – A dictionary of repositories and image hashes/tags specifying how to replace the dependencies of this Splitfile (table imports and FROM commands).
Returns:

A list of Splitfile commands that can be fed back into the executor.

splitgraph.core.object_manager module

Functions related to creating, deleting and keeping track of physical Splitgraph objects.

class splitgraph.core.object_manager.ObjectManager(object_engine)

Bases: object

A Splitgraph metadata-aware class that keeps track of objects on a given engine. Backed by ObjectEngine to move physical objects around and run metadata queries.

cleanup(include_external=False)

Deletes all local objects not required by any current mountpoint, including their dependencies, their remote locations and their cached local copies.

Parameters:include_external – If True, deletes all external objects cached locally and redownloads them when they’re needed.
delete_objects(objects)

Deletes objects from the Splitgraph cache

Parameters:objects – A sequence of objects to be deleted
download_objects(source, objects_to_fetch, object_locations)

Fetches the required objects from the remote and stores them locally. Does nothing for objects that already exist.

Parameters:
  • source – Remote ObjectManager
  • objects_to_fetch – List of object IDs to download.
  • object_locations – List of custom object locations, encoded as tuples (object_id, object_url, protocol).
Returns:

Set of object IDs that were fetched.
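
The "skip what already exists, report what was fetched" contract can be sketched as follows. The two dictionaries standing in for the local and remote object stores, and the download_objects_sketch helper itself, are assumptions made for illustration; the real method works against engines and external locations.

```python
def download_objects_sketch(local_objects, remote_objects, objects_to_fetch):
    """Copy missing objects from the remote store and return the IDs that were fetched."""
    fetched = set()
    for object_id in objects_to_fetch:
        if object_id in local_objects:
            # Already present locally: nothing to do.
            continue
        local_objects[object_id] = remote_objects[object_id]
        fetched.add(object_id)
    return fetched


# Hypothetical stores keyed by object ID.
local_objects = {"o1": b"rows-1"}
remote_objects = {"o1": b"rows-1", "o2": b"rows-2", "o3": b"rows-3"}
fetched = download_objects_sketch(local_objects, remote_objects, ["o1", "o2"])
```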

extract_recursive_object_meta(remote, table_meta)

Recursively crawls a remote object manager to fetch all objects that are required to materialize the tables specified in table_meta and that don’t yet exist on the local engine.

get_downloaded_objects()

Gets a list of objects currently in the Splitgraph cache (i.e. objects that are stored locally rather than only at an external location).

Returns:Set of object IDs.
get_existing_objects()

Gets all objects currently in the Splitgraph tree.

Returns:Set of object IDs.
get_external_object_locations(objects)

Gets external locations for objects.

Parameters:objects – List of objects stored externally.
Returns:List of (object_id, location, protocol).
get_full_object_tree()

Returns a list of (object_id, parent_id, SNAP/DIFF) with the full object tree in the engine

get_image_object_path(table)

Calculates a list of objects SNAP, DIFF, … , DIFF that are used to reconstruct a table.

Parameters:table – Table object
Returns:A tuple of (SNAP object, list of DIFF objects in reverse order (latest object first))
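
As a rough illustration of the return shape, the sketch below resolves a DIFF chain down to its base SNAP. It assumes a hypothetical in-memory object_tree dictionary mapping object_id to (parent_id, format) and starts from an object ID rather than a Table object; it is not the library's implementation.

```python
def object_path_sketch(object_tree, object_id):
    """Walk parent links from object_id until a SNAP is found.

    Returns (snap_id, diff_ids) with the latest DIFF first.
    """
    diffs = []
    current = object_id
    while current is not None:
        parent_id, object_format = object_tree[current]
        if object_format == "SNAP":
            return current, diffs
        diffs.append(current)
        current = parent_id
    raise ValueError("No SNAP object found at the base of the DIFF chain")


# Hypothetical tree: snap_0 <- diff_1 <- diff_2
object_tree = {
    "snap_0": (None, "SNAP"),
    "diff_1": ("snap_0", "DIFF"),
    "diff_2": ("diff_1", "DIFF"),
}
snap, diffs = object_path_sketch(object_tree, "diff_2")
```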
get_object_meta(objects)

Get metadata for multiple Splitgraph objects from the tree

Parameters:objects – List of objects to get metadata for.
Returns:List of (object_id, format, parent_id, namespace).
record_table_as_diff(old_table, image_hash)

Flushes the pending changes from the audit table for a given table and records them, registering the new objects.

Parameters:
  • old_table – Table object pointing to the current HEAD table
  • image_hash – Image hash to store the table under
record_table_as_snap(repository, table_name, image_hash)

Copies the full table verbatim into a new Splitgraph SNAP object, registering the new object.

Parameters:
  • repository – Repository
  • table_name – Table name
  • image_hash – Hash of the new image
register_object(object_id, object_format, namespace, parent_object=None)

Registers a Splitgraph object in the object tree

Parameters:
  • object_id – Object ID
  • object_format – Format (SNAP or DIFF)
  • namespace – Namespace that owns the object. In registry mode, only namespace owners can alter or delete objects.
  • parent_object – Parent that the object depends on, if it’s a DIFF object.
register_object_locations(object_locations)

Registers external locations (e.g. HTTP or S3) for Splitgraph objects. Objects must already be registered in the object tree.

Parameters:object_locations – List of (object_id, location, protocol).
register_objects(object_meta, namespace=None)

Registers multiple Splitgraph objects in the tree. See register_object for more information.

Parameters:
  • object_meta – List of (object_id, format, parent_id, namespace).
  • namespace – If specified, overrides the original object namespace, required in the case where the remote repository has a different namespace than the local one.
register_table(repository, table, image, object_id)

Registers the object that represents a Splitgraph table inside of an image.

Parameters:
  • repository – Repository
  • table – Table name
  • image – Image hash
  • object_id – Object ID to register the table to.
register_tables(repository, table_meta)

Links tables in an image to physical objects that they are stored as. Objects must already be registered in the object tree.

Parameters:
  • repository – Repository that the tables belong to.
  • table_meta – A list of (image_hash, table_name, object_id).
upload_objects(target, objects_to_push, handler='DB', handler_params=None)

Uploads physical objects to the remote or some other external location.

Parameters:
  • target – Target ObjectManager
  • objects_to_push – List of object IDs to upload.
  • handler – Name of the handler to use to upload objects. Use DB to push them to the remote, FILE to store them in a directory that can be accessed from the client and HTTP to upload them over HTTP.
  • handler_params – For HTTP, a dictionary {“username”: username, “password”: password}. For FILE, a dictionary {“path”: path} specifying the directory where the objects shall be saved.
Returns:

A list of (object_id, url, handler) that specifies all objects that were uploaded (skipping objects that already exist on the remote).

splitgraph.core.object_manager.get_random_object_id()

Assigns each table a random ID that it will be stored as. Note that Postgres limits table names to 63 characters, so the IDs are 248-bit strings, hex-encoded (62 characters), plus a letter prefix, since Postgres doesn’t seem to support table names starting with a digit.
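
The length arithmetic works out as 248 bits = 62 hex digits, plus one prefix letter, for 63 characters in total. A minimal sketch of such an ID generator follows; the "o" prefix and the zero-padding are assumptions, and the helper name is hypothetical.

```python
from random import getrandbits


def random_object_id_sketch():
    """Return a 63-character ID: one letter followed by 62 hex digits."""
    # Zero-pad to 62 digits so that small random values keep a fixed length.
    # The "o" prefix is an assumed choice of letter.
    return "o" + format(getrandbits(248), "062x")


object_id = random_object_id_sketch()
```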

splitgraph.core.object_manager.getrandbits(k) → x. Generates an int with k random bits.

splitgraph.core.registry module

Functions for communicating with the remote Splitgraph catalog

splitgraph.core.registry.get_info_key(engine, key)

Gets a configuration key from the remote registry, used to notify the client of the registry’s capabilities.

Parameters:
  • engine – Engine
  • key – Key to get
splitgraph.core.registry.get_published_info(repository, tag)

Get information on an image that’s published in a catalog.

Parameters:
  • repository – Repository
  • tag – Image tag
Returns:

A tuple of (image_hash, published_timestamp, provenance, readme, table schemata, previews)

splitgraph.core.registry.publish_tag(repository, tag, image_hash, published, provenance, readme, schemata, previews)

Publishes a given tag in the remote catalog. Shouldn’t be called directly. Use splitgraph.commands.publish instead.

Parameters:
  • repository – Remote (!) Repository object
  • tag – Tag to publish
  • image_hash – Image hash corresponding to the given tag.
  • published – Publish time (datetime)
  • provenance – A list of tuples (repository, image_hash) showing what the image was created from
  • readme – An optional README for the repo
  • schemata – Dict mapping table name to a list of (column name, column type)
  • previews – Dict mapping table name to a list of tuples with a preview
splitgraph.core.registry.set_info_key(engine, key, value)

Sets a configuration value on the remote registry.

Parameters:
  • engine – Engine
  • key – Key to set
  • value – New value for the key
splitgraph.core.registry.setup_registry_mode(engine)

Drops tables in splitgraph_meta that aren’t pertinent to the registry + sets up access policies/RLS:

  • Normal users aren’t allowed to create tables/schemata (can’t do checkouts inside of a registry or upload SG objects directly to it)
  • images/tables/tags meta tables: can only create/update/delete records where the namespace = user ID
  • objects/object_location tables: same. An object (piece of data) becomes owned by the user that creates it and still remains so even if someone else’s image starts using it. Hence, the original owner can delete or change it (since they control the external location they’ve uploaded it to anyway).
splitgraph.core.registry.toggle_registry_rls(engine, mode='ENABLE')

Switches row-level security on the registry, restricting write access to metadata tables to owners of relevant repositories/objects.

Parameters:
  • engine – Engine
  • mode – ENABLE, DISABLE or FORCE (enable for superusers/table owners)
splitgraph.core.registry.unpublish_repository(repository)

Deletes the repository from the remote catalog.

Parameters:repository – Repository to unpublish

splitgraph.core.repository module

Public API for managing images in a Splitgraph repository.

class splitgraph.core.repository.ImageManager(repository)

Bases: object

Collects various image-related functions.

add(parent_id, image, created=None, comment=None, provenance_type=None, provenance_data=None)

Registers a new image in the Splitgraph image tree.

Internal method used by actual image creation routines (committing, importing or pulling).

Parameters:
  • parent_id – Parent of the image
  • image – Image hash
  • created – Creation time (defaults to current timestamp)
  • comment – Comment (defaults to empty)
  • provenance_type – Image provenance that can be used to rebuild the image (one of None, FROM, MOUNT, IMPORT, SQL)
  • provenance_data – Extra provenance data (dictionary).
by_hash(image_hash, raise_on_none=True)

Returns an image corresponding to a given (possibly shortened) image hash. If the image hash is ambiguous, raises an error. If the image does not exist, raises an error or returns None.

Parameters:
  • image_hash – Image hash (can be shortened).
  • raise_on_none – Whether to raise if the image doesn’t exist.
Returns:

Image object or None
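
The lookup rules (unique prefix match, error on ambiguity, error or None when nothing matches) can be sketched over a plain list of hashes. The helper name and the exception types are assumptions chosen for illustration; the library may raise different errors.

```python
def resolve_hash_sketch(known_hashes, prefix, raise_on_none=True):
    """Resolve a possibly shortened hash against a collection of full hashes."""
    matches = sorted(h for h in known_hashes if h.startswith(prefix))
    if len(matches) > 1:
        raise ValueError(
            "Image hash %s is ambiguous: %s" % (prefix, ", ".join(matches))
        )
    if not matches:
        if raise_on_none:
            raise LookupError("No image found for hash %s" % prefix)
        return None
    return matches[0]


# Hypothetical (shortened for readability) image hashes.
known_hashes = ["a1b2c3", "a1ffee", "0042aa"]
resolved = resolve_hash_sketch(known_hashes, "00")
missing = resolve_hash_sketch(known_hashes, "dead", raise_on_none=False)
```

Here the prefix "a1" would be rejected as ambiguous because it matches two hashes.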

by_tag(tag, raise_on_none=True)

Returns an image with a given tag

Parameters:
  • tag – Tag. ‘latest’ is a special case: it returns the most recent image in the repository.
  • raise_on_none – Whether to raise an error or return None if the tag doesn’t exist.
delete(images)

Deletes a set of Splitgraph images from the repository. Note this doesn’t check whether this will orphan some other images in the repository and can make the state of the repository invalid.

Image deletions won’t be replicated on push/pull (those can only add new images).

Parameters:images – List of image IDs
get_all_child_images(start_image)

Get all children of start_image of any degree.

get_all_parent_images(start_images)

Get all parents of the ‘start_images’ set of any degree.

class splitgraph.core.repository.Repository(namespace, repository, engine=None)

Bases: object

Splitgraph repository API

commit(image_hash=None, include_snap=False, comment=None)

Commits all pending changes to a given repository, creating a new image.

Parameters:
  • image_hash – Hash of the commit. Chosen at random if unspecified.
  • include_snap – If True, also creates a SNAP object with a full copy of the table. This will speed up checkouts, but consumes extra space.
  • comment – Optional comment to add to the commit.
Returns:

The newly created Image object.

delete_upstream()

Deletes the upstream remote + repository for a local repository.

diff(table_name, image_1, image_2, aggregate=False)

Compares the state of a table in different images. If the two images are on the same path in the commit tree, it doesn’t need to materialize any of the tables and simply aggregates their DIFF objects to produce a complete changelog. Otherwise, it materializes both tables into a temporary space and compares them row-to-row.

Parameters:
  • table_name – Name of the table.
  • image_1 – First image hash / object. If None, uses the state of the current staging area.
  • image_2 – Second image hash / object. If None, uses the state of the current staging area.
  • aggregate – If True, returns a tuple of integers denoting added, removed and updated rows between the two images.
Returns:

If the table doesn’t exist in one of the images, returns True if it was added and False if it was removed. If aggregate is True, returns the aggregation of changes as specified before. Otherwise, returns a list of changes where each change is of the format (primary key, action_type, action_data):

  • action_type == 0: Insert, action_data is a dictionary of non-PK columns and the values inserted.
  • action_type == 1: Delete, action_data is None.
  • action_type == 2: Update, action_data is a dictionary of non-PK columns and their new values for that particular row.
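
To make the changelog format concrete, the sketch below builds a small list of (primary key, action_type, action_data) changes and reduces it to the (added, removed, updated) counts that aggregate=True is described as returning. The representation of primary keys as tuples and the helper name are assumptions.

```python
def aggregate_changes_sketch(changes):
    """Count inserts (0), deletes (1) and updates (2) in a list of changes."""
    counts = [0, 0, 0]
    for _primary_key, action_type, _action_data in changes:
        counts[action_type] += 1
    return tuple(counts)


# Hypothetical changelog for a table with a single-column primary key.
changes = [
    ((1,), 0, {"name": "apple"}),   # insert
    ((2,), 1, None),                # delete
    ((3,), 2, {"name": "cherry"}),  # update
    ((4,), 0, {"name": "date"}),    # insert
]
summary = aggregate_changes_sketch(changes)
```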

classmethod from_schema(schema)

Convert a Postgres schema name of the format namespace/repository to a Splitgraph repository object.
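
The namespace/repository naming convention can be sketched as a simple string split. This illustrates the format only; how the real classmethod treats a schema name without a slash is not stated here, so the empty-namespace fallback below is an assumption.

```python
def split_schema_sketch(schema):
    """Split a 'namespace/repository' schema name into its two parts."""
    if "/" in schema:
        namespace, repository = schema.split("/", 1)
    else:
        # Assumed fallback: no namespace part present.
        namespace, repository = "", schema
    return namespace, repository


pair = split_schema_sketch("noaa/climate")
```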

classmethod from_template(template, namespace=None, repository=None, engine=None)

Create a Repository from an existing one replacing some of its attributes.

get_all_hashes_tags()

Gets all tagged images and their hashes in a given repository.

Returns:List of (image_hash, tag)
get_upstream()

Gets the current upstream repository that a local repository tracks

Returns:Remote Repository object (with a remote engine)
has_pending_changes()

Detects if the repository has any pending changes (schema changes, table additions/deletions, content changes).

head

Return the HEAD image for the repository or None if the repository isn’t checked out.

import_tables(tables, source_repository, source_tables, image_hash=None, foreign_tables=False, do_checkout=True, target_hash=None, table_queries=None)

Creates a new commit in target_repository with one or more tables linked to already-existing tables. After this operation, the HEAD of the target repository moves to the new commit and the new tables are materialized.

Parameters:
  • tables – If not empty, must be a list of the same length as source_tables specifying names to store them under in the target repository.
  • source_repository – Repository to import tables from.
  • source_tables – List of tables to import. If empty, imports all tables.
  • image_hash – Image hash in the source repository to import tables from. Uses the current source HEAD by default.
  • foreign_tables – If True, copies all source tables to create a series of new SNAP objects instead of treating them as Splitgraph-versioned tables. This is useful for adding brand new tables (for example, from an FDW-mounted table).
  • do_checkout – If False, doesn’t materialize the tables in the target mountpoint.
  • target_hash – Hash of the new image that the tables are recorded under. If None, gets chosen at random.
  • table_queries – If not [], it’s treated as a Boolean mask showing which entries in the tables list are instead SELECT SQL queries that form the target table. The queries have to be non-schema qualified and work only against tables in the source repository. Each target table created is the result of the respective SQL query. This is committed as a new snapshot.
Returns:

Hash that the new image was stored under.

init()

Initializes an empty repo with an initial commit (hash 0000…)

materialized_table(table_name, image_hash)

A context manager that returns a pointer to a read-only materialized table in a given image. If the table is already stored as a SNAP, this doesn’t use any extra space. Otherwise, the table is materialized and deleted on exit from the context manager.

Parameters:
  • table_name – Name of the table
  • image_hash – Image hash to materialize
Returns:

(schema, table_name) where the materialized table is located. The table must not be changed, as it might be a pointer to a real SG SNAP object.
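
The "reuse a SNAP if there is one, otherwise materialize and clean up on exit" behaviour follows a common context-manager pattern. A minimal sketch, assuming a hypothetical snapshots dictionary and an event log in place of real Postgres tables:

```python
from contextlib import contextmanager


@contextmanager
def materialized_table_sketch(snapshots, table_name, image_hash, log):
    """Yield a (schema, table) pointer, cleaning up only temporary copies."""
    key = (image_hash, table_name)
    if key in snapshots:
        # Already stored as a full copy: hand out a pointer, nothing to clean up.
        yield snapshots[key]
        return
    temp_location = ("tmp_schema", "%s_%s" % (table_name, image_hash))
    log.append(("materialize", temp_location))
    try:
        yield temp_location
    finally:
        # Runs on normal exit and on exceptions inside the with-block.
        log.append(("drop", temp_location))


# Hypothetical data: one image stores the table as a SNAP, the other does not.
snapshots = {("abc123", "fruits"): ("object_schema", "o_snap_fruits")}
log = []
with materialized_table_sketch(snapshots, "fruits", "abc123", log) as location:
    snap_location = location
with materialized_table_sketch(snapshots, "fruits", "def456", log) as location:
    temp_location = location
```

Only the second call records materialize and drop events, which mirrors the note that a SNAP-backed table uses no extra space.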

publish(tag, remote_repository=None, readme='', include_provenance=True, include_table_previews=True)

Summarizes the data on a previously-pushed repository and makes it available in the catalog.

Parameters:
  • tag – Image tag. Only images with tags can be published.
  • remote_repository – Remote Repository object (uses the upstream if unspecified)
  • readme – Optional README for the repository.
  • include_provenance – If False, doesn’t include the dependencies of the image
  • include_table_previews – Whether to include data previews for every table in the image.
pull(download_all=False)

Synchronizes the state of the local Splitgraph repository with its upstream, optionally downloading all new objects created on the remote.

Parameters:download_all – If True, downloads all objects and stores them locally. Otherwise, will only download required objects when a table is checked out.
push(remote_repository=None, handler='DB', handler_options=None)

Inverse of pull: Pushes all local changes to the remote and uploads new objects.

Parameters:
  • remote_repository – Remote repository to push changes to. If not specified, the current upstream is used.
  • handler – Name of the handler to use to upload objects. Use DB to push them to the remote or S3 to store them in an S3 bucket.
  • handler_options – Extra options to pass to the handler. For example, see splitgraph.hooks.s3.S3ExternalObjectHandler.
rm(unregister=True, uncheckout=True)

Discards all changes to a given repository and optionally all of its history, as well as deleting the Postgres schema that it might be checked out into. Doesn’t delete any cached physical objects.

After performing this operation, this object becomes invalid and must be discarded, unless init() is called again.

Parameters:
  • unregister – Whether to purge repository history/metadata
  • uncheckout – Whether to delete the actual checked out repo
run_sql(sql, arguments=(), return_shape=<ResultShape.MANY_MANY: 4>)

Execute an arbitrary SQL statement inside of this repository’s checked out schema.

set_tags(tags, force=False)

Sets tags for multiple images.

Parameters:
  • tags – List of (image_hash, tag)
  • force – Whether to remove the old tag if an image with this tag already exists.
set_upstream(remote_repository)

Sets the upstream remote + repository that this repository tracks.

Parameters:remote_repository – Remote Repository object
to_schema()

Returns the engine schema that this repository gets checked out into.

uncheckout(force=False)

Deletes the schema that the repository is checked out into

Parameters:force – Discards all pending changes to the schema.
splitgraph.core.repository.clone(remote_repository, local_repository=None, download_all=False)

Clones a remote Splitgraph repository or synchronizes remote changes with the local ones.

If the target repository has no set upstream engine, the source repository becomes its upstream.

Parameters:
  • remote_repository – Remote Repository object to clone or the repository’s name. If a name is passed, the repository will be looked up on the current lookup path in order to find the engine the repository belongs to.
  • local_repository – Local repository to clone into. If None, uses the same name as the remote.
  • download_all – If True, downloads all objects and stores them locally. Otherwise, will only download required objects when a table is checked out.
Returns:

A locally cloned Repository object.

splitgraph.core.repository.find_path(repository, hash_1, hash_2)

If the two images are on the same path in the commit tree, returns that path.
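
One way to picture "on the same path" is that one image is an ancestor of the other. The sketch below checks both directions over a hypothetical parents dictionary and returns the chain from ancestor to descendant, or None if neither image descends from the other. The exact contents and ordering of the path returned by the real function are not specified here, so this shape is an assumption.

```python
def find_path_sketch(parents, hash_1, hash_2):
    """Return the ancestor-to-descendant chain linking the two hashes, or None."""
    for descendant, ancestor in ((hash_2, hash_1), (hash_1, hash_2)):
        path = [descendant]
        # Climb parent links until the ancestor or the root is reached.
        while path[-1] != ancestor and parents.get(path[-1]) is not None:
            path.append(parents[path[-1]])
        if path[-1] == ancestor:
            return path[::-1]
    return None


# Hypothetical commit tree: a <- b <- c, and a <- d on a separate branch.
parents = {"a": None, "b": "a", "c": "b", "d": "a"}
same_branch = find_path_sketch(parents, "a", "c")
different_branches = find_path_sketch(parents, "c", "d")
```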

splitgraph.core.repository.getrandbits(k) → x. Generates an int with k random bits.
splitgraph.core.repository.import_table_from_remote(remote_repository, remote_tables, remote_image_hash, target_repository, target_tables, target_hash=None)

Shorthand for importing one or more tables from a yet-uncloned remote. Here, the remote image hash is required, as otherwise we aren’t necessarily able to determine what the remote head is.

Parameters:
  • remote_repository – Remote Repository object
  • remote_tables – List of remote tables to import
  • remote_image_hash – Image hash to import the tables from
  • target_repository – Target repository to import the tables to
  • target_tables – Target table aliases
  • target_hash – Hash of the image that’s created with the import. Default random.
splitgraph.core.repository.table_exists_at(repository, table_name, image=None)

Determines whether a given table exists in a Splitgraph image without checking it out. If image is None, determines whether the table exists in the current staging area.

splitgraph.core.table module

class splitgraph.core.table.Table(repository, image, table_name, objects)

Bases: object

Represents a Splitgraph table in a given image. Shouldn’t be created directly; use the Table-loading methods in the splitgraph.core.image.Image class instead.

get_object(object_type)

Get the physical object ID of a given type that this table is linked to

Parameters:object_type – Either SNAP or DIFF
Returns:Object ID or None if an object of such type doesn’t exist
get_schema()

Gets the schema of a given table

Returns:The table schema. See the documentation for get_full_table_schema for the spec.
materialize(destination, destination_schema=None)

Materializes a Splitgraph table in the target schema as a normal Postgres table, potentially downloading all required objects and using them to reconstruct the table.

Parameters:
  • destination – Name of the destination table.
  • destination_schema – Name of the destination schema.
Returns:

A set of IDs of downloaded objects used to construct the table.

Module contents

Core Splitgraph functionality: versioning and sharing tables.

The main point of interaction with the Splitgraph API is a splitgraph.core.repository.Repository object representing a local or a remote Splitgraph repository. Repositories can be created using one of the following methods:

  • Directly by invoking Repository(namespace, name, engine) where engine is the engine that the repository belongs to (it can be obtained with get_engine(engine_name)). If the created repository doesn’t actually exist on the engine, it must first be initialized with repository.init().
  • By using splitgraph.core.engine.lookup_repository() which will search for the repository on the current lookup path.