splitgraph.core package

Submodules

splitgraph.core.engine module

Routines for managing Splitgraph engines, including looking up repositories and managing objects.

splitgraph.core.engine.get_current_repositories(engine)

Lists all repositories currently in the engine.

Parameters:engine – Engine
Returns:List of (Repository object, current HEAD image)
splitgraph.core.engine.init_engine()

Initializes the engine by:

  • performing any required engine-custom initialization
  • creating the metadata tables
splitgraph.core.engine.lookup_repository(name, include_local=False)

Queries the SG drivers on the lookup path to locate one hosting the given repository.

Parameters:
  • name – Repository name
  • include_local – If True, also queries the local engine
Returns:

Local or remote Repository object

splitgraph.core.engine.repository_exists(repository)

Checks if a repository exists on the engine.

Parameters:repository – Repository object

splitgraph.core.image module

Image representation and provenance

class splitgraph.core.image.Image

Bases: splitgraph.core.image.Image

Represents a Splitgraph image. Shouldn’t be created directly; use the Image-loading methods in the splitgraph.core.repository.Repository class instead.

checkout(force=False, layered=False)

Checks the image out, changing the current HEAD pointer. Raises an error if there are pending changes to its checkout.

Parameters:
  • force – Discards all pending changes to the schema.
  • layered – If True, uses layered querying to check out the image (doesn’t materialize tables inside of it).
delete_tag(tag)

Deletes a tag from an image.

Parameters:tag – Tag to delete.
get_log()

Repeatedly gets the parent of a given image until it reaches the bottom.
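
The traversal described above can be pictured with a small standalone sketch. The `parents` mapping and the `image_log` helper are hypothetical stand-ins for the image metadata, not the library's implementation, and including the starting image in the result is an assumption.

```python
from typing import Dict, List, Optional

# Hypothetical image tree: image hash -> parent hash (None marks the bottom of the chain).
parents: Dict[str, Optional[str]] = {
    "0000": None,
    "a1b2": "0000",
    "c3d4": "a1b2",
}


def image_log(image_hash: str) -> List[str]:
    """Follow parent links from image_hash until an image without a parent is reached."""
    log: List[str] = []
    current: Optional[str] = image_hash
    while current is not None:
        log.append(current)
        current = parents[current]
    return log


print(image_log("c3d4"))  # ['c3d4', 'a1b2', '0000']
```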

get_parent_children()

Gets the parent and a list of children of a given image.

get_table(table_name)

Returns a Table object representing a version of a given table. Contains a list of objects that the table is linked to: a DIFF object (beginning a chain of DIFFs that describe a table), a SNAP object (a full table copy), or both.

Parameters:table_name – Name of the table
Returns:Table object or None
get_tables()

Gets the names of all tables inside of an image.

get_tags()

Lists all tags that this image has.

provenance()

Inspects the image’s parent chain to come up with a set of repositories and their hashes that it was created from.

Returns:List of (repository, image_hash)
set_provenance(provenance_type, **kwargs)

Sets the image’s provenance. Internal function called by the Splitfile interpreter, shouldn’t be called directly as it changes the image after it’s been created.

Parameters:
  • provenance_type – One of “SQL”, “MOUNT”, “IMPORT” or “FROM”
  • kwargs – Extra provenance-specific arguments
tag(tag, force=False)

Tags a given image. All tags are unique inside of a repository.

Parameters:
  • tag – Tag to set. ‘latest’ and ‘HEAD’ are reserved tags.
  • force – Whether to remove the old tag if an image with this tag already exists.
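
The uniqueness rule and the effect of `force` can be modelled with a toy registry. `TagRegistry` is a hypothetical class written for illustration only; in particular, rejecting reserved tags with a `ValueError` is an assumption about how such a rule might be enforced.

```python
from typing import Dict

RESERVED_TAGS = {"latest", "HEAD"}  # reserved according to the parameter description


class TagRegistry:
    """Toy model: within one repository, each tag points to exactly one image hash."""

    def __init__(self) -> None:
        self.tags: Dict[str, str] = {}

    def tag(self, image_hash: str, tag: str, force: bool = False) -> None:
        if tag in RESERVED_TAGS:
            raise ValueError(f"{tag!r} is a reserved tag")
        existing = self.tags.get(tag)
        if existing is not None and existing != image_hash and not force:
            raise ValueError(f"tag {tag!r} already points to image {existing}")
        # With force=True, the tag is moved from the old image to the new one.
        self.tags[tag] = image_hash


registry = TagRegistry()
registry.tag("a1b2", "v1")
registry.tag("c3d4", "v1", force=True)
print(registry.tags)  # {'v1': 'c3d4'}
```
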
to_splitfile(err_on_end=True, source_replacement=None)

Crawls the image’s parent chain to recreate a Splitfile that can be used to reconstruct it.

Parameters:
  • err_on_end – If False, when an image with no provenance is reached and it still has a parent, then instead of raising an exception, it will base the Splitfile (using the FROM command) on that image.
  • source_replacement – A dictionary of repositories and image hashes/tags specifying how to replace the dependencies of this Splitfile (table imports and FROM commands).
Returns:

A list of Splitfile commands that can be fed back into the executor.

splitgraph.core.object_manager module

Functions related to creating, deleting and keeping track of physical Splitgraph objects.

class splitgraph.core.object_manager.ObjectManager(object_engine)

Bases: splitgraph.core.fragment_manager.FragmentManager, splitgraph.core.metadata_manager.MetadataManager

Brings the multiple manager classes together and manages the object cache (downloading and uploading objects as required in order to fulfill certain queries)

cleanup()

Deletes all objects in the object_tree not required by any current repository, including their dependencies and their remote locations. Also deletes all objects not registered in the object_tree.

delete_objects(objects)

Deletes objects from the Splitgraph cache

Parameters:objects – A sequence of objects to be deleted
download_objects(source, objects_to_fetch, object_locations)

Fetches the required objects from the remote and stores them locally. Does nothing for objects that already exist.

Parameters:
  • source – Remote ObjectManager. If None, will only try to download objects from the external location.
  • objects_to_fetch – List of object IDs to download.
  • object_locations – List of custom object locations, encoded as tuples (object_id, object_url, protocol).
ensure_objects(table, quals=None)

Resolves the objects needed to materialize a given table and makes sure they are in the local splitgraph_meta schema.

Whilst inside this manager, the objects are guaranteed to exist. On exit from it, the objects are marked as unneeded and can be garbage collected.

Parameters:
  • table – Table to materialize
  • quals – Optional list of qualifiers to be passed to the fragment engine. Fragments that definitely do not match these qualifiers will be dropped. See the docstring for filter_fragments for the format.
Returns:

List of table fragments
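
The guarantee that objects exist only while inside the manager suggests a reference-counting context manager. The sketch below is a simplified model of that idea; the `refcounts` dictionary and the `pinned` helper are hypothetical names and do not reflect how the library stores its bookkeeping.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List

# Hypothetical bookkeeping: object ID -> number of active users.
refcounts: Dict[str, int] = {}


@contextmanager
def pinned(object_ids: List[str]) -> Iterator[List[str]]:
    """Keep the given objects alive for the duration of the with-block."""
    for object_id in object_ids:
        refcounts[object_id] = refcounts.get(object_id, 0) + 1
    try:
        yield object_ids
    finally:
        # On exit the objects are released and may be garbage collected later.
        for object_id in object_ids:
            refcounts[object_id] -= 1


with pinned(["o1", "o2"]) as objects:
    print(refcounts)  # {'o1': 1, 'o2': 1}
print(refcounts)  # {'o1': 0, 'o2': 0}
```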

get_cache_occupancy()
Returns:Space occupied by objects cached from external locations, in bytes.
get_downloaded_objects(limit_to=None)

Gets a list of objects currently in the Splitgraph cache (i.e. not existing only externally).

Parameters:limit_to – If specified, only the objects in this list will be returned.
Returns:Set of object IDs.
get_full_object_tree()

Returns a dictionary (object_id -> parent, object_format, size) with the full object tree in the engine

run_eviction(object_tree, keep_objects, required_space=None)

Delete enough objects with zero reference count (only those, since we guarantee that whilst refcount is >0, the object stays alive) to free at least required_space in the cache.

Parameters:
  • object_tree – Object tree dictionary
  • keep_objects – List of objects (besides those with nonzero refcount) that can’t be deleted.
  • required_space – Space, in bytes, to free. If the routine can’t free at least this much space, it shall raise an exception. If None, removes all eligible objects.
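
A minimal sketch of the eviction rule, under stated assumptions: the object tree is reduced to a hypothetical mapping of object ID to (size, refcount), the helper `choose_evictions` only selects objects rather than deleting them, and evicting the largest objects first is an arbitrary policy chosen for the example, since the text above does not specify an order.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical simplified object tree: object ID -> (size in bytes, reference count).
ObjectInfo = Dict[str, Tuple[int, int]]


def choose_evictions(objects: ObjectInfo, keep_objects: Iterable[str],
                     required_space: Optional[int] = None) -> List[str]:
    """Pick zero-refcount objects to delete so that at least required_space is freed."""
    keep = set(keep_objects)
    candidates = [oid for oid, (_, refcount) in objects.items()
                  if refcount == 0 and oid not in keep]
    if required_space is None:
        return candidates  # remove every eligible object
    freeable = sum(objects[oid][0] for oid in candidates)
    if freeable < required_space:
        raise RuntimeError(f"can only free {freeable} of {required_space} bytes")
    evicted: List[str] = []
    freed = 0
    # Assumption: largest objects first, so fewer deletions are needed.
    for oid in sorted(candidates, key=lambda o: objects[o][0], reverse=True):
        if freed >= required_space:
            break
        evicted.append(oid)
        freed += objects[oid][0]
    return evicted


cache = {"a": (100, 0), "b": (50, 1), "c": (30, 0), "d": (70, 0)}
print(choose_evictions(cache, keep_objects=["d"], required_space=110))  # ['a', 'c']
```
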
upload_objects(target, objects_to_push, handler='DB', handler_params=None)

Uploads physical objects to the remote or some other external location.

Parameters:
  • target – Target ObjectManager
  • objects_to_push – List of object IDs to upload.
  • handler – Name of the handler to use to upload objects. Use DB to push them to the remote, FILE to store them in a directory that can be accessed from the client and HTTP to upload them to HTTP.
  • handler_params – For HTTP, a dictionary {“username”: username, “password”: password}. For FILE, a dictionary {“path”: path} specifying the directory where the objects shall be saved.
Returns:

A list of (object_id, url, handler) tuples specifying all objects that were uploaded (skipping objects that already exist on the remote).

splitgraph.core.registry module

Functions for communicating with the remote Splitgraph catalog

splitgraph.core.registry.get_info_key(engine, key)

Gets a configuration key from the remote registry, used to notify the client of the registry’s capabilities.

Parameters:
  • engine – Engine
  • key – Key to get
splitgraph.core.registry.get_published_info(repository, tag)

Get information on an image that’s published in a catalog.

Parameters:
  • repository – Repository
  • tag – Image tag
Returns:

A tuple of (image_hash, published_timestamp, provenance, readme, table schemata, previews)

splitgraph.core.registry.publish_tag(repository, tag, image_hash, published, provenance, readme, schemata, previews)

Publishes a given tag in the remote catalog. Shouldn’t be called directly. Use splitgraph.commands.publish instead.

Parameters:
  • repository – Remote (!) Repository object
  • tag – Tag to publish
  • image_hash – Image hash corresponding to the given tag.
  • published – Publish time (datetime)
  • provenance – A list of tuples (repository, image_hash) showing what the image was created from
  • readme – An optional README for the repo
  • schemata – Dict mapping table name to a list of (column name, column type)
  • previews – Dict mapping table name to a list of tuples with a preview
splitgraph.core.registry.set_info_key(engine, key, value)

Sets a configuration value on the remote registry.

Parameters:
  • engine – Engine
  • key – Key to set
  • value – New value for the key
splitgraph.core.registry.setup_registry_mode(engine)

Drops tables in splitgraph_meta that aren’t pertinent to the registry and sets up access policies/RLS:

  • Normal users aren’t allowed to create tables/schemata (can’t do checkouts inside of a registry or upload SG objects directly to it)
  • images/tables/tags meta tables: can only create/update/delete records where the namespace = user ID
  • objects/object_location tables: same. An object (piece of data) becomes owned by the user that creates it and still remains so even if someone else’s image starts using it. Hence, the original owner can delete or change it (since they control the external location they’ve uploaded it to anyway).
splitgraph.core.registry.toggle_registry_rls(engine, mode='ENABLE')

Switches row-level security on the registry, restricting write access to metadata tables to owners of relevant repositories/objects.

Parameters:
  • engine – Engine
  • mode – ENABLE, DISABLE or FORCE (enable for superusers/table owners)
splitgraph.core.registry.unpublish_repository(repository)

Deletes the repository from the remote catalog.

Parameters:repository – Repository to unpublish

splitgraph.core.repository module

Public API for managing images in a Splitgraph repository.

class splitgraph.core.repository.ImageManager(repository)

Bases: object

Collects various image-related functions.

add(parent_id, image, created=None, comment=None, provenance_type=None, provenance_data=None)

Registers a new image in the Splitgraph image tree.

Internal method used by actual image creation routines (committing, importing or pulling).

Parameters:
  • parent_id – Parent of the image
  • image – Image hash
  • created – Creation time (defaults to current timestamp)
  • comment – Comment (defaults to empty)
  • provenance_type – Image provenance that can be used to rebuild the image (one of None, FROM, MOUNT, IMPORT, SQL)
  • provenance_data – Extra provenance data (dictionary).
by_hash(image_hash, raise_on_none=True)

Returns an image corresponding to a given (possibly shortened) image hash. If the image hash is ambiguous, raises an error. If the image does not exist, raises an error or returns None.

Parameters:
  • image_hash – Image hash (can be shortened).
  • raise_on_none – Whether to raise if the image doesn’t exist.
Returns:

Image object or None
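
The three outcomes (unique match, ambiguous prefix, no match) can be sketched over a plain list of hashes. `resolve_hash` is a hypothetical helper and the exception types are assumptions; the library may raise different errors.

```python
from typing import List, Optional


def resolve_hash(all_hashes: List[str], prefix: str,
                 raise_on_none: bool = True) -> Optional[str]:
    """Resolve a possibly shortened hash against the known image hashes."""
    matches = [h for h in all_hashes if h.startswith(prefix)]
    if len(matches) > 1:
        raise ValueError(f"hash {prefix!r} is ambiguous: {matches}")
    if not matches:
        if raise_on_none:
            raise KeyError(f"no image matches {prefix!r}")
        return None
    return matches[0]


hashes = ["a1b2c3", "a1ffee", "c3d4e5"]
print(resolve_hash(hashes, "c3"))                       # c3d4e5
print(resolve_hash(hashes, "ff", raise_on_none=False))  # None
```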

by_tag(tag, raise_on_none=True)

Returns an image with a given tag

Parameters:
  • tag – Tag. ‘latest’ is a special case: it returns the most recent image in the repository.
  • raise_on_none – Whether to raise an error or return None if the tag doesn’t exist.
delete(images)

Deletes a set of Splitgraph images from the repository. Note this doesn’t check whether this will orphan some other images in the repository and can make the state of the repository invalid.

Image deletions won’t be replicated on push/pull (those can only add new images).

Parameters:images – List of image IDs
get_all_child_images(start_image)

Get all children of start_image of any degree.

get_all_parent_images(start_images)

Get all parents of the ‘start_images’ set of any degree.
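
Both traversals amount to a transitive closure over parent links. The sketch below uses a hypothetical `parents` mapping; whether the starting images themselves belong to the result is not stated above, and this example includes them as an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical image tree: image hash -> parent hash. Images b2 and c3 share parent a1.
parents: Dict[str, Optional[str]] = {"0000": None, "a1": "0000", "b2": "a1", "c3": "a1"}


def all_parents(start_images: Iterable[str]) -> Set[str]:
    """Collect the start images and every ancestor of any degree."""
    result = set(start_images)
    stack = list(result)
    while stack:
        parent = parents[stack.pop()]
        if parent is not None and parent not in result:
            result.add(parent)
            stack.append(parent)
    return result


def all_children(start_image: str) -> Set[str]:
    """Collect the start image and every descendant of any degree."""
    children: Dict[str, List[str]] = defaultdict(list)
    for image, parent in parents.items():
        if parent is not None:
            children[parent].append(image)
    result = {start_image}
    stack = [start_image]
    while stack:
        for child in children[stack.pop()]:
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


print(sorted(all_parents({"b2"})))  # ['0000', 'a1', 'b2']
print(sorted(all_children("a1")))   # ['a1', 'b2', 'c3']
```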

class splitgraph.core.repository.Repository(namespace, repository, engine=None)

Bases: object

Splitgraph repository API

commit(image_hash=None, comment=None, snap_only=False, chunk_size=10000, split_changeset=False)

Commits all pending changes to a given repository, creating a new image.

Parameters:
  • image_hash – Hash of the commit. Chosen at random if unspecified.
  • comment – Optional comment to add to the commit.
  • snap_only – If True, will store the table as a full snapshot instead of using delta compression.
  • chunk_size – For tables that are stored as snapshots (new tables and tables for which snap_only has been passed), the table will be split into fragments of this many rows.
  • split_changeset – If True, splits the changeset into multiple fragments based on the PK regions spanned by the current table fragments. For example, if the original table consists of 2 fragments, first spanning rows 1-10000, second spanning rows 10001-20000 and the change alters rows 1, 10001 and inserts a row with PK 20001, this will record the change as 3 fragments: one inheriting from the first original fragment, one inheriting from the second and a brand new fragment. This increases the number of fragments in total but means that fewer rows will need to be scanned to satisfy a query. If False, the changeset will be stored as a single fragment inheriting from the last fragment in the table.
Returns:

The newly created Image object.
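
The split_changeset example from the parameter list can be reproduced with a short sketch. The helper `split_changeset_by_fragment`, the fragment names and the use of integer primary keys are hypothetical; the sketch only shows how changed rows are grouped by the PK region of existing fragments, not how fragments are actually written.

```python
from typing import Dict, List, Optional, Tuple

# (fragment name, lowest PK, highest PK), boundaries inclusive
Fragment = Tuple[str, int, int]


def split_changeset_by_fragment(fragments: List[Fragment],
                                changed_pks: List[int]) -> Dict[Optional[str], List[int]]:
    """Group changed PKs by the existing fragment whose PK range contains them.

    PKs that fall outside every range are grouped under None, standing for a
    brand new fragment.
    """
    groups: Dict[Optional[str], List[int]] = {}
    for pk in changed_pks:
        owner = next((name for name, low, high in fragments if low <= pk <= high), None)
        groups.setdefault(owner, []).append(pk)
    return groups


fragments = [("fragment_1", 1, 10000), ("fragment_2", 10001, 20000)]
print(split_changeset_by_fragment(fragments, [1, 10001, 20001]))
# {'fragment_1': [1], 'fragment_2': [10001], None: [20001]}
```

With split_changeset=False, the same three changes would instead be kept together in one fragment.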

delete(unregister=True, uncheckout=True)

Discards all changes to a given repository and optionally all of its history, as well as deleting the Postgres schema that it might be checked out into. Doesn’t delete any cached physical objects.

After performing this operation, this object becomes invalid and must be discarded, unless init() is called again.

Parameters:
  • unregister – Whether to purge repository history/metadata
  • uncheckout – Whether to delete the actual checked out repo
diff(table_name, image_1, image_2, aggregate=False)

Compares the state of a table in different images. If the two images are on the same path in the commit tree, it doesn’t need to materialize any of the tables and simply aggregates their DIFF objects to produce a complete changelog. Otherwise, it materializes both tables into a temporary space and compares them row-to-row.

Parameters:
  • table_name – Name of the table.
  • image_1 – First image hash / object. If None, uses the state of the current staging area.
  • image_2 – Second image hash / object. If None, uses the state of the current staging area.
  • aggregate – If True, returns a tuple of integers denoting added, removed and updated rows between the two images.
Returns:

If the table doesn’t exist in one of the images, returns True if it was added and False if it was removed. If aggregate is True, returns the aggregation of changes as specified before. Otherwise, returns a list of changes where each change is of the format (primary key, action_type, action_data):

  • action_type == 0 is Insert and the action_data contains a dictionary of non-PK columns and values
    inserted.
  • action_type == 1: Delete, action_data is None.
  • action_type == 2: Update, action_data is a dictionary of non-PK columns and their new values for
    that particular row.
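
To make the changelog format concrete, the sketch below replays a list of (primary key, action_type, action_data) changes against an in-memory table and computes the aggregate counts. Representing the table as a dictionary keyed by PK tuples is an assumption for illustration, and both helper functions are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

PK = Tuple[Any, ...]
Change = Tuple[PK, int, Optional[Dict[str, Any]]]


def apply_changes(rows: Dict[PK, Dict[str, Any]],
                  changes: List[Change]) -> Dict[PK, Dict[str, Any]]:
    """Replay a changelog on a copy of rows (PK -> non-PK columns)."""
    result = {pk: dict(columns) for pk, columns in rows.items()}
    for pk, action_type, action_data in changes:
        if action_type == 0:    # insert: action_data holds the non-PK columns
            result[pk] = dict(action_data or {})
        elif action_type == 1:  # delete: action_data is None
            del result[pk]
        elif action_type == 2:  # update: action_data holds the new column values
            result[pk].update(action_data or {})
        else:
            raise ValueError(f"unknown action_type {action_type}")
    return result


def aggregate_changes(changes: List[Change]) -> Tuple[int, int, int]:
    """Count (added, removed, updated) rows, as returned when aggregate=True."""
    counts = [0, 0, 0]
    for _, action_type, _ in changes:
        counts[action_type] += 1
    return counts[0], counts[1], counts[2]


rows = {(1,): {"name": "apple"}, (2,): {"name": "pear"}}
changes = [((3,), 0, {"name": "plum"}), ((1,), 1, None), ((2,), 2, {"name": "quince"})]
print(apply_changes(rows, changes))  # {(2,): {'name': 'quince'}, (3,): {'name': 'plum'}}
print(aggregate_changes(changes))    # (1, 1, 1)
```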

dump(stream)

Creates an SQL dump with the metadata required for the repository and all of its objects.

Parameters:stream – Stream to dump the data into.
classmethod from_schema(schema)

Convert a Postgres schema name of the format namespace/repository to a Splitgraph repository object.
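
The namespace/repository naming convention can be illustrated without the library. `RepoName`, `schema_to_name` and `name_to_schema` are hypothetical; treating a schema without a slash as a repository with an empty namespace is an assumption.

```python
from typing import NamedTuple


class RepoName(NamedTuple):
    namespace: str
    repository: str


def schema_to_name(schema: str) -> RepoName:
    """Split 'namespace/repository' into its two parts."""
    namespace, _, repository = schema.rpartition("/")
    return RepoName(namespace, repository)


def name_to_schema(name: RepoName) -> str:
    """Inverse of schema_to_name."""
    return f"{name.namespace}/{name.repository}" if name.namespace else name.repository


print(schema_to_name("noaa/climate"))  # RepoName(namespace='noaa', repository='climate')
```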

classmethod from_template(template, namespace=None, repository=None, engine=None)

Create a Repository from an existing one replacing some of its attributes.

get_all_hashes_tags()

Gets all tagged images and their hashes in a given repository.

Returns:List of (image_hash, tag)
has_pending_changes()

Detects if the repository has any pending changes (schema changes, table additions/deletions, content changes).

head

Return the HEAD image for the repository or None if the repository isn’t checked out.

import_tables(tables, source_repository, source_tables, image_hash=None, foreign_tables=False, do_checkout=True, target_hash=None, table_queries=None)

Creates a new commit in this (target) repository with one or more tables linked to already-existing tables. After this operation, the HEAD of the target repository moves to the new commit and the new tables are materialized.

Parameters:
  • tables – If not empty, must be the list of the same length as source_tables specifying names to store them under in the target repository.
  • source_repository – Repository to import tables from.
  • source_tables – List of tables to import. If empty, imports all tables.
  • image_hash – Image hash in the source repository to import tables from. Uses the current source HEAD by default.
  • foreign_tables – If True, copies all source tables to create a series of new SNAP objects instead of treating them as Splitgraph-versioned tables. This is useful for adding brand new tables (for example, from an FDW-mounted table).
  • do_checkout – If False, doesn’t materialize the tables in the target mountpoint.
  • target_hash – Hash of the new image that the tables are recorded under. If None, it is chosen at random.
  • table_queries – If not [], it’s treated as a Boolean mask showing which entries in the tables list are instead SELECT SQL queries that form the target table. The queries have to be non-schema qualified and work only against tables in the source repository. Each target table created is the result of the respective SQL query. This is committed as a new snapshot.
Returns:

Hash that the new image was stored under.

init()

Initializes an empty repo with an initial commit (hash 0000…)

materialized_table(table_name, image_hash)

A context manager that returns a pointer to a read-only materialized table in a given image. If the table is already stored as a SNAP, this doesn’t use any extra space. Otherwise, the table is materialized and deleted on exit from the context manager.

Parameters:
  • table_name – Name of the table
  • image_hash – Image hash to materialize
Returns:

(schema, table_name) where the materialized table is located. The table must not be changed, as it might be a pointer to a real SG SNAP object.

publish(tag, remote_repository=None, readme='', include_provenance=True, include_table_previews=True)

Summarizes the data on a previously-pushed repository and makes it available in the catalog.

Parameters:
  • tag – Image tag. Only images with tags can be published.
  • remote_repository – Remote Repository object (uses the upstream if unspecified)
  • readme – Optional README for the repository.
  • include_provenance – If False, doesn’t include the dependencies of the image
  • include_table_previews – Whether to include data previews for every table in the image.
pull(download_all=False)

Synchronizes the state of the local Splitgraph repository with its upstream, optionally downloading all new objects created on the remote.

Parameters:download_all – If True, downloads all objects and stores them locally. Otherwise, will only download required objects when a table is checked out.
push(remote_repository=None, handler='DB', handler_options=None)

Inverse of pull: Pushes all local changes to the remote and uploads new objects.

Parameters:
  • remote_repository – Remote repository to push changes to. If not specified, the current upstream is used.
  • handler – Name of the handler to use to upload objects. Use DB to push them to the remote or S3 to store them in an S3 bucket.
  • handler_options – Extra options to pass to the handler. For example, see splitgraph.hooks.s3.S3ExternalObjectHandler.
run_sql(sql, arguments=None, return_shape=<ResultShape.MANY_MANY: 4>)

Execute an arbitrary SQL statement inside of this repository’s checked out schema.

set_tags(tags, force=False)

Sets tags for multiple images.

Parameters:
  • tags – List of (image_hash, tag)
  • force – Whether to remove the old tag if an image with this tag already exists.
to_schema()

Returns the engine schema that this repository gets checked out into.

uncheckout(force=False)

Deletes the schema that the repository is checked out into

Parameters:force – Discards all pending changes to the schema.
upstream

The remote upstream repository that this local repository tracks.

splitgraph.core.repository.clone(remote_repository, local_repository=None, download_all=False)

Clones a remote Splitgraph repository or synchronizes remote changes with the local ones.

If the target repository has no set upstream engine, the source repository becomes its upstream.

Parameters:
  • remote_repository – Remote Repository object to clone or the repository’s name. If a name is passed, the repository will be looked up on the current lookup path in order to find the engine the repository belongs to.
  • local_repository – Local repository to clone into. If None, uses the same name as the remote.
  • download_all – If True, downloads all objects and stores them locally. Otherwise, will only download required objects when a table is checked out.
Returns:

A locally cloned Repository object.

splitgraph.core.repository.getrandbits(k) → x. Generates an int with k random bits.
splitgraph.core.repository.import_table_from_remote(remote_repository, remote_tables, remote_image_hash, target_repository, target_tables, target_hash=None)

Shorthand for importing one or more tables from a yet-uncloned remote. Here, the remote image hash is required, as otherwise we aren’t necessarily able to determine what the remote head is.

Parameters:
  • remote_repository – Remote Repository object
  • remote_tables – List of remote tables to import
  • remote_image_hash – Image hash to import the tables from
  • target_repository – Target repository to import the tables to
  • target_tables – Target table aliases
  • target_hash – Hash of the image that’s created with the import. Default random.
splitgraph.core.repository.table_exists_at(repository, table_name, image=None)

Determines whether a given table exists in a Splitgraph image without checking it out. If image is None, determines whether the table exists in the current staging area.

splitgraph.core.table module

Table metadata-related classes.

class splitgraph.core.table.Table(repository, image, table_name, table_schema, objects)

Bases: object

Represents a Splitgraph table in a given image. Shouldn’t be created directly; use the Table-loading methods in the splitgraph.core.image.Image class instead.

materialize(destination, destination_schema=None, lq_server=None)

Materializes a Splitgraph table in the target schema as a normal Postgres table, potentially downloading all required objects and using them to reconstruct the table.

Parameters:
  • destination – Name of the destination table.
  • destination_schema – Name of the destination schema.
  • lq_server – If set, sets up a layered querying FDW for the table using this foreign server instead of materializing it.

Module contents

Core Splitgraph functionality: versioning and sharing tables.

The main point of interaction with the Splitgraph API is a splitgraph.core.repository.Repository object representing a local or a remote Splitgraph repository. Repositories can be created using one of the following methods:

  • Directly by invoking Repository(namespace, name, engine), where engine is the engine that the repository belongs to (it can be obtained with get_engine(engine_name)). If the created repository doesn’t actually exist on the engine, it must first be initialized with repository.init().
  • By using splitgraph.core.engine.lookup_repository() which will search for the repository on the current lookup path.