splitgraph.core package

Module contents

Core Splitgraph functionality: versioning and sharing tables.

The main point of interaction with the Splitgraph API is a splitgraph.core.repository.Repository object representing a local or a remote Splitgraph repository. Repositories can be created using one of the following methods:

  • Directly by invoking Repository(namespace, name, engine) where engine is the engine that the repository belongs to (that can be gotten with get_engine(engine_name). If the created repository doesn’t actually exist on the engine, it must first be initialized with repository.init().

  • By using splitgraph.core.engine.lookup_repository() which will search for the repository on the current lookup path.

Submodules

splitgraph.core.common module

Common internal functions used by Splitgraph commands.

class splitgraph.core.common.CallbackList(iterable=(), /)

Bases: list

Used to pass around and call multiple callbacks at once.

class splitgraph.core.common.Tracer

Bases: object

Accumulates events and returns the times between them.

get_durations()List[Tuple[str, float]]

Return all events and durations between them. :return: List of (event name, time to this event from the previous event (or start))

get_total_time()float
Returns

Time from start to the final logged event.

log(event: str)None

Log an event at the current time :param event: Event name

splitgraph.core.common.adapt(value: Any, pg_type: str)Any

Coerces a value with a PG type into its Python equivalent.

Parameters
  • value – Value

  • pg_type – Postgres datatype

Returns

Coerced value.

splitgraph.core.common.aggregate_changes(query_result: List[Tuple[int, int]], initial: Optional[Tuple[int, int, int]] = None)Tuple[int, int, int]

Add a changeset to the aggregated diff result

splitgraph.core.common.coerce_val_to_json(val: Any)Any

Turn a Python value to a string/float that can be stored as JSON.

splitgraph.core.common.ensure_metadata_schema(engine: PsycopgEngine)None

Create or migrate the metadata schema splitgraph_meta that stores the hash tree of schema snapshots (images), tags and tables. This means we can’t mount anything under the schema splitgraph_meta – much like we can’t have a folder “.git” under Git version control…

splitgraph.core.common.gather_sync_metadata(target: Repository, source: Repository, overwrite_objects=False, overwrite_tags=False, single_image: Optional[str] = None)Any

Inspects two Splitgraph repositories and gathers metadata that is required to bring target up to date with source.

Parameters
  • target – Target Repository object

  • source – Source repository object

  • overwrite_objects – If True, will return metadata for all objects belonging to new images (or existing image if single_image=True)

  • single_image – If set, only grab a single image with this hash/tag from the source.

  • overwrite_tags – If True and single_image is set, will return all tags for that image. If single_image is not set, will return all tags in the source repository. If False, will only return tags in the source that don’t exist on the target.

Returns

Tuple of metadata for new_images, new_tables, object_locations, object_meta, tags

splitgraph.core.common.get_data_safe(package: str, resource: str)bytes
splitgraph.core.common.get_temporary_table_id()str

Generate a random ID for temporary/staging objects that haven’t had their ID calculated yet.

splitgraph.core.common.getrandbits(k)x.  Generates an int with k random bits.
splitgraph.core.common.manage_audit(func: Callable)Callable

A decorator to be put around various Splitgraph commands that adds/removes audit triggers for new/committed/deleted tables.

splitgraph.core.common.manage_audit_triggers(engine: PostgresEngine, object_engine: Optional[PostgresEngine] = None)None

Does bookkeeping on audit triggers / audit table:

  • Detect tables that are being audited that don’t need to be any more (e.g. they’ve been unmounted)

  • Drop audit triggers for those and delete all audit info for them

  • Set up audit triggers for new tables

If the metadata engine isn’t the same as the object engine, this does nothing.

Parameters
  • engine – Metadata engine with information about images and their checkout state

  • object_engine – Object engine where the checked-out table and the audit triggers are located.

splitgraph.core.common.resource_path(relative_path)
splitgraph.core.common.set_head(repository: Repository, image: Optional[str])None

Sets the HEAD pointer of a given repository to a given image. Shouldn’t be used directly.

splitgraph.core.common.set_tag(repository: Repository, image_hash: Optional[str], tag: str)None

Internal function – add a tag to an image.

splitgraph.core.common.set_tags_batch(repository: Repository, hashes_tags: List[Tuple[str, str]])None
splitgraph.core.common.slow_diff(repository: Repository, table_name: str, image_1: Optional[str], image_2: Optional[str], aggregate: bool)Union[Tuple[int, int, int], List[Tuple[bool, Tuple]]]

Materialize both tables and manually diff them

splitgraph.core.engine module

Routines for managing Splitgraph engines, including looking up repositories and managing objects.

splitgraph.core.engine.get_current_repositories(engine: PostgresEngine)List[Tuple[Repository, Optional[Image]]]

Lists all repositories currently in the engine.

Parameters

engine – Engine

Returns

List of (Repository object, current HEAD image)

splitgraph.core.engine.init_engine(skip_object_handling: bool = False)None

Initializes the engine by:

  • performing any required engine-custom initialization

  • creating the metadata tables

Parameters

skip_object_handling – If True, skips installing routines related to object handling and checkouts (like audit triggers and CStore management).

splitgraph.core.engine.lookup_repository(name: str, include_local: bool = False)Repository

Queries the SG engines on the lookup path to locate one hosting the given repository.

Parameters
  • name – Repository name

  • include_local – If True, also queries the local engine

Returns

Local or remote Repository object

splitgraph.core.engine.repository_exists(repository: Repository)bool

Checks if a repository exists on the engine.

Parameters

repository – Repository object

splitgraph.core.fdw_checkout module

splitgraph.core.fragment_manager module

Routines related to storing tables as fragments.

class splitgraph.core.fragment_manager.Digest(shorts: Tuple[int, ])

Bases: object

Homomorphic hashing similar to LtHash (but limited to being backed by 256-bit hashes). The main property is that for any rows A, B, LtHash(A) + LtHash(B) = LtHash(A+B). This is done by construction: we simply hash individual rows and then do bit-wise addition / subtraction of individual hashes to come up with the full table hash.

Hence, the content hash of any Splitgraph table fragment is the sum of hashes of its added rows minus the sum of hashes of its deleted rows (including the old values of the rows that have been updated). This has a very useful implication: the hash of a full Splitgraph table is equal to the sum of hashes of its individual fragments.

This property can be used to simplify deduplication.

classmethod empty()splitgraph.core.fragment_manager.Digest

Return an empty Digest instance such that for any Digest D, D + empty == D - empty == D

classmethod from_hex(hex_string: str)splitgraph.core.fragment_manager.Digest

Create a Digest from a 64-characters (256-bit) hexadecimal string

classmethod from_memoryview(memory: Union[bytes, memoryview])splitgraph.core.fragment_manager.Digest

Create a Digest from a 256-bit memoryview/bytearray.

hex()str

Convert the hash into a hexadecimal value.

class splitgraph.core.fragment_manager.FragmentManager(object_engine: PostgresEngine, metadata_engine: Optional[PostgresEngine] = None)

Bases: splitgraph.core.metadata_manager.MetadataManager

A storage engine for Splitgraph tables. Each table can be stored as one or more immutable fragments that can optionally overwrite each other. When a new table is created, it’s split up into multiple base fragments. When a new version of the table is written, the audit log is inspected and one or more patch fragments are created, to be based on the fragments the previous version of the table consisted of. Only the top fragments in this stack are stored in the table metadata: to reconstruct the whole table, the links from the top fragments down to the base fragments have to be followed.

In addition, the fragments engine also supports min-max indexing on fragments: this is used to only fetch fragments that are required for a given query.

calculate_content_hash(schema: str, table: str, table_schema: Optional[List[splitgraph.core.types.TableColumn]] = None, chunk_id_col: Optional[str] = None, chunk_id: Optional[int] = None)Tuple[str, int]

Calculates the homomorphic hash of table contents.

Parameters
  • schema – Schema the table belongs to

  • table – Name of the table

  • table_schema – Schema of the table

  • chunk_id_col – Column the table is partitioned on

  • chunk_id – Column value to get rows from

Returns

A 64-character (256-bit) hexadecimal string with the content hash of the table and the number of rows in the hash.

calculate_fragment_insertion_hash_stats(schema: str, table: str, table_schema: Optional[List[splitgraph.core.types.TableColumn]] = None)Tuple[splitgraph.core.fragment_manager.Digest, int]

Calculate the homomorphic hash of just the rows that a given fragment inserts :param schema: Schema the fragment is stored in. :param table: Name of the table the fragment is stored in. :return: A Digest object and the number of inserted rows

create_base_fragment(source_schema: str, source_table: str, namespace: str, chunk_id_col: Optional[str] = None, chunk_id: Optional[int] = None, extra_indexes: Optional[Dict[str, Union[List[str], Dict[str, Dict[str, Any]]]]] = None, in_fragment_order: Optional[List[str]] = None, overwrite: bool = False, table_schema: Optional[List[splitgraph.core.types.TableColumn]] = None)str
delete_objects(objects: Union[Set[str], List[str]])None

Deletes objects from the Splitgraph cache

Parameters

objects – A sequence of objects to be deleted

filter_fragments(object_ids: List[str], table: Table, quals: Any)List[str]

Performs fuzzy filtering on the given object IDs using the index and a set of qualifiers, discarding objects that definitely do not match the qualifiers.

Parameters
  • object_ids – List of object IDs to filter.

  • table – A Table object the objects belong to.

  • quals

    List of qualifiers in conjunctive normal form that will be matched against the index. Objects that definitely don’t match these qualifiers will be discarded.

    A list containing [[qual_1, qual_2], [qual_3, qual_4]] will be interpreted as (qual_1 OR qual_2) AND (qual_3 OR qual_4).

    Each qual is a tuple of (column_name, operator, value) whereoperator can be one of >, >=, <, <=, =.

    For unknown operators, it will be assumed that all fragments might match that clause.

Returns

List of objects that might match the given qualifiers.

generate_object_index(object_id: str, table_schema: List[splitgraph.core.types.TableColumn], changeset: Optional[Dict[Tuple[str, ], Tuple[bool, Dict[str, Any], Dict[str, Any]]]] = None, extra_indexes: Optional[Dict[str, Union[List[str], Dict[str, Dict[str, Any]]]]] = None)Dict[str, Any]

Queries the max/min values of a given fragment for each column, used to speed up querying.

Parameters
  • object_id – ID of an object

  • table_schema – Schema of the table the object belongs to.

  • changeset – Optional, if specified, the old row values are included in the index.

  • extra_indexes – Dictionary of &lbrace;index_type: column: index_specific_kwargs&rbrace;.

Returns

Dict containing the object index.

get_min_max_pks(fragments: List[str], table_pks: List[Tuple[str, str]])List[Tuple[Tuple, Tuple]]

Get PK ranges for given fragments using the index (without reading the fragments).

Parameters
  • fragments – List of object IDs (must be registered and with the same schema)

  • table_pks – List of tuples (column, type) that form the object PK.

Returns

List of (min, max) PK for every fragment where PK is a tuple. If a fragment doesn’t exist or doesn’t have a corresponding index entry, a SplitGraphError is raised.

record_table_as_base(repository: Repository, table_name: str, image_hash: str, chunk_size: Optional[int] = 10000, source_schema: Optional[str] = None, source_table: Optional[str] = None, extra_indexes: Optional[Dict[str, Union[List[str], Dict[str, Dict[str, Any]]]]] = None, in_fragment_order: Optional[List[str]] = None, overwrite: bool = False)List[str]

Copies the full table verbatim into one or more new base fragments and registers them.

Parameters
  • repository – Repository

  • table_name – Table name

  • image_hash – Hash of the new image

  • chunk_size – If specified, splits the table into multiple objects with a given number of rows

  • source_schema – Override the schema the source table is stored in

  • source_table – Override the name of the table the source is stored in

  • extra_indexes – Dictionary of &lbrace;index_type: column: index_specific_kwargs&rbrace;.

  • in_fragment_order – Key to sort data inside each chunk by.

  • overwrite – Overwrite physical objects that already exist.

record_table_as_patch(old_table: Table, schema: str, image_hash: str, new_schema_spec: List[splitgraph.core.types.TableColumn] = None, split_changeset: bool = False, extra_indexes: Optional[Dict[str, Union[List[str], Dict[str, Dict[str, Any]]]]] = None, in_fragment_order: Optional[List[str]] = None, overwrite: bool = False)None

Flushes the pending changes from the audit table for a given table and records them, registering the new objects.

Parameters
  • old_table – Table object pointing to the current HEAD table

  • schema – Schema the table is checked out into.

  • image_hash – Image hash to store the table under

  • new_schema_spec – New schema of the table (use the old table’s schema by default).

  • split_changeset – See Repository.commit for reference

  • extra_indexes – Dictionary of &lbrace;index_type: column: index_specific_kwargs&rbrace;.

splitgraph.core.fragment_manager.get_chunk_groups(chunks: List[Tuple[str, Any, Any]])List[List[Tuple[str, Any, Any]]]

Takes a list of chunks and their boundaries and combines them into independent groups such that chunks from no two groups overlap with each other (intervals are assumed to be closed, e.g. chunk (1,2) overlaps with chunk (2,3)).

The original order of chunks is preserved within each group.

For example, 4 chunks A, B, C, D that don’t overlap each other will be grouped into 4 groups [A], [B], [C], [D].

If A overlaps B, the result will be [A, B], [C], [D].

If in addition B overlaps C (but not A), the result will be [A, B, C], [D].

If in addition D overlaps any of A, B or C, the result will be [A, B, C, D](despite that D is located before A: it will be last since it was last in the original list).

Parameters

chunks – List of (chunk_id, start, end)

Returns

List of lists of (chunk_id, start, end)

splitgraph.core.image module

Image representation and provenance

class splitgraph.core.image.Image(image_hash: str, parent_id: Optional[str], created: datetime.datetime, comment: str, provenance_data: List[Dict[str, Union[str, List[str], List[bool], List[Dict[str, str]]]]], repository: Repository)

Bases: tuple

Represents a Splitgraph image. Should’t be created directly, use Image-loading methods in thesplitgraph.core.repository.Repository class instead.

checkout(force: bool = False, layered: bool = False)None

Checks the image out, changing the current HEAD pointer. Raises an error if there are pending changes to its checkout.

Parameters
  • force – Discards all pending changes to the schema.

  • layered – If True, uses layered querying to check out the image (doesn’t materialize tables inside of it).

property comment

Alias for field number 3

property created

Alias for field number 2

delete_tag(tag: str)None

Deletes a tag from an image.

Parameters

tag – Tag to delete.

property engine
get_log()List[splitgraph.core.image.Image]

Repeatedly gets the parent of a given image until it reaches the bottom.

get_parent_children()Tuple[Optional[str], List[str]]

Gets the parent and a list of children of a given image.

get_size()int

Get the physical size used by the image’s objects (including those that might be shared with other images).

This is calculated from the metadata, the on-disk footprint might be smaller if not all of image’s objects have been downloaded.

Returns

Size of the image in bytes.

get_table(table_name: str)splitgraph.core.table.Table

Returns a Table object representing a version of a given table. Contains a list of objects that the table is linked to and the table’s schema.

Parameters

table_name – Name of the table

Returns

Table object

get_tables()List[str]

Gets the names of all tables inside of an image.

get_tags()

Lists all tags that this image has.

property image_hash

Alias for field number 0

property object_engine
property parent_id

Alias for field number 1

provenance(reverse=False, engine=None)List[Tuple[Repository, str]]

Inspects the image’s parent chain to come up with a set of repositories and their hashes that it was created from.

If reverse is True, returns a list of images that were created _from_ this image. If this image is on a remote repository, engine can be passed in to override the engine used for the lookup of dependents.

Returns

List of (repository, image_hash)

property provenance_data

Alias for field number 4

query_schema(wrapper: Optional[str] = 'splitgraph.core.fdw_checkout.QueryingForeignDataWrapper', commit: bool = True)Iterator[str]

Creates a temporary schema with tables in this image mounted as foreign tables that can be accessed via read-only layered querying. On exit from the context manager, the schema is discarded.

Returns

The name of the schema the image is located in.

property repository

Alias for field number 5

set_provenance(provenance_data: List[Dict[str, Union[str, List[str], List[bool], List[Dict[str, str]]]]])None

Sets the image’s provenance. Internal function called by the Splitfile interpreter, shouldn’t be called directly as it changes the image after it’s been created.

Parameters

provenance_data – List of parsed Splitfile commands and their data.

tag(tag: str)None

Tags a given image. All tags are unique inside of a repository. If a tag already exists, it’s removed from the previous image and given to the new image.

Parameters

tag – Tag to set. ‘latest’ and ‘HEAD’ are reserved tags.

to_splitfile(ignore_irreproducible: bool = False, source_replacement: Optional[Dict[Repository, str]] = None)List[str]

Recreate the Splitfile that can be used to reconstruct this image.

Parameters
  • ignore_irreproducible – If True, ignore commands from irreproducible Splitfile lines (like MOUNT or custom commands) and instead emit a comment (this results in an invalid Splitfile).

  • source_replacement – A dictionary of repositories and image hashes/tags specifying how to replace the dependencies of this Splitfile (table imports and FROM commands).

Returns

A list of Splitfile commands that can be fed back into the executor.

splitgraph.core.image.getrandbits(k)x.  Generates an int with k random bits.
splitgraph.core.image.reconstruct_splitfile(provenance_data: List[Dict[str, Union[str, List[str], List[bool], List[Dict[str, str]]]]], ignore_irreproducible: bool = False, source_replacement: Optional[Dict[Repository, str]] = None)List[str]

Recreate the Splitfile that can be used to reconstruct an image.

splitgraph.core.image_manager module

class splitgraph.core.image_manager.ImageManager(repository: Repository)

Bases: object

Collects various image-related functions.

add(parent_id: Optional[str], image: str, created: Optional[datetime.datetime] = None, comment: Optional[str] = None, provenance_data: Optional[List[Dict[str, Union[str, List[str], List[bool], List[Dict[str, str]]]]]] = None)None

Registers a new image in the Splitgraph image tree.

Internal method used by actual image creation routines (committing, importing or pulling).

Parameters
  • parent_id – Parent of the image

  • image – Image hash

  • created – Creation time (defaults to current timestamp)

  • comment – Comment (defaults to empty)

  • provenance_data – Provenance data that can be used to reconstruct the image.

add_batch(images: List[splitgraph.core.image.Image])None

Like add, but registers multiple images at the same time. Used in push/pull to avoid a roundtrip to the registry for each image :param images: List of Image objects. Namespace and repository will be patched

with this repository.

by_hash(image_hash: str)splitgraph.core.image.Image

Returns an image corresponding to a given (possibly shortened) image hash. If the image hash is ambiguous, raises an error. If the image does not exist, raises an error or returns None.

Parameters

image_hash – Image hash (can be shortened).

Returns

Image

by_tag(tag: str, raise_on_none: bool = True)Optional[splitgraph.core.image.Image]

Returns an image with a given tag

Parameters
  • tag – Tag. ‘latest’ is a special case: it returns the most recent image in the repository.

  • raise_on_none – Whether to raise an error or return None if the tag doesn’t exist.

delete(images: Sequence[str])None

Deletes a set of Splitgraph images from the repository. Note this doesn’t check whether this will orphan some other images in the repository and can make the state of the repository invalid.

Image deletions won’t be replicated on push/pull (those can only add new images).

Parameters

images – List of image IDs

get_all_child_images(start_image: str)Set[str]

Get all children of start_image of any degree.

get_all_parent_images(start_images: Set[str])Set[str]

Get all parents of the ‘start_images’ set of any degree.

splitgraph.core.metadata_manager module

Classes related to managing table/image/object metadata tables.

class splitgraph.core.metadata_manager.MetadataManager(metadata_engine: PsycopgEngine)

Bases: object

A data access layer for the metadata tables in the splitgraph_meta schema that concerns itself with image, table and object information.

cleanup_metadata()List[str]

Go through the current metadata and delete all objects that aren’t required by any table on the engine.

Returns

List of objects that have been deleted.

delete_object_meta(object_ids: Sequence[str])

Delete metadata for multiple objects (external locations, indexes, hashes). This doesn’t delete physical objects.

Parameters

object_ids – Object IDs to delete

get_all_objects()List[str]

Gets all objects currently in the Splitgraph tree.

Returns

List of object IDs.

get_external_object_locations(objects: List[str])List[Tuple[str, str, str]]

Gets external locations for objects.

Parameters

objects – List of object IDs stored externally.

Returns

List of (object_id, location, protocol).

get_new_objects(object_ids: List[str])List[str]

Get object IDs from the passed list that don’t exist in the tree.

Parameters

object_ids – List of objects to check

Returns

List of unknown object IDs.

get_object_meta(objects: List[str])Dict[str, splitgraph.core.metadata_manager.Object]

Get metadata for multiple Splitgraph objects from the tree

Parameters

objects – List of objects to get metadata for.

Returns

Dictionary of object_id -> Object

get_objects_for_repository(repository: Repository, image_hash: Optional[str] = None)List[str]
get_unused_objects(threshold: Optional[int] = None)List[Tuple[str, datetime.datetime]]

Get a list of all objects in the metadata that aren’t used by any table and can be safely deleted.

Parameters

threshold – Only return objects that were created earlier than this (in minutes)

Returns

List of objects and their creation times.

overwrite_table(repository: Repository, image_hash: str, table_name: str, table_schema: List[splitgraph.core.types.TableColumn], objects: List[str])
register_object_locations(object_locations: List[Tuple[str, str, str]])None

Registers external locations (e.g. HTTP or S3) for Splitgraph objects. Objects must already be registered in the object tree.

Parameters

object_locations – List of (object_id, location, protocol).

register_objects(objects: List[splitgraph.core.metadata_manager.Object], namespace: Optional[str] = None)None

Registers multiple Splitgraph objects in the tree.

Parameters
  • objects – List of Object objects.

  • namespace – If specified, overrides the original object namespace, required in the case where the remote repository has a different namespace than the local one.

register_tables(repository: Repository, table_meta: List[Tuple[str, str, List[splitgraph.core.types.TableColumn], List[str]]])None

Links tables in an image to physical objects that they are stored as. Objects must already be registered in the object tree.

Parameters
  • repository – Repository that the tables belong to.

  • table_meta – A list of (image_hash, table_name, table_schema, object_ids).

class splitgraph.core.metadata_manager.Object(object_id: str, format: str, namespace: str, size: int, created: datetime.datetime, insertion_hash: str, deletion_hash: str, object_index: Dict[str, Any], rows_inserted: int, rows_deleted: int)

Bases: tuple

Represents a Splitgraph object that tables are composed of.

property created

Alias for field number 4

property deletion_hash

Alias for field number 6

property format

Alias for field number 1

property insertion_hash

Alias for field number 5

property namespace

Alias for field number 2

property object_id

Alias for field number 0

property object_index

Alias for field number 7

property rows_deleted

Alias for field number 9

property rows_inserted

Alias for field number 8

property size

Alias for field number 3

splitgraph.core.migration module

splitgraph.core.migration.get_installed_version(engine: PsycopgEngine, schema_name: str, version_table: str = 'version')Optional[Tuple[str, datetime.datetime]]
splitgraph.core.migration.make_file_list(schema_name: str, migration_path: List[Optional[str]])

Construct a list of file names from history of versions and schema name

splitgraph.core.migration.set_installed_version(engine: PsycopgEngine, schema_name: str, version: str, version_table: str = 'version')
splitgraph.core.migration.source_files_to_apply(engine: PsycopgEngine, schema_name: str, schema_files: List[str], version_table: str = 'version', static: bool = False, target_version: Optional[str] = None)Tuple[List[str], str]

Get the ordered list of .sql files to apply to the database

splitgraph.core.object_manager module

Functions related to creating, deleting and keeping track of physical Splitgraph objects.

class splitgraph.core.object_manager.ObjectManager(object_engine: PostgresEngine, metadata_engine: Optional[PostgresEngine] = None)

Bases: splitgraph.core.fragment_manager.FragmentManager

Brings the multiple manager classes together and manages the object cache (downloading and uploading objects as required in order to fulfill certain queries)

cleanup()List[str]

Deletes all objects in the object_tree not required by any current repository, including their dependencies and their remote locations. Also deletes all objects not registered in the object_tree.

download_objects(source: Optional[splitgraph.core.object_manager.ObjectManager], objects_to_fetch: List[str], object_locations: List[Tuple[str, str, str]])List[str]

Fetches the required objects from the remote and stores them locally. Does nothing for objects that already exist.

Parameters
  • source – Remote ObjectManager. If None, will only try to download objects from the external location.

  • objects_to_fetch – List of object IDs to download.

  • object_locations – List of custom object locations, encoded as tuples (object_id, object_url, protocol).

ensure_objects(table: Optional[Table], objects: Optional[List[str]] = None, quals: Optional[Sequence[Sequence[Tuple[str, str, Any]]]] = None, defer_release: bool = False, tracer: Optional[splitgraph.core.common.Tracer] = None, upstream_manager: Optional[ObjectManager] = None)Iterator[Union[List[str], Tuple[List[str], splitgraph.core.common.CallbackList]]]

Resolves the objects needed to materialize a given table and makes sure they are in the local splitgraph_meta schema.

Whilst inside this manager, the objects are guaranteed to exist. On exit from it, the objects are marked as unneeded and can be garbage collected.

Parameters
  • table – Table to materialize

  • objects – List of objects to download: one of table or objects must be specified.

  • quals – Optional list of qualifiers to be passed to the fragment engine. Fragments that definitely do not match these qualifiers will be dropped. See the docstring for filter_fragments for the format.

  • defer_release – If True, won’t release the objects on exit.

Returns

If defer_release is True: List of table fragments and a callback that the caller must call when the objects are no longer needed. If defer_release is False: just the list of table fragments.

get_cache_occupancy()int
Returns

Space occupied by objects cached from external locations, in bytes.

get_downloaded_objects(limit_to: Optional[List[str]] = None)List[str]

Gets a list of objects currently in the Splitgraph cache (i.e. not only existing externally.)

Parameters

limit_to – If specified, only the objects in this list will be returned.

Returns

Set of object IDs.

get_total_object_size()
Returns

Space occupied by all objects on the engine, in bytes.

make_objects_external(objects: List[str], handler: str, handler_params: Dict[Any, Any])None

Uploads local objects to an external location and marks them as being cached locally (thus making it possible to evict or swap them out).

Parameters
  • objects – Object IDs to upload. Will do nothing for objects that already exist externally.

  • handler – Object handler

  • handler_params – Extra handler parameters

run_eviction(keep_objects: List[str], required_space: Optional[int] = None)None

Delete enough objects with zero reference count (only those, since we guarantee that whilst refcount is >0, the object stays alive) to free at least required_space in the cache.

Parameters
  • keep_objects – List of objects (besides those with nonzero refcount) that can’t be deleted.

  • required_space – Space, in bytes, to free. If the routine can’t free at least this much space, it shall raise an exception. If None, removes all eligible objects.

upload_objects(target: splitgraph.core.object_manager.ObjectManager, objects_to_push: List[str], handler: str = 'DB', handler_params: Optional[Dict[Any, Any]] = None)Sequence[Tuple[str, Optional[str]]]

Uploads physical objects to the remote or some other external location.

Parameters
  • target – Target ObjectManager

  • objects_to_push – List of object IDs to upload.

  • handler – Name of the handler to use to upload objects. Use DB to push them to the remote, FILEto store them in a directory that can be accessed from the client and HTTP to upload them to HTTP.

  • handler_params – For HTTP, a dictionary &lbrace;“username”: username, “password”, password&rbrace;. For FILE, a dictionary &lbrace;“path”: path&rbrace; specifying the directory where the objects shall be saved.

Returns

A list of (object_id, url) that specifies all objects were uploaded (skipping objects that already exist on the remote).

splitgraph.core.output module

splitgraph.core.output.conn_string_to_dict(connection: Optional[str])Dict[str, Any]
splitgraph.core.output.parse_date(string: str)datetime.date
splitgraph.core.output.parse_dt(string: str)datetime.datetime
splitgraph.core.output.parse_repo_tag_or_hash(value, default='latest')
splitgraph.core.output.parse_time(string: str)time.struct_time
splitgraph.core.output.pluralise(word: str, number: int)str

1 banana, 2 bananas

splitgraph.core.output.pretty_size(size: Union[int, float])str

Converts a size in bytes to its string representation (e.g. 1024 -> 1KiB) :param size: Size in bytes

splitgraph.core.output.slugify(text: str, max_length: int = 50)str
splitgraph.core.output.truncate_line(line: str, length: int = 80)str

Truncates a line to a given length, replacing the remainder with …

splitgraph.core.output.truncate_list(items: List[Any], max_entries: int = 10)str

Print a list, possibly truncating it to the specified number of entries

splitgraph.core.registry module

Functions for communicating with the remote Splitgraph catalog

splitgraph.core.registry.get_info_key(engine: PostgresEngine, key: str)Optional[str]

Gets a configuration key from the remote registry, used to notify the client of the registry’s capabilities.

Parameters
  • engine – Engine

  • key – Key to get

splitgraph.core.registry.set_info_key(engine: PostgresEngine, key: str, value: Union[bool, str])None

Sets a configuration value on the remote registry.

Parameters
  • engine – Engine

  • key – Key to set

  • value – New value for the key

splitgraph.core.registry.setup_registry_mode(engine: PostgresEngine)None

Set up access policies/RLS:

  • Normal users aren’t allowed to create tables/schemata (can’t do checkouts inside of a registry or upload SG objects directly to it)

  • Normal users can’t access the splitgraph_meta schema directly: they’re only supposed to be able to talk to it via stored procedures in splitgraph_api. Those procedures are set up with SECURITY INVOKER (run with those users’ credentials) and what they can access is further restricted by RLS:

    • images/tables/tags meta tables: can only create/update/delete records where the namespace = user ID

    • objects/object_location tables: same. An object (piece of data) becomes owned by the user that creates it and still remains so even if someone else’s image starts using it. Hence, the original owner can delete or change it (since they control the external location they’ve uploaded it to anyway).

splitgraph.core.repository module

Public API for managing images in a Splitgraph repository.

class splitgraph.core.repository.Repository(namespace: str, repository: str, engine: Optional[splitgraph.engine.postgres.engine.PostgresEngine] = None, object_engine: Optional[splitgraph.engine.postgres.engine.PostgresEngine] = None, object_manager: Optional[splitgraph.core.object_manager.ObjectManager] = None)

Bases: object

Splitgraph repository API

commit(image_hash: Optional[str] = None, comment: Optional[str] = None, snap_only: bool = False, chunk_size: Optional[int] = None, split_changeset: bool = False, extra_indexes: Optional[Dict[str, Dict[str, Union[List[str], Dict[str, Dict[str, Any]]]]]] = None, in_fragment_order: Optional[Dict[str, List[str]]] = None, overwrite: bool = False)splitgraph.core.image.Image

Commits all pending changes to a given repository, creating a new image.

Parameters
  • image_hash – Hash of the commit. Chosen by random if unspecified.

  • comment – Optional comment to add to the commit.

  • snap_only – If True, will store the table as a full snapshot instead of delta compression

  • chunk_size – For tables that are stored as snapshots (new tables and where snap_only has been passed, the table will be split into fragments of this many rows.

  • split_changeset – If True, splits the changeset into multiple fragments based on the PK regions spanned by the current table fragments. For example, if the original table consists of 2 fragments, first spanning rows 1-10000, second spanning rows 10001-20000 and the change alters rows 1, 10001 and inserts a row with PK 20001, this will record the change as 3 fragments: one inheriting from the first original fragment, one inheriting from the second and a brand new fragment. This increases the number of fragments in total but means that fewer rows will need to be scanned to satisfy a query. If False, the changeset will be stored as a single fragment inheriting from the last fragment in the table.

  • extra_indexes – Dictionary of &lbrace;table: index_type: column: index_specific_kwargs&rbrace;.

  • in_fragment_order – Dictionary of &lbrace;table: list of columns&rbrace;. If specified, will

sort the data inside each chunk by this/these key(s) for each table. :param overwrite: If an object already exists, will force recreate it.

Returns

The newly created Image object.

commit_engines()None

Commit the underlying transactions on both engines that the repository uses.

delete(unregister: bool = True, uncheckout: bool = True)None

Discards all changes to a given repository and optionally all of its history, as well as deleting the Postgres schema that it might be checked out into. Doesn’t delete any cached physical objects.

After performing this operation, this object becomes invalid and must be discarded, unless init() is called again.

Parameters
  • unregister – Whether to purge repository history/metadata

  • uncheckout – Whether to delete the actual checked out repo. This has no effect if the repository is backed by a registry (rather than a local engine).

diff(table_name: str, image_1: Union[splitgraph.core.image.Image, str], image_2: Optional[Union[splitgraph.core.image.Image, str]], aggregate: bool = False)Optional[Union[bool, Tuple[int, int, int], List[Tuple[bool, Tuple]]]]

Compares the state of a table in different images by materializing both tables into a temporary space and comparing them row-to-row.

Parameters
  • table_name – Name of the table.

  • image_1 – First image hash / object. If None, uses the state of the current staging area.

  • image_2 – Second image hash / object. If None, uses the state of the current staging area.

  • aggregate – If True, returns a tuple of integers denoting added, removed and updated rows between the two images.

Returns

If the table doesn’t exist in one of the images, returns True if it was added and False if it was removed. If aggregate is True, returns the aggregation of changes as specified before. Otherwise, returns a list of changes where each change is a tuple of(True for added, False for removed, row contents).

dump(stream: _io.TextIOWrapper, exclude_object_contents: bool = False)None

Creates an SQL dump with the metadata required for the repository and all of its objects.

Parameters
  • stream – Stream to dump the data into.

  • exclude_object_contents – Only dump the metadata but not the actual object contents.

classmethod from_schema(schema: str)splitgraph.core.repository.Repository

Convert a Postgres schema name of the format namespace/repository to a Splitgraph repository object.

classmethod from_template(template: splitgraph.core.repository.Repository, namespace: Optional[str] = None, repository: Optional[str] = None, engine: Optional[splitgraph.engine.postgres.engine.PostgresEngine] = None, object_engine: Optional[splitgraph.engine.postgres.engine.PostgresEngine] = None)splitgraph.core.repository.Repository

Create a Repository from an existing one replacing some of its attributes.

get_all_hashes_tags()List[Tuple[Optional[str], str]]

Gets all tagged images and their hashes in a given repository.

Returns

List of (image_hash, tag)

get_local_size()int

Get the actual size used by this repository’s downloaded objects.

This might still be double-counted if the repository shares objects with other repositores.

Returns

Size of the repository in bytes.

get_size()int

Get the physical size used by the repository’s data, counting objects that are used by multiple images only once. This is calculated from the metadata, the on-disk footprint might be smaller if not all of repository’s objects have been downloaded.

Returns

Size of the repository in bytes.

has_pending_changes()bool

Detects if the repository has any pending changes (schema changes, table additions/deletions, content changes).

property head

Return the HEAD image for the repository or None if the repository isn’t checked out.

property head_strict

Return the HEAD image for the repository. Raise an exception if the repository isn’t checked out.

images

A splitgraph.core.image.ImageManager instance that performs operations (checkout, delete etc) on this repository’s images.

import_tables(tables: Sequence[str], source_repository: splitgraph.core.repository.Repository, source_tables: Sequence[str], image_hash: Optional[str] = None, foreign_tables: bool = False, do_checkout: bool = True, target_hash: Optional[str] = None, table_queries: Optional[Sequence[bool]] = None, parent_hash: Optional[str] = None, wrapper: Optional[str] = 'splitgraph.core.fdw_checkout.QueryingForeignDataWrapper', skip_validation: bool = False)str

Creates a new commit in target_repository with one or more tables linked to already-existing tables. After this operation, the HEAD of the target repository moves to the new commit and the new tables are materialized.

Parameters
  • tables – If not empty, must be the list of the same length as source_tables specifying names to store them under in the target repository.

  • source_repository – Repository to import tables from.

  • source_tables – List of tables to import. If empty, imports all tables.

  • image_hash – Image hash in the source repository to import tables from. Uses the current source HEAD by default.

  • foreign_tables – If True, copies all source tables to create a series of new snapshots instead of treating them as Splitgraph-versioned tables. This is useful for adding brand new tables (for example, from an FDW-mounted table).

  • do_checkout – If False, doesn’t check out the newly created image.

  • target_hash – Hash of the new image that tables is recorded under. If None, gets chosen at random.

  • table_queries – If not [], it’s treated as a Boolean mask showing which entries in the tables list are instead SELECT SQL queries that form the target table. The queries have to be non-schema qualified and work only against tables in the source repository. Each target table created is the result of the respective SQL query. This is committed as a new snapshot.

  • parent_hash – If not None, must be the hash of the image to base the new image on. Existing tables from the parent image are preserved in the new image. If None, the current repository HEAD is used.

  • wrapper – Override the default class for the layered querying foreign data wrapper.

  • skip_validation – Don’t validate SQL used in import statements (used by the Splitfile executor that pre-formats the SQL).

Returns

Hash that the new image was stored under.

init()None

Initializes an empty repo with an initial commit (hash 0000…)

materialized_table(table_name: str, image_hash: Optional[str])Iterator[Tuple[str, str]]

A context manager that returns a pointer to a read-only materialized table in a given image. The table is deleted on exit from the context manager.

Parameters
  • table_name – Name of the table

  • image_hash – Image hash to materialize

Returns

(schema, table_name) where the materialized table is located.

objects

A splitgraph.core.object_manager.ObjectManager instance that performs operations on the objects on this repository’s engine (not just objects belonging to this repository).

pull(download_all: Optional[bool] = False, overwrite_objects: bool = False, overwrite_tags: bool = False, single_image: Optional[str] = None)None

Synchronizes the state of the local Splitgraph repository with its upstream, optionally downloading all new objects created on the remote.

Parameters
  • download_all – If True, downloads all objects and stores them locally. Otherwise, will only download required objects when a table is checked out.

  • overwrite_objects – If True, will overwrite object metadata on the local repository for existing objects.

  • overwrite_tags – If True, will overwrite existing tags.

  • single_image – Limit the download to a single image hash/tag.

push(remote_repository: Optional[splitgraph.core.repository.Repository] = None, overwrite_objects: bool = False, reupload_objects: bool = False, overwrite_tags: bool = False, handler: str = 'DB', handler_options: Optional[Dict[str, Any]] = None, single_image: Optional[str] = None)splitgraph.core.repository.Repository

Inverse of pull: Pushes all local changes to the remote and uploads new objects.

Parameters
  • remote_repository – Remote repository to push changes to. If not specified, the current upstream is used.

  • handler – Name of the handler to use to upload objects. Use DB to push them to the remote or S3to store them in an S3 bucket.

  • overwrite_objects – If True, will overwrite object metadata on the remote repository for existing objects.

  • reupload_objects – If True, will reupload objects for which metadata is uploaded.

  • overwrite_tags – If True, will overwrite existing tags on the remote repository.

  • handler_options – Extra options to pass to the handler. For example, seesplitgraph.hooks.s3.S3ExternalObjectHandler.

  • single_image – Limit the upload to a single image hash/tag.

rollback_engines()None

Rollback the underlying transactions on both engines that the repository uses.

run_sql(sql: Union[psycopg2.sql.Composed, str], arguments: Optional[Any] = None, return_shape: splitgraph.engine.ResultShape = <ResultShape.MANY_MANY: 4>)Any

Execute an arbitrary SQL statement inside of this repository’s checked out schema.

set_tags(tags: Dict[str, Optional[str]])None

Sets tags for multiple images.

Parameters

tags – List of (image_hash, tag)

to_schema()str

Returns the engine schema that this repository gets checked out into.

uncheckout(force: bool = False)None

Deletes the schema that the repository is checked out into

Parameters

force – Discards all pending changes to the schema.

property upstream

The remote upstream repository that this local repository tracks.

splitgraph.core.repository.clone(remote_repository: Union[splitgraph.core.repository.Repository, str], local_repository: Optional[splitgraph.core.repository.Repository] = None, overwrite_objects: bool = False, overwrite_tags: bool = False, download_all: Optional[bool] = False, single_image: Optional[str] = None)splitgraph.core.repository.Repository

Clones a remote Splitgraph repository or synchronizes remote changes with the local ones.

If the target repository has no set upstream engine, the source repository becomes its upstream.

Parameters
  • remote_repository – Remote Repository object to clone or the repository’s name. If a name is passed, the repository will be looked up on the current lookup path in order to find the engine the repository belongs to.

  • local_repository – Local repository to clone into. If None, uses the same name as the remote.

  • download_all – If True, downloads all objects and stores them locally. Otherwise, will only download required objects when a table is checked out.

  • overwrite_objects – If True, will overwrite object metadata on the local repository for existing objects.

  • overwrite_tags – If True, will overwrite existing tags.

  • single_image – If set, only get a single image with this hash/tag from the source.

Returns

A locally cloned Repository object.

splitgraph.core.repository.getrandbits(k)x.  Generates an int with k random bits.
splitgraph.core.repository.import_table_from_remote(remote_repository: splitgraph.core.repository.Repository, remote_tables: List[str], remote_image_hash: str, target_repository: splitgraph.core.repository.Repository, target_tables: List[Any], target_hash: Optional[str] = None)None

Shorthand for importing one or more tables from a yet-uncloned remote. Here, the remote image hash is required, as otherwise we aren’t necessarily able to determine what the remote head is.

Parameters
  • remote_repository – Remote Repository object

  • remote_tables – List of remote tables to import

  • remote_image_hash – Image hash to import the tables from

  • target_repository – Target repository to import the tables to

  • target_tables – Target table aliases

  • target_hash – Hash of the image that’s created with the import. Default random.

splitgraph.core.repository.table_exists_at(repository: splitgraph.core.repository.Repository, table_name: str, image: Optional[splitgraph.core.image.Image] = None)bool

Determines whether a given table exists in a Splitgraph image without checking it out. If image_hash is None, determines whether the table exists in the current staging area.

splitgraph.core.server module

Routines that are run inside of the engine, here so that they can get type- and syntax-checked.

When inside of an LQFDW shim, these are called directly by the Splitgraph core code to avoid a redundant connection to the engine.

splitgraph.core.server.delete_object_files(object_id: str)
splitgraph.core.server.download_object(object_id: str, urls: Tuple[str, str, str])
splitgraph.core.server.get_object_schema(object_id: str)str
splitgraph.core.server.get_object_size(object_id: str)int
splitgraph.core.server.list_objects()List[str]
splitgraph.core.server.object_exists(object_id: str)bool
splitgraph.core.server.rename_object_files(old_object_id: str, new_object_id: str)
splitgraph.core.server.set_object_schema(object_id: str, schema: str)
splitgraph.core.server.upload_object(object_id: str, urls: Tuple[str, str, str])
splitgraph.core.server.verify(url: str)

splitgraph.core.table module

Table metadata-related classes.

class splitgraph.core.table.QueryPlan(table: splitgraph.core.table.Table, quals: Optional[Sequence[Sequence[Tuple[str, str, Any]]]], columns: Sequence[str])

Bases: object

Represents the initial query plan (fragments to query) for given columns and qualifiers.

class splitgraph.core.table.Table(repository: Repository, image: Image, table_name: str, table_schema: List[splitgraph.core.types.TableColumn], objects: List[str])

Bases: object

Represents a Splitgraph table in a given image. Shouldn’t be created directly, use Table-loading methods in the splitgraph.core.image.Image class instead.

get_length()int

Get the number of rows in this table.

This might be smaller than the total number of rows in all objects belonging to this table as some objects might overwrite each other.

Returns

Number of rows in table

get_query_plan(quals: Optional[Sequence[Sequence[Tuple[str, str, Any]]]], columns: Sequence[str], use_cache: bool = True)splitgraph.core.table.QueryPlan

Start planning a query (preliminary steps before object downloading, like qualifier filtering).

Parameters
  • quals – Qualifiers in CNF form

  • columns – List of columns

  • use_cache – If True, will fetch the plan from the cache for the same qualifiers and columns.

Returns

QueryPlan

get_size()int

Get the physical size used by the table’s objects (including those shared with other tables).

This is calculated from the metadata, the on-disk footprint might be smaller if not all of table’s objects have been downloaded.

Returns

Size of the table in bytes.

materialize(destination: str, destination_schema: Optional[str] = None, lq_server: Optional[str] = None, temporary: bool = False)None

Materializes a Splitgraph table in the target schema as a normal Postgres table, potentially downloading all required objects and using them to reconstruct the table.

Parameters
  • destination – Name of the destination table.

  • destination_schema – Name of the destination schema.

  • lq_server – If set, sets up a layered querying FDW for the table instead using this foreign server.

query(columns: List[str], quals: Sequence[Sequence[Tuple[str, str, Any]]])

Run a read-only query against this table without materializing it.

This is a wrapper around query_lazy() that force evaluates the results which might mean more fragments being materialized that aren’t needed.

Parameters
  • columns – List of columns from this table to fetch

  • quals – List of qualifiers in conjunctive normal form. See the documentation for FragmentManager.filter_fragments for the actual format.

Returns

List of dictionaries of results

query_indirect(columns: List[str], quals: Optional[Sequence[Sequence[Tuple[str, str, Any]]]])Tuple[Iterator[bytes], Callable, splitgraph.core.table.QueryPlan]

Run a read-only query against this table without materializing it. Instead of actual results, this returns a generator of SQL queries that the caller can use to get the results as well as a callback that the caller has to run after they’re done consuming the results.

In particular, the query generator will prefer returning direct queries to Splitgraph objects and only when those are exhausted will it start materializing delta-compressed fragments.

This is an advanced method: you probably want to call table.query().

Parameters
  • columns – List of columns from this table to fetch

  • quals – List of qualifiers in conjunctive normal form. See the documentation for FragmentManager.filter_fragments for the actual format.

Returns

Generator of queries (bytes), a callback and a query plan object (containing stats that are fully populated after the callback has been called to end the query).

query_lazy(columns: List[str], quals: Sequence[Sequence[Tuple[str, str, Any]]])Iterator[Iterator[Dict[str, Any]]]

Run a read-only query against this table without materializing it.

Parameters
  • columns – List of columns from this table to fetch

  • quals – List of qualifiers in conjunctive normal form. See the documentation for FragmentManager.filter_fragments for the actual format.

Returns

Generator of dictionaries of results.

reindex(extra_indexes: Dict[str, Union[List[str], Dict[str, Dict[str, Any]]]], raise_on_patch_objects=True)List[str]

Run extra indexes on all objects in this table and update their metadata. This only works on objects that don’t have any deletions or upserts (have a deletion hash of 000000…).

Parameters
  • extra_indexes – Dictionary of &lbrace;index_type: column: index_specific_kwargs&rbrace;.

  • raise_on_patch_objects – If True, will raise an exception if any objects in the table overwrite any other objects. If False, will log a warning but will reindex all non-patch objects.

:returns List of objects that were reindexed.

splitgraph.core.table.merge_index_data(current_index: Dict[str, Any], new_index: Dict[str, Any])

splitgraph.core.types module

class splitgraph.core.types.Comparable

Bases: object

class splitgraph.core.types.TableColumn(ordinal, name, pg_type, is_pk, comment)

Bases: tuple

property comment

Alias for field number 4

property is_pk

Alias for field number 3

property name

Alias for field number 1

property ordinal

Alias for field number 0

property pg_type

Alias for field number 2

splitgraph.core.types.dict_to_tableschema(tables: Dict[str, Dict[str, Any]])Dict[str, List[splitgraph.core.types.TableColumn]]
splitgraph.core.types.tableschema_to_dict(tables: Dict[str, List[splitgraph.core.types.TableColumn]])Dict[str, Dict[str, str]]