splitgraph.ingestion package

Module contents

Submodules

splitgraph.ingestion.common module

class splitgraph.ingestion.common.IngestionAdapter

Bases: object

abstract create_ingestion_table(data, engine, schema: str, table: str, **kwargs)
abstract data_to_new_table(data, engine, schema: str, table: str, no_header: bool = True, **kwargs)
abstract query_to_data(engine, query: str, schema: Optional[str] = None, **kwargs)
to_data(query: str, image: Optional[Union[splitgraph.core.image.Image, str]] = None, repository: Optional[splitgraph.core.repository.Repository] = None, use_lq: bool = False, **kwargs)
to_table(data, repository: splitgraph.core.repository.Repository, table: str, if_exists: str = 'patch', schema_check: bool = True, no_header: bool = False, **kwargs)
splitgraph.ingestion.common.build_commandline_help(json_schema)
splitgraph.ingestion.common.dedupe_sg_schema(schema_spec: List[splitgraph.core.types.TableColumn], prefix_len: int = 59) → List[splitgraph.core.types.TableColumn]

Some foreign schemas have columns longer than 63 characters whose first 63 characters are identical across several columns (e.g. odn.data.socrata.com). Since PostgreSQL truncates identifiers to 63 characters, such columns would collide; this routine renames columns in a schema to prevent that, giving duplicates a numeric suffix.
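A minimal sketch of this deduplication strategy (hypothetical helper; the exact suffix format dedupe_sg_schema emits may differ):

```python
from typing import Dict, List


def dedupe_names(names: List[str], prefix_len: int = 59) -> List[str]:
    # Track how many columns we have seen with each truncated prefix.
    seen: Dict[str, int] = {}
    result = []
    for name in names:
        prefix = name[:prefix_len]
        if prefix in seen:
            seen[prefix] += 1
            # Duplicates get a numeric suffix so the truncated names stay unique.
            result.append(prefix + "_%d" % seen[prefix])
        else:
            seen[prefix] = 0
            result.append(name)
    return result


cols = ["a" * 70 + "_first", "a" * 70 + "_second", "short"]
deduped = dedupe_names(cols)
```

The first column with a given prefix keeps its original name; only subsequent collisions are renamed.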

splitgraph.ingestion.common.merge_tables(engine: splitgraph.engine.postgres.engine.PsycopgEngine, source_schema: str, source_table: str, source_schema_spec: List[splitgraph.core.types.TableColumn], target_schema: str, target_table: str, target_schema_spec: List[splitgraph.core.types.TableColumn])
splitgraph.ingestion.common.schema_compatible(source_schema: List[splitgraph.core.types.TableColumn], target_schema: List[splitgraph.core.types.TableColumn]) → bool

Quick check to see if a dataframe with target_schema can be written into source_schema. There are some implicit type conversions that SQLAlchemy/Pandas can do, so we don’t want to fail immediately if the column types aren’t exactly the same (e.g. bigint vs numeric). Most errors should be caught by PG itself.

Schema is a list of (ordinal, name, type, is_pk).
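A quick check of this kind might compare only column names and ordering, leaving type coercion to SQLAlchemy/Pandas and letting Postgres reject genuinely incompatible values. A hypothetical sketch, not the actual implementation:

```python
from typing import List, Tuple

# Schema: a list of (ordinal, name, type, is_pk) tuples, as described above.
Schema = List[Tuple[int, str, str, bool]]


def schemas_compatible(source: Schema, target: Schema) -> bool:
    # Compare column names in order; deliberately ignore the type field,
    # since e.g. bigint can often be written into numeric implicitly.
    return [col[1] for col in source] == [col[1] for col in target]


a = [(1, "id", "bigint", True), (2, "val", "numeric", False)]
b = [(1, "id", "integer", True), (2, "val", "text", False)]
c = [(1, "id", "bigint", True), (2, "other", "numeric", False)]
```

Here `a` and `b` are treated as compatible despite differing types, while `c` is rejected because the column names differ.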

splitgraph.ingestion.inference module

splitgraph.ingestion.inference.infer_sg_schema(sample: List[Tuple[str, ...]], override_types: Optional[Dict[str, Any]], primary_keys: Optional[List[str]] = None)
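Schema inference over a sample of string tuples typically tries progressively wider types per column. A simplified sketch of the idea (the type names and precedence order here are assumptions, not infer_sg_schema’s actual rules):

```python
from typing import List, Tuple


def infer_column_type(values: List[str]) -> str:
    # Try the narrowest candidate type first, falling back to text.
    for type_name, caster in [("integer", int), ("numeric", float)]:
        try:
            for v in values:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "character varying"


def infer_types(sample: List[Tuple[str, ...]]) -> List[str]:
    # Transpose the row-oriented sample into columns and infer each column.
    return [infer_column_type(list(col)) for col in zip(*sample)]


sample = [("1", "2.5", "x"), ("2", "3.0", "y")]
```

Every value in a column must parse under a candidate type for that type to be chosen.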
splitgraph.ingestion.inference.parse_boolean(boolean: str)

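Boolean parsing from CSV-like text usually accepts a handful of spellings. A hedged sketch of the technique (the set of spellings parse_boolean actually accepts is an assumption):

```python
def parse_bool(value: str) -> bool:
    # Accept common true/false spellings, case-insensitively. This list is
    # illustrative and may differ from parse_boolean's actual behaviour.
    normalized = value.strip().lower()
    if normalized in ("t", "true", "1", "yes"):
        return True
    if normalized in ("f", "false", "0", "no"):
        return False
    raise ValueError("Not a boolean: %r" % value)
```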
splitgraph.ingestion.pandas module

Routines that ingest/export CSV files to/from Splitgraph images using Pandas

class splitgraph.ingestion.pandas.PandasIngestionAdapter

Bases: splitgraph.ingestion.common.IngestionAdapter

static create_ingestion_table(data, engine, schema: str, table: str, **kwargs)
static data_to_new_table(data, engine: PsycopgEngine, schema: str, table: str, no_header: bool = True, **kwargs)
static query_to_data(engine, query: str, schema: Optional[str] = None, **kwargs)
splitgraph.ingestion.pandas.df_to_table(df: Union[pandas.core.series.Series, pandas.core.frame.DataFrame], repository: splitgraph.core.repository.Repository, table: str, if_exists: str = 'patch', schema_check: bool = True) → None

Writes a Pandas DataFrame to a checked-out Splitgraph table. Doesn’t create a new image.

Parameters
  • df – Pandas DataFrame to insert.

  • repository – Splitgraph Repository object. Must be checked out.

  • table – Table name.

  • if_exists – Behaviour if the table already exists: ‘patch’ means that primary keys that already exist in the table will be updated and ones that don’t will be inserted; ‘replace’ means that the table will be dropped and recreated.

  • schema_check – If False, skips checking that the dataframe is compatible with the target schema.
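The ‘patch’ semantics (upsert on primary key) can be illustrated with a pure-pandas sketch; patch_table is a hypothetical helper, not part of splitgraph:

```python
import pandas as pd


def patch_table(existing: pd.DataFrame, patch: pd.DataFrame, pk: str) -> pd.DataFrame:
    # Rows whose primary key already exists are updated; new keys are inserted.
    merged = pd.concat([existing, patch])
    # keep="last" retains the patch's version of any duplicated key.
    deduped = merged.drop_duplicates(subset=[pk], keep="last")
    return deduped.sort_values(pk).reset_index(drop=True)


existing = pd.DataFrame({"id": [1, 2], "val": ["a", "b"]})
patch = pd.DataFrame({"id": [2, 3], "val": ["B", "c"]})
result = patch_table(existing, patch, "id")
```

Row 1 is untouched, row 2 is updated to the patch’s value, and row 3 is inserted; ‘replace’ would instead discard `existing` entirely.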

splitgraph.ingestion.pandas.df_to_table_fast(engine: PsycopgEngine, df: Union[pandas.core.series.Series, pandas.core.frame.DataFrame], target_schema: str, target_table: str)
splitgraph.ingestion.pandas.sql_to_df(sql: str, image: Optional[Union[splitgraph.core.image.Image, str]] = None, repository: Optional[splitgraph.core.repository.Repository] = None, use_lq: bool = False, **kwargs) → pandas.core.frame.DataFrame

Executes an SQL query against a Splitgraph image, returning the result.

Extra **kwargs are passed to Pandas’ read_sql_query.

Parameters
  • sql – SQL query to execute.

  • image – Image object, image hash/tag (str) or None (use the currently checked out image).

  • repository – Repository the image belongs to. Must be set if image is a hash/tag or None.

  • use_lq – Whether to use layered querying or check out the image if it’s not checked out.

Returns

A Pandas dataframe.