
splitgraph.ingestion.csv package

Submodules

splitgraph.ingestion.csv.common module

class splitgraph.ingestion.csv.common.CSVOptions(autodetect_header, autodetect_dialect, autodetect_encoding, autodetect_sample_size, schema_inference_rows, delimiter, quotechar, header, encoding, ignore_decode_errors)

Bases: tuple

autodetect_dialect: bool

Alias for field number 1

autodetect_encoding: bool

Alias for field number 2

autodetect_header: bool

Alias for field number 0

autodetect_sample_size: int

Alias for field number 3

delimiter: str

Alias for field number 5

encoding: str

Alias for field number 8

classmethod from_fdw_options(fdw_options)
header: bool

Alias for field number 7

ignore_decode_errors: bool

Alias for field number 9

quotechar: str

Alias for field number 6

schema_inference_rows: int

Alias for field number 4

to_csv_kwargs()
to_table_options()

Turn this into a dict of table options that can be plugged back into CSVDataSource.
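Since CSVOptions is a NamedTuple (note the `Bases: tuple` and the field aliases above), it can be constructed by keyword and round-tripped through `to_table_options()`. A minimal sketch; the option values are illustrative, not the library defaults, and the exact keys returned by `to_table_options()` are not shown here:

```python
from splitgraph.ingestion.csv.common import CSVOptions

# Build the options tuple explicitly; the values below are illustrative.
options = CSVOptions(
    autodetect_header=False,
    autodetect_dialect=False,
    autodetect_encoding=False,
    autodetect_sample_size=65536,
    schema_inference_rows=100000,
    delimiter=";",
    quotechar='"',
    header=True,
    encoding="utf-8",
    ignore_decode_errors=False,
)

# Serialize back into string-valued table options that CSVDataSource accepts.
table_options = options.to_table_options()
```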

splitgraph.ingestion.csv.common.autodetect_csv(stream: io.RawIOBase, csv_options: splitgraph.ingestion.csv.common.CSVOptions) → splitgraph.ingestion.csv.common.CSVOptions

Autodetect the CSV dialect, encoding, header etc.
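A sketch of running auto-detection over an in-memory sample; `BytesIO` stands in here for the raw byte stream the wrapper would normally read from S3 or HTTP, and the sample data is illustrative:

```python
import io
from splitgraph.ingestion.csv.common import CSVOptions, autodetect_csv

# Illustrative sample; in practice this would be the start of an S3 object
# or an HTTP response body.
sample = b"id;name\n1;alice\n2;bob\n"

# Positional fields follow the field numbers listed above:
# (autodetect_header, autodetect_dialect, autodetect_encoding,
#  autodetect_sample_size, schema_inference_rows, delimiter, quotechar,
#  header, encoding, ignore_decode_errors)
options = CSVOptions(True, True, True, 65536, 100000, ",", '"', True, "utf-8", False)

detected = autodetect_csv(io.BytesIO(sample), options)
print(detected.delimiter, detected.encoding, detected.header)
```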

splitgraph.ingestion.csv.common.dump_options(options: Dict[str, Any]) → Dict[str, str]
splitgraph.ingestion.csv.common.get_s3_params(fdw_options: Dict[str, Any]) → Tuple[minio.api.Minio, str, str]
splitgraph.ingestion.csv.common.load_options(options: Dict[str, str]) → Dict[str, Any]
splitgraph.ingestion.csv.common.log_to_postgres(*args, **kwargs)
splitgraph.ingestion.csv.common.make_csv_reader(response: io.IOBase, csv_options: splitgraph.ingestion.csv.common.CSVOptions) → Tuple[splitgraph.ingestion.csv.common.CSVOptions, _csv._reader]
splitgraph.ingestion.csv.common.pad_csv_row(row: List[str], num_cols: int, row_number: int) → List[str]

Preprocess a CSV file row to make the parser more robust.
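For illustration, a hedged sketch combining make_csv_reader and pad_csv_row; the stream, column count and padding behaviour shown are assumptions, not a verbatim recipe:

```python
import io
from splitgraph.ingestion.csv.common import CSVOptions, make_csv_reader, pad_csv_row

options = CSVOptions(True, True, True, 65536, 100000, ",", '"', True, "utf-8", False)

# make_csv_reader returns the finalized options together with a csv reader
# over the decoded stream.
final_options, reader = make_csv_reader(io.BytesIO(b"a,b,c\n1,2\n"), options)

for row_number, row in enumerate(reader):
    # Normalize ragged rows to the expected column count before handing them
    # downstream (the exact padding behaviour is an assumption).
    print(pad_csv_row(row, num_cols=3, row_number=row_number))
```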

splitgraph.ingestion.csv.fdw module

class splitgraph.ingestion.csv.fdw.CSVForeignDataWrapper(fdw_options, fdw_columns)

Bases: object

Foreign data wrapper for CSV files stored in S3 buckets or served over HTTP.

can_sort(sortkeys)
execute(quals, columns, sortkeys=None)

Main Multicorn entry point.

explain(quals, columns, sortkeys=None, verbose=False)
get_rel_size(quals, columns)
classmethod import_schema(schema, srv_options, options, restriction_type, restricts)
splitgraph.ingestion.csv.fdw.log_to_postgres(*args, **kwargs)
splitgraph.ingestion.csv.fdw.report_errors(table_name: str)

Context manager that catches exceptions and, instead of propagating them, serializes them to JSON using PG's notice mechanism. The data source is meant to load these notices to report partial failures (e.g. one table failed to load while others succeeded).
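For illustration, a hedged sketch of how this context manager might be used in a per-table loop (`process_table` is a hypothetical helper, not part of this API):

```python
from splitgraph.ingestion.csv.fdw import report_errors

# An exception raised while handling one table is serialized to JSON and
# emitted as a PostgreSQL notice instead of propagating, so the remaining
# tables still get processed.
for table_name in ["customers", "orders"]:
    with report_errors(table_name):
        process_table(table_name)  # hypothetical helper
```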

Module contents

class splitgraph.ingestion.csv.CSVDataSource(engine: PostgresEngine, credentials: Credentials, params: Params, tables: Optional[Union[List[str], Dict[str, Tuple[List[splitgraph.core.types.TableColumn], TableParams]]]] = None)

Bases: splitgraph.hooks.data_source.fdw.ForeignDataWrapperDataSource

commandline_help: str = 'Mount CSV files in S3/HTTP.\n\nIf passed an URL, this will live query a CSV file on an HTTP server. If passed\nS3 access credentials, this will scan a bucket for CSV files, infer their schema\nand make them available to query over SQL.  \n\nFor example:  \n\n```\nsgr mount csv target_schema -o@- <<EOF\n  {\n    "s3_endpoint": "cdn.mycompany.com:9000",\n    "s3_access_key": "ABCDEF",\n    "s3_secret_key": "GHIJKL",\n    "s3_bucket": "data",\n    "s3_object_prefix": "csv_files/current/",\n    "autodetect_header": true,\n    "autodetect_dialect": true,\n    "autodetect_encoding": true\n  }\nEOF\n```\n'
commandline_kwargs_help: str = "s3_access_key:\ns3_secret_key:\nconnection:\nautodetect_header: Detect whether the CSV file has a header automatically.\nautodetect_dialect: Detect the CSV file's dialect (separator, quoting characters etc) automatically.\nautodetect_encoding: Detect the CSV file's encoding automatically.\nautodetect_sample_size: Sample size, in bytes, for encoding/dialect/header detection.\nschema_inference_rows: Number of rows to use for schema inference.\nencoding: Encoding of the CSV file.\nignore_decode_errors: Ignore errors when decoding the file.\nheader: First line of the CSV file is its header.\ndelimiter: Character used to separate fields in the file.\nquotechar: Character used to quote fields."
credentials_schema: Dict[str, Any] = {'properties': {'s3_access_key': {'type': 'string'}, 's3_secret_key': {'type': 'string'}}, 'type': 'object'}
classmethod from_commandline(engine, commandline_kwargs) → splitgraph.ingestion.csv.CSVDataSource

Instantiate an FDW data source from commandline arguments.
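For illustration, a minimal sketch of instantiating the data source directly and mounting it. The engine helper, the credential/params dict shapes (which follow credentials_schema and params_schema below) and the mount() call are assumptions here, not a verbatim recipe:

```python
from splitgraph.core.engine import get_engine
from splitgraph.ingestion.csv import CSVDataSource

engine = get_engine()  # assumes a locally configured Splitgraph engine

source = CSVDataSource(
    engine,
    credentials={"s3_access_key": "ABCDEF", "s3_secret_key": "GHIJKL"},
    params={
        "connection": {
            "connection_type": "s3",
            "s3_endpoint": "cdn.mycompany.com:9000",
            "s3_bucket": "data",
            "s3_object_prefix": "csv_files/current/",
        },
        "autodetect_header": True,
        "autodetect_dialect": True,
        "autodetect_encoding": True,
    },
)

# Make all discovered CSV files queryable in a local schema
# (supports_mount = True below); exact signature is an assumption.
source.mount("target_schema")
```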

classmethod get_description() → str
get_fdw_name()
classmethod get_name() → str
get_raw_url(tables: Optional[Union[List[str], Dict[str, Tuple[List[splitgraph.core.types.TableColumn], TableParams]]]] = None, expiry: int = 3600) → Dict[str, List[Tuple[str, str]]]

Get a list of public URLs for each table in this data source, e.g. to export the data as CSV. These may be temporary (e.g. pre-signed S3 URLs) but should be accessible without authentication.

Parameters: tables – A TableInfo object overriding the table params of the source; expiry – The URL should be valid for at least this many seconds

Returns: Dict of table_name -> list of (mimetype, raw URL)
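A sketch, reusing the hypothetical `source` instance from the example above; the table name is illustrative:

```python
# Request URLs valid for at least an hour for a single (hypothetical) table.
urls = source.get_raw_url(tables=["users"], expiry=3600)

for mimetype, url in urls.get("users", []):
    print(mimetype, url)  # e.g. a pre-signed S3 URL
```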

get_remote_schema_name() → str

Override this if the FDW supports IMPORT FOREIGN SCHEMA

get_server_options()
get_table_options(table_name: str, tables: Optional[Union[List[str], Dict[str, Tuple[List[splitgraph.core.types.TableColumn], TableParams]]]] = None) → Dict[str, str]
classmethod migrate_params(params: Params) → Params
params_schema: Dict[str, Any] = {'properties': {'autodetect_dialect': {'default': True, 'description': "Detect the CSV file's dialect (separator, quoting characters etc) automatically", 'type': 'boolean'}, 'autodetect_encoding': {'default': True, 'description': "Detect the CSV file's encoding automatically", 'type': 'boolean'}, 'autodetect_header': {'default': True, 'description': 'Detect whether the CSV file has a header automatically', 'type': 'boolean'}, 'autodetect_sample_size': {'default': 65536, 'description': 'Sample size, in bytes, for encoding/dialect/header detection', 'type': 'integer'}, 'connection': {'oneOf': [{'type': 'object', 'required': ['connection_type', 'url'], 'properties': {'connection_type': {'type': 'string', 'const': 'http'}, 'url': {'type': 'string', 'description': 'HTTP URL to the CSV file'}}}, {'type': 'object', 'required': ['connection_type', 's3_endpoint', 's3_bucket'], 'properties': {'connection_type': {'type': 'string', 'const': 's3'}, 's3_endpoint': {'type': 'string', 'description': 'S3 endpoint (including port if required)'}, 's3_region': {'type': 'string', 'description': 'Region of the S3 bucket'}, 's3_secure': {'type': 'boolean', 'description': 'Whether to use HTTPS for S3 access'}, 's3_bucket': {'type': 'string', 'description': 'Bucket the object is in'}, 's3_object': {'type': 'string', 'description': 'Limit the import to a single object'}, 's3_object_prefix': {'type': 'string', 'description': 'Prefix for object in S3 bucket'}}}], 'type': 'object'}, 'delimiter': {'default': ',', 'description': 'Character used to separate fields in the file', 'type': 'string'}, 'encoding': {'default': 'utf-8', 'description': 'Encoding of the CSV file', 'type': 'string'}, 'header': {'default': True, 'description': 'First line of the CSV file is its header', 'type': 'boolean'}, 'ignore_decode_errors': {'default': False, 'description': 'Ignore errors when decoding the file', 'type': 'boolean'}, 'quotechar': {'default': '"', 'description': 'Character used to quote fields', 'type': 'string'}, 'schema_inference_rows': {'default': 100000, 'description': 'Number of rows to use for schema inference', 'type': 'integer'}}, 'type': 'object'}
supports_load = True
supports_mount = True
supports_sync = False
table_params_schema: Dict[str, Any] = {'properties': {'autodetect_dialect': {'default': True, 'description': "Detect the CSV file's dialect (separator, quoting characters etc) automatically", 'type': 'boolean'}, 'autodetect_encoding': {'default': True, 'description': "Detect the CSV file's encoding automatically", 'type': 'boolean'}, 'autodetect_header': {'default': True, 'description': 'Detect whether the CSV file has a header automatically', 'type': 'boolean'}, 'autodetect_sample_size': {'default': 65536, 'description': 'Sample size, in bytes, for encoding/dialect/header detection', 'type': 'integer'}, 'delimiter': {'default': ',', 'description': 'Character used to separate fields in the file', 'type': 'string'}, 'encoding': {'default': 'utf-8', 'description': 'Encoding of the CSV file', 'type': 'string'}, 'header': {'default': True, 'description': 'First line of the CSV file is its header', 'type': 'boolean'}, 'ignore_decode_errors': {'default': False, 'description': 'Ignore errors when decoding the file', 'type': 'boolean'}, 'quotechar': {'default': '"', 'description': 'Character used to quote fields', 'type': 'string'}, 's3_object': {'description': 'S3 object of the CSV file', 'type': 'string'}, 'schema_inference_rows': {'default': 100000, 'description': 'Number of rows to use for schema inference', 'type': 'integer'}, 'url': {'description': 'HTTP URL to the CSV file', 'type': 'string'}}, 'type': 'object'}
class splitgraph.ingestion.csv.CSVIngestionAdapter

Bases: splitgraph.ingestion.common.IngestionAdapter

static create_ingestion_table(data, engine, schema: str, table: str, **kwargs)
static data_to_new_table(data, engine: PostgresEngine, schema: str, table: str, no_header: bool = True, **kwargs)
static query_to_data(engine, query: str, schema: Optional[str] = None, **kwargs)
splitgraph.ingestion.csv.copy_csv_buffer(data, engine: PsycopgEngine, schema: str, table: str, no_header: bool = False, **kwargs)

Copy CSV data from a buffer into a given schema/table

splitgraph.ingestion.csv.query_to_csv(engine: PsycopgEngine, query, buffer, schema: Optional[str] = None)
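For illustration, a hedged sketch round-tripping data through copy_csv_buffer and query_to_csv, assuming a locally configured engine and an existing staging.users table with matching columns (both the schema and table names are placeholders):

```python
import io
from splitgraph.core.engine import get_engine
from splitgraph.ingestion.csv import copy_csv_buffer, query_to_csv

engine = get_engine()  # assumes the usual local Splitgraph configuration

# Load an in-memory CSV (with a header row) into staging.users.
copy_csv_buffer(
    io.StringIO("id,name\n1,alice\n2,bob\n"),
    engine,
    schema="staging",
    table="users",
    no_header=False,
)

# Export a query result back out as CSV into a writable buffer.
out = io.StringIO()
query_to_csv(engine, "SELECT id, name FROM users", out, schema="staging")
print(out.getvalue())
```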