Pushing datasets

Pushing data to other engines

Like Git, sgr allows sharing datasets with other sgr engines in a decentralized fashion. The two-engine example showcases pushing data between two engines linked together with docker-compose.

Merging changes and pull requests

sgr does not currently support merging changes or pull requests. Instead, it treats images as immutable: on push, it uploads every image that does not yet exist on the remote engine. You can optionally overwrite image tags that already exist, or push/pull only a single image.
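As a sketch of this push model, the commands below show pushing a whole repository and then a single image. The repository name and image tag are hypothetical, and exact flags can differ between sgr versions; consult `sgr push --help` on your installation.

```shell
# Push all local images of a repository to its upstream remote engine
# (repository name is hypothetical)
sgr push myuser/mydata

# Push only a single image, referenced by tag
# (illustrative; see `sgr push --help` for the exact syntax your version accepts)
sgr push myuser/mydata:v1
```

Because images are immutable, re-pushing the same image is a no-op unless you explicitly ask to overwrite an existing tag on the remote.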

We have found that pull requests aren't as useful for continuously changing datasets as they are for code. Most workflows ingest data from outside sources (CSV files, scraped data, other databases), so any changes or patches are better applied to the source or to an intermediate dataset than to the final dataset after it has been ingested.

In this respect, sgr is closer to Docker than to Git. There is no point in submitting a pull request against a Docker image, since any changes will be overwritten the next time the image is built. Instead, developers propose changes to the code that builds the image. Similarly, sgr supports a workflow in which developers submit pull requests against the Splitfile that builds the dataset, so that proposed changes persist in future versions of the dataset.
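To make the Splitfile workflow concrete, the sketch below writes a minimal Splitfile and builds a dataset from it. All repository, table, and column names are hypothetical, and the output flag may differ across sgr versions (check `sgr build --help`); in a pull-request workflow, it is this file that contributors would propose changes to.

```shell
# Write a minimal Splitfile (repository and table names are hypothetical)
cat > example.splitfile <<'EOF'
# Import a table from an upstream dataset
FROM someuser/source_data IMPORT {SELECT * FROM readings WHERE value IS NOT NULL} AS readings

# Derive a new table from the imported data
SQL { CREATE TABLE daily_avg AS SELECT day, avg(value) AS avg_value FROM readings GROUP BY day }
EOF

# Rebuild the dataset from the Splitfile
# (output flag is illustrative; see `sgr build --help`)
sgr build example.splitfile -o someuser/derived_data
```

Rebuilding from the Splitfile regenerates the derived dataset, so accepted changes survive every future rebuild, just as merged Dockerfile changes survive every future image build.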

You can approximate a workflow similar to pull requests by using tags and forwarding a tag to the agreed-upon "master" version of the dataset. By default, like Docker, sgr includes a latest tag, which is a "floating" tag pointing to the image most recently pushed to its repository.
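A minimal sketch of the floating-tag pattern follows. The repository name and image hashes are hypothetical, and moving an existing tag may require an extra confirmation or flag depending on your sgr version (see `sgr tag --help`).

```shell
# Tag a specific image as the agreed-upon "master" version
# (image hash is hypothetical)
sgr tag myuser/mydata:0f3e4a5 master

# After review, forward the tag to a newer image
sgr tag myuser/mydata:9b2c1d8 master

# Consumers check out whatever "master" currently points to
sgr checkout myuser/mydata:master
```

Consumers who always check out the master tag effectively track the reviewed version of the dataset, without sgr needing any merge machinery.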