gdi-dataset-tool — README

  • Version: 0.5
  • Status: DRAFT
  • Created: 2026-02-25
  • Updated: 2026-02-25

Overview

  • Purpose: Prepare a dataset package for upload to the GDI node by validating the metadata YAML, building a manifest file, optionally converting samples.csv to Parquet, generating VCF-derived Parquet files, assembling a TAR, and encrypting it with Crypt4GH.

  • Output (package): <dataset.internalId>.tar.c4gh is written next to the input metadata YAML (dataset.internalId is read from the metadata). Other subcommands are described below; package is the primary workflow.

  • Language/Runtime: Python 3.13 or Docker.

CLI

  • Primary workflow: package

    gdi-dataset-tool package PATH [--public-key FILE] [--no-headers]

    • PATH: path to metadata YAML (package metadata).

    • --public-key FILE: PEM-formatted Crypt4GH public key file for the node; if omitted, fetched from node endpoint.

    • --no-headers: do not include VCF headers in the package. By default, VCF headers are included under headers/<md5>.vcf.

    • Output: <internalId>.tar.c4gh is written next to the metadata YAML. Fails if the file already exists.

  • Global option:

    • -v, --verbose: enable debug logging (overrides LOG_LEVEL).
  • Other subcommands:

    • ls — list members of an encrypted package
      gdi-dataset-tool ls PACKAGE [--order name|size|mtime[-desc]]

    • unpack — decrypt and extract an encrypted package
      gdi-dataset-tool unpack PACKAGE [-d DIR]

      • Default destination: ./<package-name-without-.tar.c4gh> in the current working directory.
    • pack — create a package from an existing directory (must contain a validated manifest.yaml and Parquet files)
      gdi-dataset-tool pack DIR [-o FILE] [--public-key FILE]

      • Default output file: ./<internalId>.tar.c4gh in the current working directory if -o is not specified.
    • manifest — print manifest.yaml from a package or generate it from a metadata YAML
      gdi-dataset-tool manifest PATH

      • PATH can be <package>.tar.c4gh or <package>.yaml.
    • samples — print the samples.parquet stored in an encrypted package
      gdi-dataset-tool samples PACKAGE [-n N] [-s SEP] [-c COLS]

      • -n N: limit the number of data rows (0 prints only the header).
      • -s SEP: output field separator (default: TAB).
      • -c COLS: comma-separated list of column names to print (case-insensitive).
  • Alternative invocation: python -m gdi_dataset_tool <subcommand> …

  • Errors: UserError indicates an expected, user-fixable error and is printed without a stack trace; other unexpected errors are logged with details.

  • Colored logs are printed to stdout; a progress bar renders to stderr.

Environment variables

  • LOG_LEVEL: logging threshold (DEBUG, INFO, WARNING, ERROR). Default INFO. Overridden by -v/--verbose.
  • C4GH_PASSPHRASE: passphrase for the provider’s private key (if encrypted). Optional.
  • TMPDIR, TEMP, TMP: override the temporary working directory used for intermediate files (VCF→Parquet outputs and the TAR before encryption). Otherwise, system default is used (e.g., /tmp).

Building

To build a wheel file and a standalone executable:
./build.sh

This script will:

  • Build a wheel file and put it into dist/ (along with its md5sum).
  • Build a standalone executable using PyInstaller and put it into dist/ (along with its md5sum).
  • Place intermediate build artifacts into build/.

Installation

  • From source
    pip install .

  • Or, given a prebuilt wheel (built using poetry build or ./build.sh):
    pip install dist/gdi_dataset_tool-<version>-py3-none-any.whl

  • Or, use the standalone executable (built using ./build.sh):
    ./dist/gdi-dataset-tool-<version> <subcommand> …

  • Or, build a Docker image:
    docker build -t gdi-dataset-tool .

Examples

  • Create a package from metadata (primary usage):
    gdi-dataset-tool package path/to/package.yaml

    python -m gdi_dataset_tool package path/to/package.yaml

  • With explicit node public key file and debug logging:
    LOG_LEVEL=DEBUG gdi-dataset-tool package path/to/package.yaml --public-key node.pub

    gdi-dataset-tool --verbose package path/to/package.yaml --public-key node.pub

  • Docker (using wrapper script; builds the image if not present). Notes:

    • The wrapper mirrors the Python CLI and auto-mounts paths.

    • For package, ls, manifest, samples: the input’s directory is the working directory inside the container.

    • For unpack and pack: your host current working directory (CWD) is the container working directory; the input is mounted read-only at /in.

    • If you need to reference files outside the mounted directories, use absolute host paths under /host (read-only).

    ./gdi-dataset-tool.sh package path/to/package.yaml

    ./gdi-dataset-tool.sh package path/to/package.yaml --no-headers

    ./gdi-dataset-tool.sh ls path/to/package.tar.c4gh --order size-desc

    ./gdi-dataset-tool.sh unpack path/to/package.tar.c4gh -d ./dir

    ./gdi-dataset-tool.sh pack path/to/dir -o ./package.tar.c4gh

    ./gdi-dataset-tool.sh manifest path/to/package.tar.c4gh

    ./gdi-dataset-tool.sh manifest path/to/package.yaml

    ./gdi-dataset-tool.sh samples path/to/package.tar.c4gh -n 5 -s ";" -c SAMPLE,SEX

    # With explicit node public key via host mapping

    LOG_LEVEL=DEBUG ./gdi-dataset-tool.sh package path/to/package.yaml --public-key /host/path/to/node.pub

Inputs

  • package:
    • <package>.yaml (metadata YAML). VCF files and any additional files (BAM, CRAM, FASTA, FASTQ, SAM, VCF, TEXT, PDF) referenced in the metadata must exist; relative paths are resolved against the YAML directory.
    • If dataset.aggregated == false, samples.csv must be present next to the YAML and non-empty.
  • pack:
    • A directory containing:
      • manifest.yaml (validated/created earlier).
      • Beacon Parquet files: at least one file matching variants.chr<chr>.<group>.br<n>.<md5>.parquet or allele-freq.chr<chr>.<group>.br<n>.<md5>.parquet.
      • If any files matching variants.* are present, samples.parquet is required.
  • unpack:
    • <package>.tar.c4gh file.
  • ls, samples:
    • <package>.tar.c4gh file.
  • manifest:
    • Either <package>.tar.c4gh or <package>.yaml.

Output

  • package:
    • Writes <dataset.internalId>.tar.c4gh next to the input metadata YAML. The command fails if the file already exists.
  • pack:
    • Writes <internalId>.tar.c4gh to the current working directory by default, or to -o/--output if provided (the extension is normalized to .tar.c4gh).
  • unpack:
    • Extracts files to -d/--dest if provided; otherwise to ./<package-name-without-.tar.c4gh> in the current working directory. Prompts before overwriting existing files.
  • ls, manifest, samples:
    • Write human-readable output to stdout; they do not create or modify files.

Contents of an encrypted package TAR (.tar.c4gh):

  • manifest.yaml
  • samples.parquet (present only when dataset.aggregated == false)
  • VCF-derived Parquet files:
    • aggregated=true: allele-freq.chr{CHR}.{GROUP}.br{N}.{MD5}.parquet
    • aggregated=false: variants.chr{CHR}.{GROUP}.br{N}.{MD5}.parquet where {N} is the block range in bases (from dataset.blockRange, default 10000000), {GROUP} = POS // N (or 0 when N = 0), and {MD5} is the MD5 of the source VCF file.
  • VCF headers (optional; included by default):
    • headers/<md5>.vcf where <md5> is the checksum of the source VCF file listed in the manifest
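The GROUP and file-name rules above can be sketched as follows. This is a minimal illustration of the documented naming scheme; the function name and signature are hypothetical, not the tool's internals.

```python
def parquet_name(aggregated: bool, chrom: str, pos: int,
                 block_range: int, vcf_md5: str) -> str:
    """Build the Parquet member name for a variant at 0-based position pos."""
    prefix = "allele-freq" if aggregated else "variants"
    # GROUP = POS // N, or 0 when the block range N is 0 (no splitting)
    group = 0 if block_range == 0 else pos // block_range
    return f"{prefix}.chr{chrom}.{group}.br{block_range}.{vcf_md5}.parquet"

print(parquet_name(False, "1", 12_345_678, 10_000_000,
                   "d41d8cd98f00b204e9800998ecf8427e"))
# variants.chr1.1.br10000000.d41d8cd98f00b204e9800998ecf8427e.parquet
```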

Packaging steps (package.py)

  1. Load node public key (file or endpoint) (crypto.py).
  2. Ensure provider keypair (generate if missing) and load private/public keys (crypto.py).
    • Provider keypair is stored as ~/.c4gh/key (private key) and ~/.c4gh/key.pub (public key).
    • If keypair exists, it is validated; otherwise a fresh keypair is generated in-place (without a passphrase).
    • If the private key is encrypted and C4GH_PASSPHRASE is set, the passphrase is used for loading it.
  3. Load and validate metadata YAML (manifest.py).
  4. Build manifest (manifest.py).
    • Example: see manifest.yaml
    • dataset.aggregated is intentionally omitted from manifest; it influences processing only.
    • dataset.generatedBy is set to <tool> v<version> to identify the producing tool and version.
    • In files section, the first group corresponds to vcfFiles with computed md5/size for each file.
    • Then, additional groups appear as specified; md5/size are verified against provided values where applicable.
    • Relative paths are resolved against the metadata YAML’s directory.
  5. If aggregated=false, parse samples.csv and convert to Parquet (samples.py).
  6. Process VCFs to Parquet in a temporary directory; validate ordering and constraints; write files per chromosome/group (vcf.py).
    • Details: see VCF processing
    • The total number of variants processed across all VCF files is tracked and stored in dataset.recordsCount in the manifest.
  7. Create TAR (no compression) containing manifest.yaml, optional samples.parquet, generated Parquet files, and (by default) VCF headers under headers/<md5>.vcf (package.py).
  8. Encrypt TAR to Crypt4GH (.tar.c4gh) for recipients [node, provider] (crypto.py).
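Step 7 can be sketched with the standard tarfile module. This is a simplified, hypothetical illustration of the TAR layout (function and path names are assumptions); the real package.py also manages the temporary working directory, and step 8 then encrypts this TAR with Crypt4GH for the node and provider recipients.

```python
import tarfile
from pathlib import Path

def build_tar(workdir: Path, tar_path: Path, include_headers: bool = True) -> None:
    """Assemble the uncompressed package TAR from a prepared working directory."""
    with tarfile.open(tar_path, "w") as tar:  # mode "w" = plain TAR, no compression
        tar.add(workdir / "manifest.yaml", arcname="manifest.yaml")
        # samples.parquet (if present) and the VCF-derived Parquet files
        for pq in sorted(workdir.glob("*.parquet")):
            tar.add(pq, arcname=pq.name)
        # VCF headers are included by default under headers/<md5>.vcf
        if include_headers:
            for hdr in sorted(workdir.glob("headers/*.vcf")):
                tar.add(hdr, arcname=f"headers/{hdr.name}")
```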


samples.csv

  • Location: same directory as the metadata YAML (expected filename: samples.csv).
  • Delimiter: auto-detected among , ; \t |
  • Required header columns (case-insensitive, order-insensitive):
    • SAMPLE, INDIVIDUAL, SOURCE, DATE_COLLECTED, AGE_COLLECTED, SEX, ETHNICITY, GEOGRAPHIC_ORIGIN
  • Row validation:
    • Non-empty values for all required columns.
    • No spaces in SAMPLE or INDIVIDUAL.
    • DATE_COLLECTED must be YYYY-MM-DD (ISO date) and is stored in Parquet as date32.
    • AGE_COLLECTED must match regex for ISO 8601 period P<y>Y<m>M<d>D, e.g. P18Y6M3D; numbers are not range-validated; individual parts (Y, M, D) are optional.
    • SOURCE ∈ {BLOOD, OTHER}; SEX ∈ {MALE, FEMALE, OTHER}; ETHNICITY ∈ {ESTONIAN, OTHER}; GEOGRAPHIC_ORIGIN ∈ {ESTONIA, OTHER}.
    • Values for SOURCE, SEX, ETHNICITY, and GEOGRAPHIC_ORIGIN are case-insensitive and normalized to uppercase.
    • SAMPLE values must be unique.
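The row checks above can be sketched as follows. This mirrors the documented rules only and assumes the required columns are already present in the row dict; it is not the tool's actual samples.py code, and whether a bare "P" period is accepted is an assumption.

```python
import re
from datetime import date

# ISO 8601 period P<y>Y<m>M<d>D; each part optional, numbers not range-checked
AGE_RE = re.compile(r"^P(?:\d+Y)?(?:\d+M)?(?:\d+D)?$")
ENUMS = {
    "SOURCE": {"BLOOD", "OTHER"},
    "SEX": {"MALE", "FEMALE", "OTHER"},
    "ETHNICITY": {"ESTONIAN", "OTHER"},
    "GEOGRAPHIC_ORIGIN": {"ESTONIA", "OTHER"},
}

def validate_row(row: dict) -> dict:
    out = {}
    for key, value in row.items():
        if not value:
            raise ValueError(f"empty {key}")  # all required columns non-empty
        out[key] = value
    for key in ("SAMPLE", "INDIVIDUAL"):
        if " " in out[key]:
            raise ValueError(f"spaces not allowed in {key}")
    date.fromisoformat(out["DATE_COLLECTED"])  # must be YYYY-MM-DD
    if not AGE_RE.match(out["AGE_COLLECTED"]):
        raise ValueError("bad AGE_COLLECTED")
    for key, allowed in ENUMS.items():
        out[key] = out[key].upper()  # enum values are case-insensitive
        if out[key] not in allowed:
            raise ValueError(f"bad {key}: {out[key]}")
    return out
```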

VCF processing

  • Input formats: .vcf or .vcf.gz.
  • Chromosomes accepted: 1–22, X, Y, M (MT is normalized to M). chr prefix is tolerated and stripped; labels are uppercased. Others cause a validation error.
  • Variant position: POS is converted to 0-based in output (POS-1).
  • REF validation: must be uppercase and contain only A/C/G/T/N.
  • ALT support: only literal A/C/G/T/N strings are supported. Ignored cases: symbolic alleles (<…>), breakend notation ([]), missing value (.), single breakends (leading/trailing .), and overlapping deletion (*). Any other invalid ALT raises an error.
  • Variant type (VT): derived from ref/alt after trimming common prefix/suffix: SNP if len==1 and bases differ; INDEL if lengths differ; UNKNOWN otherwise.
  • Variants must be in non-decreasing POS order within a chromosome.
  • Parquet partitioning is by chromosome and by GROUP = POS // N, where N is the block range in bases from metadata field dataset.blockRange (default 10000000). If N = 0, no splitting is performed and GROUP = 0 (a single file per chromosome).
  • Chromosomes cannot re-appear after being written to Parquet (within the same input VCF file).
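The chromosome normalization and VT derivation rules above can be sketched as follows; names are illustrative and the trimming order (prefix first, then suffix) is an assumption about the tool's vcf.py.

```python
VALID_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y", "M"}

def normalize_chrom(label: str) -> str:
    c = label.upper()                 # labels are uppercased
    if c.startswith("CHR"):
        c = c[3:]                     # "chr" prefix is tolerated and stripped
    if c == "MT":
        c = "M"                       # MT is normalized to M
    if c not in VALID_CHROMS:
        raise ValueError(f"unsupported chromosome: {label}")
    return c

def variant_type(ref: str, alt: str) -> str:
    # Trim the common prefix, then the common suffix, then classify
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNP"
    if len(ref) != len(alt):
        return "INDEL"
    return "UNKNOWN"
```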

Non-aggregated mode (dataset.aggregated=false):

  • Output filename: variants.chr{CHR}.{GROUP}.br{N}.{MD5}.parquet
  • samples.csv is mandatory and must not be empty.
  • VCF sample names must match exactly the SAMPLE set from samples.csv (order-insensitive). Any difference raises an error.
  • Genotype handling: haploid or diploid (ploidy must not exceed 2); rows with all samples matching REF or with unknown genotypes are skipped.
  • For each ALT with at least one sample carrying it, a row is written with columns:
    • POS: int32, REF, ALT, VT, SAMPLES where SAMPLES is a compacted comma-separated list of zero-based sample index ranges (e.g., “0-3,5,7-8”).
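The SAMPLES compaction described above can be sketched as follows (an illustration of the documented format, assuming a sorted, duplicate-free list of zero-based sample indices; the function name is hypothetical):

```python
def compact_ranges(indices: list[int]) -> str:
    """Compact sorted sample indices into a range string, e.g. "0-3,5,7-8"."""
    parts = []
    i = 0
    while i < len(indices):
        j = i
        # Extend the run while indices stay consecutive
        while j + 1 < len(indices) and indices[j + 1] == indices[j] + 1:
            j += 1
        parts.append(str(indices[i]) if i == j else f"{indices[i]}-{indices[j]}")
        i = j + 1
    return ",".join(parts)

print(compact_ranges([0, 1, 2, 3, 5, 7, 8]))  # 0-3,5,7-8
```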

Aggregated mode (dataset.aggregated=true):

  • Output filename: allele-freq.chr{CHR}.{GROUP}.br{N}.{MD5}.parquet
  • samples.csv is ignored.
  • VCF sample names are ignored.
  • Genotype information is ignored.
  • INFO fields validated (header): AF, AC, AC_Hom, AC_Het, AC_Hemi must have Number=A; AN may be Number=1 (scalar) or Number=A (list). All AN* fields must use the same Number value. Values must be non-negative numbers.
  • Population-aware INFO fields are supported: AF_<POP>, AC_<POP>, AC_<POP>_Het, AC_<POP>_Hom, AC_<POP>_Hemi, AN_<POP> with optional sex suffix _M or _F (e.g., AF_EE, AC_FI_M, AC_FI_M_Het, AN_M). <POP> must be a two-letter uppercase code.
    • Plain (non-suffixed) fields are mapped to population label Total and can coexist with population-specific fields.
    • Fields like AF_*_Het or AF_*_Hom are ignored (they are not used as allele-frequency inputs), e.g. AF_FI_Het, AF_EE_Hom, AF_FI_M_Hom, AF_EE_F_Het.
  • For each ALT allele that passes validation, rows are written with columns: POS: int32, REF, ALT, VT, POPULATION, AF: float32, AC: int32, AC_HOM: int32, AC_HET: int32, AC_HEMI: int32, AN: int32.
  • Rows are emitted only when the AF INFO field for that population is present in the header and has a value for the given population+ALT. The tool does not synthesize AF from AC/AN.
  • Other INFO fields may be missing per record (written as nulls), and record-only INFO values not declared in the header are ignored.
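The population-aware field names above can be parsed with a sketch like the following. The regex and the combined "Total_M"-style labels for sex-only suffixes are assumptions for illustration, not the tool's actual parser.

```python
import re

FIELD_RE = re.compile(
    r"^(AF|AC|AN)"            # base field
    r"(?:_([A-Z]{2}))?"       # optional two-letter population code, e.g. EE, FI
    r"(?:_([MF]))?"           # optional sex suffix _M or _F
    r"(?:_(Hom|Het|Hemi))?$"  # optional zygosity suffix
)

def parse_info_field(name: str):
    """Split an INFO field name into (base, population label, zygosity) or None."""
    m = FIELD_RE.match(name)
    if not m:
        return None
    base, pop, sex, zyg = m.groups()
    label = pop or "Total"    # plain fields map to population label "Total"
    if sex:
        label = f"{label}_{sex}"
    return base, label, zyg
```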