gdi-dataset-tool — README
- Version: 0.5
- Status: DRAFT
- Created: 2026-02-25
- Updated: 2026-02-25
- To be consulted: Erik Jaaniso (erik.jaaniso@ut.ee)
Overview
- Purpose: Prepare a dataset package for upload to the GDI node by validating the metadata and generating the manifest file, optionally converting samples.csv to Parquet, extracting VCF-derived Parquet files, assembling a TAR, and encrypting it with Crypt4GH.
- Output (package): <dataset.internalId>.tar.c4gh is written next to the input metadata YAML (dataset.internalId is read from the metadata). Other subcommands are described below; package is the primary workflow.
- Language/Runtime: Python 3.13 or Docker.
CLI
- Primary workflow: package
  gdi-dataset-tool package PATH [--public-key FILE] [--no-headers]
  - PATH: path to metadata YAML (package metadata).
  - --public-key FILE: PEM-formatted Crypt4GH public key file for the node; if omitted, the key is fetched from the node endpoint.
  - --no-headers: do not include VCF headers in the package. By default, VCF headers are included under headers/<md5>.vcf.
  - Output: <internalId>.tar.c4gh is written next to the metadata YAML. Fails if the file already exists.
- Global option:
  - -v, --verbose: enable debug logging (overrides LOG_LEVEL).
- Other subcommands:
  - ls — list members of an encrypted package
    gdi-dataset-tool ls PACKAGE [--order name|size|mtime[-desc]]
  - unpack — decrypt and extract an encrypted package
    gdi-dataset-tool unpack PACKAGE [-d DIR]
    - Default destination: ./<package-name-without-.tar.c4gh> in the current working directory.
  - pack — create a package from an existing directory (must contain a validated manifest.yaml and Parquet files)
    gdi-dataset-tool pack DIR [-o FILE] [--public-key FILE]
    - Default output file: ./<internalId>.tar.c4gh in the current working directory if -o is not specified.
  - manifest — print manifest.yaml from a package or generate it from a metadata YAML
    gdi-dataset-tool manifest PATH
    - PATH can be <package>.tar.c4gh or <package>.yaml.
  - samples — print the samples.parquet stored in an encrypted package
    gdi-dataset-tool samples PACKAGE [-n N] [-s SEP] [-c COLS]
    - -n N: limit the number of data rows (0 prints only the header).
    - -s SEP: output field separator (default: TAB).
    - -c COLS: comma-separated list of column names to print (case-insensitive).
- Alternative invocation: python -m gdi_dataset_tool <subcommand> …
- Errors: UserError indicates an expected, user-fixable error and is printed without a stack trace; other unexpected errors are logged with details.
- Colored logs are printed to stdout; a progress bar renders to stderr.
Environment variables
- LOG_LEVEL: logging threshold (DEBUG, INFO, WARNING, ERROR). Default: INFO. Overridden by -v/--verbose.
- C4GH_PASSPHRASE: passphrase for the provider's private key (if encrypted). Optional.
- TMPDIR, TEMP, TMP: override the temporary working directory used for intermediate files (VCF→Parquet outputs and the TAR before encryption). Otherwise, the system default is used (e.g., /tmp).
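The precedence between -v/--verbose and LOG_LEVEL can be sketched as follows. This is a minimal illustration of the documented behavior; `resolve_log_level` is a hypothetical helper, not the tool's actual function:

```python
import logging
import os

def resolve_log_level(verbose: bool) -> int:
    """-v/--verbose wins; otherwise fall back to LOG_LEVEL, then INFO."""
    if verbose:
        return logging.DEBUG
    name = os.environ.get("LOG_LEVEL", "INFO").upper()
    # Unknown names fall back to the documented default (INFO).
    return getattr(logging, name, logging.INFO)
```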
Building
To build a wheel file and a standalone executable:
./build.sh
This script will:
- Build a wheel file and put it into dist/ (along with its md5sum).
- Build a standalone executable using PyInstaller and put it into dist/ (along with its md5sum).
- Place intermediate build artifacts into build/.
Installation
- From source:
  pip install .
- Or, given a prebuilt wheel (built using poetry build or ./build.sh):
  pip install dist/gdi_dataset_tool-<version>-py3-none-any.whl
- Or, use the standalone executable (built using ./build.sh):
  ./dist/gdi-dataset-tool-<version> <subcommand> …
- Or, build a Docker image:
  docker build -t gdi-dataset-tool .
Examples
- Create a package from metadata (primary usage):
gdi-dataset-tool package path/to/package.yaml
python -m gdi_dataset_tool package path/to/package.yaml
- With explicit node public key file and debug logging:
LOG_LEVEL=DEBUG gdi-dataset-tool package path/to/package.yaml --public-key node.pub
gdi-dataset-tool --verbose package path/to/package.yaml --public-key node.pub
- Docker (using the wrapper script; builds the image if not present). Notes:
  - The wrapper mirrors the Python CLI and auto-mounts paths.
  - For package, ls, manifest, samples: the input's directory is the working directory inside the container.
  - For unpack and pack: your host current working directory (CWD) is the container working directory; the input is mounted read-only at /in.
  - If you need to reference files outside the mounted directories, use absolute host paths under /host (read-only).
  ./gdi-dataset-tool.sh package path/to/package.yaml
  ./gdi-dataset-tool.sh package path/to/package.yaml --no-headers
  ./gdi-dataset-tool.sh ls path/to/package.tar.c4gh --order size-desc
  ./gdi-dataset-tool.sh unpack path/to/package.tar.c4gh -d ./dir
  ./gdi-dataset-tool.sh pack path/to/dir -o ./package.tar.c4gh
  ./gdi-dataset-tool.sh manifest path/to/package.tar.c4gh
  ./gdi-dataset-tool.sh manifest path/to/package.yaml
  ./gdi-dataset-tool.sh samples path/to/package.tar.c4gh -n 5 -s ";" -c SAMPLE,SEX
  # With explicit node public key via host mapping
  LOG_LEVEL=DEBUG ./gdi-dataset-tool.sh package path/to/package.yaml --public-key /host/path/to/node.pub
Inputs
- package:
  - <package>.yaml (metadata YAML). VCF files and any additional files (BAM, CRAM, FASTA, FASTQ, SAM, VCF, TEXT, PDF) referenced in the metadata must exist; relative paths are resolved against the YAML directory.
  - If dataset.aggregated == false, samples.csv must be present next to the YAML and non-empty.
- pack:
  - A directory containing:
    - manifest.yaml (validated/created earlier).
    - Beacon Parquet files: at least one file matching variants.chr<chr>.<group>.br<n>.<md5>.parquet or allele-freq.chr<chr>.<group>.br<n>.<md5>.parquet.
    - If any files matching variants.* are present, samples.parquet is required.
- unpack:
- <package>.tar.c4gh file.
- ls, samples:
- <package>.tar.c4gh file.
- manifest:
- Either <package>.tar.c4gh or <package>.yaml.
Output
- package:
- Writes <dataset.internalId>.tar.c4gh next to the input metadata YAML. The command fails if the file already exists.
- pack:
- Writes <internalId>.tar.c4gh to the current working directory by default, or to -o/--output if provided (the extension is normalized to .tar.c4gh).
- unpack:
- Extracts files to -d/--dest if provided; otherwise to ./<package-name-without-.tar.c4gh> in the current working directory. Prompts before overwriting existing files.
- ls, manifest, samples:
- Write human-readable output to stdout; they do not create or modify files.
Contents of an encrypted package TAR (.tar.c4gh):
- manifest.yaml
- samples.parquet (present only when dataset.aggregated == false)
- VCF-derived Parquet files:
- aggregated=true: allele-freq.chr{CHR}.{GROUP}.br{N}.{MD5}.parquet
- aggregated=false: variants.chr{CHR}.{GROUP}.br{N}.{MD5}.parquet where {N} is the block range in bases (from dataset.blockRange, default 10000000), {GROUP} = POS // N (or 0 when N = 0), and {MD5} is the MD5 of the source VCF file.
- VCF headers (optional; included by default):
- headers/<md5>.vcf where <md5> is the checksum of the source VCF file listed in the manifest
Packaging steps (package.py)
- Load node public key (file or endpoint) (crypto.py).
- If --public-key FILE is provided, the file must be a PEM Crypt4GH public key.
- Otherwise, GET https://gdi.ut.ee/system/crypt4gh/key.
- Ensure provider keypair (generate if missing) and load private/public keys (crypto.py).
- Provider keypair is stored as ~/.c4gh/key (private key) and ~/.c4gh/key.pub (public key).
- If keypair exists, it is validated; otherwise a fresh keypair is generated in-place (without a passphrase).
- If the private key is encrypted and C4GH_PASSPHRASE is set, the passphrase is used for loading it.
- Load and validate metadata YAML (manifest.py).
- Details: see package.yaml
- Build manifest (manifest.py).
- Example: see manifest.yaml
- dataset.aggregated is intentionally omitted from manifest; it influences processing only.
- dataset.generatedBy is set to <tool> v<version> to identify the producing tool and version.
- In files section, the first group corresponds to vcfFiles with computed md5/size for each file.
- Then, additional groups appear as specified; md5/size are verified against provided values where applicable.
- Relative paths are resolved against the metadata YAML’s directory.
- If aggregated=false, parse samples.csv and convert to Parquet (samples.py).
- Details: see samples.csv
- Process VCFs to Parquet in a temporary directory; validate ordering and constraints; write files per chromosome/group (vcf.py).
- Details: see VCF processing
- The total number of variants processed across all VCF files is tracked and stored in dataset.recordsCount in the manifest.
- Create TAR (no compression) containing manifest.yaml, optional samples.parquet, generated Parquet files, and (by default) VCF headers under headers/<md5>.vcf (package.py).
- Encrypt TAR to Crypt4GH (.tar.c4gh) for recipients [node, provider] (crypto.py).
package.yaml

manifest.yaml

samples.csv
- Location: same directory as the metadata YAML (expected filename: samples.csv).
- Delimiter: auto-detected among , ; \t |
- Required header columns (case-insensitive, order-insensitive):
- SAMPLE, INDIVIDUAL, SOURCE, DATE_COLLECTED, AGE_COLLECTED, SEX, ETHNICITY, GEOGRAPHIC_ORIGIN
- Row validation:
- Non-empty values for all required columns.
- No spaces in SAMPLE or INDIVIDUAL.
- DATE_COLLECTED must be YYYY-MM-DD (ISO date) and is stored in Parquet as date32.
- AGE_COLLECTED must match regex for ISO 8601 period P<y>Y<m>M<d>D, e.g. P18Y6M3D; numbers are not range-validated; individual parts (Y, M, D) are optional.
- SOURCE ∈ {BLOOD, OTHER}; SEX ∈ {MALE, FEMALE, OTHER}; ETHNICITY ∈ {ESTONIAN, OTHER}; GEOGRAPHIC_ORIGIN ∈ {ESTONIA, OTHER}.
- Values for SOURCE, SEX, ETHNICITY, and GEOGRAPHIC_ORIGIN are case-insensitive and normalized to uppercase.
- SAMPLE values must be unique.
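The delimiter sniffing and row checks above can be sketched with the standard csv module. This is a minimal illustration of the documented rules, assuming plain stdlib parsing; the actual samples.py (which also writes Parquet with a date32 column) may be stricter:

```python
import csv
import io
import re
from datetime import date

REQUIRED = ["SAMPLE", "INDIVIDUAL", "SOURCE", "DATE_COLLECTED",
            "AGE_COLLECTED", "SEX", "ETHNICITY", "GEOGRAPHIC_ORIGIN"]
# ISO 8601 period with optional Y/M/D parts, e.g. P18Y6M3D
AGE_RE = re.compile(r"P(?:\d+Y)?(?:\d+M)?(?:\d+D)?")
ENUMS = {
    "SOURCE": {"BLOOD", "OTHER"},
    "SEX": {"MALE", "FEMALE", "OTHER"},
    "ETHNICITY": {"ESTONIAN", "OTHER"},
    "GEOGRAPHIC_ORIGIN": {"ESTONIA", "OTHER"},
}

def read_samples(text: str) -> list[dict]:
    # Auto-detect the delimiter among , ; \t | from the header line.
    dialect = csv.Sniffer().sniff(text.splitlines()[0], delimiters=",;\t|")
    reader = csv.reader(io.StringIO(text), dialect=dialect)
    header = [h.upper() for h in next(reader)]  # case-insensitive columns
    if missing := set(REQUIRED) - set(header):
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows, seen = [], set()
    for values in reader:
        row = dict(zip(header, values))
        if not all(row.get(c) for c in REQUIRED):
            raise ValueError("empty value in a required column")
        if " " in row["SAMPLE"] or " " in row["INDIVIDUAL"]:
            raise ValueError("spaces are not allowed in SAMPLE/INDIVIDUAL")
        date.fromisoformat(row["DATE_COLLECTED"])  # must be YYYY-MM-DD
        if not AGE_RE.fullmatch(row["AGE_COLLECTED"]):
            raise ValueError(f"bad AGE_COLLECTED: {row['AGE_COLLECTED']}")
        for col, allowed in ENUMS.items():
            row[col] = row[col].upper()  # normalize to uppercase
            if row[col] not in allowed:
                raise ValueError(f"bad {col}: {row[col]}")
        if row["SAMPLE"] in seen:
            raise ValueError(f"duplicate SAMPLE: {row['SAMPLE']}")
        seen.add(row["SAMPLE"])
        rows.append(row)
    return rows
```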
VCF processing
- Input formats: .vcf or .vcf.gz.
- Chromosomes accepted: 1–22, X, Y, M (MT is normalized to M). chr prefix is tolerated and stripped; labels are uppercased. Others cause a validation error.
- Variant position: POS is converted to 0-based in output (POS-1).
- REF validation: must be uppercase and contain only A/C/G/T/N.
- ALT support: only literal A/C/G/T/N strings are supported. Ignored cases: symbolic alleles (<…>), breakend notation ([]), missing value (.), single breakends (leading/trailing .), and overlapping deletion (*). Any other invalid ALT raises an error.
- Variant type (VT): derived from ref/alt after trimming common prefix/suffix: SNP if len==1 and bases differ; INDEL if lengths differ; UNKNOWN otherwise.
- Variants must be in non-decreasing POS order within a chromosome.
- Parquet partitioning is by chromosome and by GROUP = POS // N, where N is the block range in bases from metadata field dataset.blockRange (default 10000000). If N = 0, no splitting is performed and GROUP = 0 (a single file per chromosome).
- Chromosomes cannot re-appear after being written to Parquet (within the same input VCF file).
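The normalization and partitioning rules above can be sketched as follows. The helper names (`normalize_chrom`, `variant_type`, `group_of`) are illustrative, not the tool's API, and the trimming order for VT is an assumption consistent with the stated classification:

```python
VALID_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y", "M"}

def normalize_chrom(label: str) -> str:
    """Strip an optional 'chr' prefix, uppercase, and map MT -> M."""
    c = label.upper().removeprefix("CHR")
    if c == "MT":
        c = "M"
    if c not in VALID_CHROMS:
        raise ValueError(f"unsupported chromosome: {label}")
    return c

def variant_type(ref: str, alt: str) -> str:
    """Classify after trimming the common prefix/suffix of REF and ALT."""
    while ref and alt and ref[-1] == alt[-1]:      # common suffix
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:        # common prefix
        ref, alt = ref[1:], alt[1:]
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"        # single differing base
    if len(ref) != len(alt):
        return "INDEL"      # lengths differ
    return "UNKNOWN"

def group_of(pos: int, block_range: int) -> int:
    """Parquet partition index: GROUP = POS // N, or 0 when N = 0."""
    return 0 if block_range == 0 else pos // block_range
```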
Non-aggregated mode (dataset.aggregated=false):
- Output filename: variants.chr{CHR}.{GROUP}.br{N}.{MD5}.parquet
- samples.csv is mandatory and must not be empty.
- VCF sample names must match exactly the SAMPLE set from samples.csv (order-insensitive). Any difference raises an error.
- Genotype handling: haploid or diploid (ploidy must not exceed 2); rows with all samples matching REF or with unknown genotypes are skipped.
- For each ALT with at least one sample carrying it, a row is written with columns:
- POS: int32, REF, ALT, VT, SAMPLES where SAMPLES is a compacted comma-separated list of zero-based sample index ranges (e.g., “0-3,5,7-8”).
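The SAMPLES range compaction described above can be sketched like this (a hypothetical helper showing the output format, not the tool's internal function):

```python
def compact_ranges(indexes) -> str:
    """Compact zero-based sample indexes into 'a-b' range notation,
    e.g. [0, 1, 2, 3, 5, 7, 8] -> '0-3,5,7-8'."""
    indexes = sorted(set(indexes))
    if not indexes:
        return ""
    parts = []
    start = prev = indexes[0]
    for i in indexes[1:]:
        if i == prev + 1:          # extend the current run
            prev = i
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = i           # begin a new run
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)
```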
Aggregated mode (dataset.aggregated=true):
- Output filename: allele-freq.chr{CHR}.{GROUP}.br{N}.{MD5}.parquet
- samples.csv is ignored.
- VCF sample names are ignored.
- Genotype information is ignored.
- INFO fields validated (header): AF, AC, AC_Hom, AC_Het, AC_Hemi must have Number=A; AN may be Number=1 (scalar) or Number=A (list). All AN* fields must use the same Number value. Values must be non-negative numbers.
- Population-aware INFO fields are supported: AF_<POP>, AC_<POP>, AC_<POP>_Het, AC_<POP>_Hom, AC_<POP>_Hemi, AN_<POP> with optional sex suffix _M or _F (e.g., AF_EE, AC_FI_M, AC_FI_M_Het, AN_M). <POP> must be a two-letter uppercase code.
- Plain (non-suffixed) fields are mapped to population label Total and can coexist with population-specific fields.
- Fields like AF_*_Het or AF_*_Hom are ignored (they are not used as allele-frequency inputs), e.g. AF_FI_Het, AF_EE_Hom, AF_FI_M_Hom, AF_EE_F_Het.
- For each ALT allele that passes validation, rows are written with columns: POS: int32, REF, ALT, VT, POPULATION, AF: float32, AC: int32, AC_HOM: int32, AC_HET: int32, AC_HEMI: int32, AN: int32.
- Rows are emitted only when the AF INFO field for that population is present in the header and has a value for the given population+ALT. The tool does not synthesize AF from AC/AN.
- Other INFO fields may be missing per record (written as nulls), and record-only INFO values not declared in the header are ignored.
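The population-aware field naming grammar above can be captured with a regular expression. This is a sketch under the stated rules (two-letter uppercase population, optional _M/_F sex suffix, optional Het/Hom/Hemi suffix, plain fields mapped to Total); the exact grammar accepted by the tool may differ:

```python
import re

# base, then optional population, sex, and zygosity suffixes, in that order
FIELD_RE = re.compile(
    r"(AF|AC|AN)"
    r"(?:_(?P<pop>[A-Z]{2}))?"
    r"(?:_(?P<sex>[MF]))?"
    r"(?:_(?P<kind>Het|Hom|Hemi))?"
)

def parse_info_field(name: str):
    """Split an INFO field name into (base, population, sex, zygosity),
    or return None for names that are not allele-frequency/count inputs."""
    m = FIELD_RE.fullmatch(name)
    if not m:
        return None
    base, kind = m.group(1), m.group("kind")
    if base == "AF" and kind:
        return None  # AF_*_Het / AF_*_Hom are ignored as inputs
    return base, m.group("pop") or "Total", m.group("sex"), kind
```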