Dataset Preparation
- Version: 0.5
- Status: DRAFT
- Created: 2026-03-02
- Updated: 2026-03-02
- To be consulted:
- Kersti Jääger (kersti.jaager@ut.ee)
Preparing dataset files
Data-providers first need to generate the input files for the dataset tool. The GDI data preparation tool accepts:
- VCF files as genomic data files
- Metadata file in YAML format describing the dataset (package.yaml)
- Samples file in CSV format (only when sharing individual-level data)
Dataset metadata must specify whether a dataset includes aggregated or individual-level data (aggregated: true or false).
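The rule that the samples file accompanies only individual-level data can be checked before packaging. The following is a minimal sketch, not part of gdi-dataset-tool: the function name check_inputs is hypothetical, and it assumes the metadata has already been parsed into a dictionary with a top-level aggregated key and that the samples file is named samples.csv.

```python
from pathlib import Path


def check_inputs(metadata: dict, package_dir: Path,
                 samples_name: str = "samples.csv") -> list[str]:
    """Return a list of problems; an empty list means the inputs look consistent.

    Assumption: 'aggregated' is a top-level boolean in the parsed package.yaml.
    """
    problems = []
    aggregated = metadata.get("aggregated")
    if not isinstance(aggregated, bool):
        problems.append("metadata must set 'aggregated' to true or false")
        return problems

    has_samples = (package_dir / samples_name).is_file()
    if aggregated and has_samples:
        problems.append("aggregated dataset should not include a samples file")
    if not aggregated and not has_samples:
        problems.append("individual-level dataset requires a samples file")
    return problems
```

For example, check_inputs({"aggregated": False}, Path("package_dir")) reports a problem when package_dir contains no samples.csv.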
Example VCF (aggregated)

Metadata file (package.yaml)

Samples file (csv)

Preparing encrypted dataset package
Once all the input files have been generated, run gdi-dataset-tool to generate a dataset package. The package can then be registered in the GDI, which makes the dataset visible and findable in the User Portal.

Diagram 2. gdi-dataset-tool validates, transforms and collects input data files - a metadata file, a VCF file, and a samples file (only for individual-level data) - into a dataset package, and encrypts the package for secure upload.
The packaging involves:
- running basic sanity checks on the content of the provided files (VCF, package.yaml, samples.csv) and on the consistency between them
- converting the original files into Parquet files
- validating the metadata and writing it to a manifest file (manifest.yaml)
- packaging the files into a TAR file
- encrypting the TAR file with the public Crypt4GH key of the node and the public key of the data-provider (the latter allows the data-provider to decrypt the package)
The samples file should be omitted for summary-level (aggregated) datasets. All of these steps are carried out by a single command:
python3 -m gdi_dataset_tool package package_dir/package.yaml
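To illustrate two of the packaging steps listed above (recording file sizes and MD5 checksums, and collecting the files into a TAR file), here is a minimal standard-library sketch. It is not the code of gdi-dataset-tool: the function names describe_file and build_tar are hypothetical, and Parquet conversion and Crypt4GH encryption are deliberately left out.

```python
import hashlib
import tarfile
from pathlib import Path


def describe_file(path: Path, root: Path) -> dict:
    """Return the relative path, size in bytes, and MD5 checksum of one file."""
    md5 = hashlib.md5()
    with path.open("rb") as fh:
        # Read in 1 MiB chunks so large genomic files do not fill memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            md5.update(chunk)
    return {
        "path": path.relative_to(root).as_posix(),
        "size": path.stat().st_size,
        "md5": md5.hexdigest(),
    }


def build_tar(root: Path, tar_path: Path) -> list[dict]:
    """Write every file under root into an uncompressed TAR; return file entries.

    The file list is collected before the archive is opened, so tar_path
    should point outside root or to a file that does not exist yet.
    """
    files = sorted(p for p in root.rglob("*") if p.is_file())
    entries = [describe_file(p, root) for p in files]
    with tarfile.open(tar_path, "w") as tar:
        for p in files:
            tar.add(p, arcname=p.relative_to(root).as_posix())
    return entries
```

The returned entries correspond to the kind of per-file information (path, size, MD5 checksum) that a manifest would carry; the exact manifest layout used by the tool is not shown here.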
The command outputs an encrypted dataset package (<dataset.internalId>.tar.c4gh) that can be deposited in the Shared storage (MinIO S3) or uploaded to GDI Estonia via the local User Portal. The package includes:
- A dataset manifest (manifest.yaml): name, description, keywords; file paths, sizes, MD5 checksums
- Beacon data: variants-chr{c}.{p}.parquet or allele_freq-chr{c}.{p}.parquet
- Sample data (only for individual-level data): samples.parquet
- VCF file headers (for internal use in the local portal): headers/{checksum}.vcf
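The file naming scheme above can be expressed as a small set of patterns, which is handy when inspecting the contents of a decrypted package. This is a sketch under assumptions: the helper classify and the category labels are hypothetical, the placeholders {c}, {p} and {checksum} are treated as opaque strings without dots or slashes, and file names are assumed to be relative to the package root.

```python
import re

# One pattern per file type listed above; placeholders are captured as-is.
PATTERNS = {
    "manifest": re.compile(r"manifest\.yaml"),
    "beacon_variants": re.compile(
        r"variants-chr(?P<c>[^./]+)\.(?P<p>[^./]+)\.parquet"),
    "beacon_allele_freq": re.compile(
        r"allele_freq-chr(?P<c>[^./]+)\.(?P<p>[^./]+)\.parquet"),
    "samples": re.compile(r"samples\.parquet"),
    "vcf_header": re.compile(r"headers/(?P<checksum>[^/]+)\.vcf"),
}


def classify(name: str):
    """Return (category, captured placeholders), or None for an unknown name."""
    for kind, pattern in PATTERNS.items():
        match = pattern.fullmatch(name)
        if match:
            return kind, match.groupdict()
    return None
```

For instance, classify("samples.parquet") returns ("samples", {}), while a name that matches none of the patterns returns None.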
See GDI-EE Technical quality checks for more details on validation, and gdi-dataset-tool for a detailed technical description of the tool.
Next steps
Dataset Submission to GDI Estonia