Dataset Preparation

  • Version: 0.5
  • Status: DRAFT
  • Created: 2026-03-02
  • Updated: 2026-03-02
  • To be consulted:

Preparing dataset files

Data providers first need to generate the input files for the dataset tool. The GDI data preparation tool accepts:

  1. Genomic data as VCF files
  2. Dataset metadata as a YAML file (package.yaml) describing the dataset
  3. A samples file in CSV format (only when sharing individual-level data)

Dataset metadata must specify whether a dataset includes aggregated or individual-level data (aggregated: true or false).

Example VCF (aggregated)

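A minimal sketch of what an aggregated VCF might look like, with allele counts and frequencies carried in AC/AN/AF INFO fields; these particular fields and values are illustrative assumptions, not necessarily what the tool expects:

```vcf
##fileformat=VCFv4.2
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	10177	rs367896724	A	AC	.	PASS	AC=42;AN=1000;AF=0.042
```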

Metadata file (package.yaml)

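A sketch of a possible package.yaml. Apart from the aggregated flag described above, the field names here are assumptions for illustration; consult the gdi-dataset-tool documentation for the actual schema:

```yaml
# Hypothetical field names for illustration only
name: example-dataset
description: Aggregated allele frequencies from an example cohort
keywords:
  - allele-frequency
  - example
aggregated: true   # set to false for individual-level data
```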

Samples file (csv)

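A sketch of a possible samples.csv for an individual-level dataset; the column names (sample_id, sex, phenotype) are hypothetical and only illustrate the expected CSV shape:

```csv
sample_id,sex,phenotype
S001,female,affected
S002,male,unaffected
```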

Preparing encrypted dataset package

Once all the input files have been generated, gdi-dataset-tool can be run to build a dataset package that can be registered in the GDI, making the dataset visible and findable in the User Portal.



Diagram 2. gdi-dataset-tool validates, transforms and collects input data files - a metadata file, a VCF file, and a samples file (only for individual-level data) - into a dataset package, and encrypts the package for secure upload.


The packaging involves:

  • basic sanity checks of the content and of the consistency between the provided files (VCF, package.yaml, samples.csv)
  • conversion of the original files into Parquet files
  • validation of the metadata and generation of the manifest file (manifest.yaml)
  • packaging of the files into a TAR file
  • encryption of the TAR file with the public Crypt4GH key of the node and with the data provider's public key (so that the data provider can also decrypt the package)

The samples file should be omitted for aggregated (summary-level) datasets. All these steps are carried out via a single command-line tool:

python3 -m gdi_dataset_tool package package_dir/package.yaml

The output file is an encrypted dataset package (<dataset.internalId>.tar.c4gh) that can be deposited in the shared storage (MinIO S3) or uploaded to GDI Estonia via the local User Portal. The package includes:

  1. A dataset manifest (manifest.yaml): name, description, and keywords; file paths, sizes, and MD5 checksums
  2. Beacon data: variants-chr{c}.{p}.parquet or allele_freq-chr{c}.{p}.parquet
  3. Sample data (only for individual-level data): samples.parquet
  4. VCF file headers (for internal use in the local portal): headers/{checksum}.vcf
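Putting the list above together, a decrypted aggregated package might therefore be laid out as follows; the chromosome and part numbers are illustrative:

```
manifest.yaml
allele_freq-chr1.0.parquet
allele_freq-chr2.0.parquet
headers/{checksum}.vcf
```

An individual-level package would use variants-chr{c}.{p}.parquet files and additionally contain samples.parquet.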

See GDI-EE Technical quality checks for more details on validation, and the gdi-dataset-tool documentation for a detailed technical description of the tool.

Next steps

Dataset Submission to GDI Estonia