Data Quality Considerations
- Version: 0.5
- Status: DRAFT
- Created: 2026-01-29
- Updated: 2026-01-29
- To be consulted:
- Kersti Jääger (kersti.jaager@ut.ee)
Introduction
The Genomic Data Infrastructure (GDI) platform requires data submissions to align with the 1+ Million Genomes (1+MG) governance strategy to ensure harmonised access and cross-node utility. This means that all steps from data generation to data discovery need to follow certain technical quality rules (Diagram 1). Much of this work is still ongoing across Europe, on the governance side as well as the technical side. This data quality document therefore gathers only the currently known requirements for data preparation, as described by the 1+MG framework, the Genome of Europe (GoE) project and the GDI technical infrastructure, that are needed for compliance with the first use case. The requirements cover data models and standards, technical quality requirements and validation protocols established during the GDI project, using the GoE project as the pilot case.

Diagram 1. Data quality needs to be ensured at all levels. Data harmonisation leads to the publication of reusable datasets.
1+MG Framework
The 1+MG initiative has established the 1+MG Framework to help users navigate the guidelines and recommendations provided by the 1+MG Working Groups and to support the initiative across the different phases of its implementation, including the GDI project.
Sequencing and bioinformatic processing
Sequencing data generation is expected to follow certain quality rules, both for laboratory and for bioinformatic processing steps. 1+MG has released a document that describes the generally agreed quality metrics and best practices for whole genome sequencing (WGS), from library preparation techniques to variant calling methods: Quality metrics for sequencing.
The work started by the Beyond 1 Million Genomes (B1MG) initiative has been taken forward by the GoE Avant-Garde group to establish a smaller set of exclusion criteria for GoE data, including agreed threshold values. However, work on a minimal set of hard filters and on additional use-case-specific quality control (QC) is still ongoing.
Preliminary list of minimal quality guidelines for GoE data (a sketch for computing some of these metrics follows the list):
- Tissue source: Blood
- Basic phenotype: age, sex, country of birth
- Coverage (raw): 30x per sample
- Effective coverage (mean/median depth per position): ±1
- Evenness of coverage: as close to 1 as possible
- Percentage of genome with 20+ reads: 80%
- Alignment genome: GRCh38
- Variant calling pipeline: Dragen v4.4
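Some of these metrics can be computed directly from per-position depths. Below is a minimal, illustrative Python sketch (not the agreed 1+MG or GoE tooling), assuming a per-position depth array such as the output of `samtools depth -a`; the evenness proxy used here (median/mean ratio) is one simple convention, not necessarily the definition the project will adopt, and the thresholds in the comments mirror the list above.

```python
import numpy as np

def coverage_metrics(depths: np.ndarray) -> dict:
    """Summarise per-position read depths for one sample."""
    mean = float(depths.mean())
    median = float(np.median(depths))
    return {
        "mean_depth": mean,                                      # raw coverage; target ~30x
        "median_depth": median,
        "evenness_proxy": median / mean if mean else 0.0,        # close to 1 for even coverage
        "pct_at_least_20x": float((depths >= 20).mean() * 100),  # target >= 80%
    }

# Toy example: 1,000 positions with depths drawn around 30x.
rng = np.random.default_rng(0)
print(coverage_metrics(rng.poisson(30, size=1_000)))
```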
Data type and format
According to the GDI-wide vision, genomic data can be stored locally at each node/data-provider in any of the standard sequencing data formats (FASTQ, BAM, CRAM), while for dataset discovery, variant data in VCF (v4.2 or later) format is expected to be processed into a Beacon-readable format to make it findable via the User Portal. While BAM and CRAM formats act as comprehensive archives of raw genetic sequences and their alignment quality, VCF files function as efficient summaries that highlight only the specific mutations (variants) where an individual differs from a standard reference. The latter does not apply to genomic VCF (gVCF) files, which include allele information for all genomic positions.
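To illustrate the distinction, here is a minimal sketch that keeps only records describing an actual variant and skips gVCF reference blocks whose only ALT is the symbolic <NON_REF> allele. The use of pysam and the input file name are illustrative assumptions; any VCF parser would work similarly.

```python
import pysam

def variant_records(path: str):
    """Yield records with a concrete alternate allele, skipping
    gVCF reference blocks whose only ALT is <NON_REF>."""
    with pysam.VariantFile(path) as vcf:
        for rec in vcf:
            if any(alt != "<NON_REF>" for alt in (rec.alts or ())):
                yield rec

for rec in variant_records("sample.g.vcf.gz"):  # hypothetical input file
    print(rec.chrom, rec.pos, rec.ref, rec.alts)
```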
Data model and ontologies
The 1+MG harmonised minimal data (HMD) model has been developed to support harmonised data use via GDI. The HMD model includes standard element classes, vocabularies and ontologies for data collection; covers core phenotypic data, including sequencing and analyses; and is applicable to specific use cases, including population genetics, cancer and infectious diseases. The supported ontologies and standards include, among others, NCIT, PROV-O, ICD-11, SNOMED and HL7, depending on the domain/class, as specified in the model data sheets. The HMD model has been mapped to the Beacon data model by GDI project members to support its implementation in GDI and to allow harmonised data search via the User Portal. The ontologies used in HMD have been applied to GoE project data to allow integration with the GDI.
Sanity checks
There is a common understanding among domain experts that, when assessing the compatibility of WGS data with the 1+MG quality standards, sanity is preferred over strict quality, and consistency over perfection. Briefly, sanity checks may be designed to control sequencing data and metadata for defects and errors such as the following (a small illustrative sketch follows the list):
- incomplete mapping
- file duplication within the same dataset
- mismatching numbers of objects
- mismatching data formats
- corrupted files
- mismatching biological values between sequencing files and their metadata
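A minimal sketch of two of these checks, file duplication within a dataset (detected via checksums) and mismatching object counts; the file layout and metadata structure are assumptions for illustration.

```python
import hashlib
from pathlib import Path

def md5sum(path: Path, chunk: int = 2**20) -> str:
    """Compute the MD5 checksum of a file in streaming fashion."""
    h = hashlib.md5()
    with path.open("rb") as fh:
        while block := fh.read(chunk):
            h.update(block)
    return h.hexdigest()

def find_duplicates(files: list[Path]) -> dict[str, list[Path]]:
    """Group dataset files by checksum; any group with >1 entry is a duplicate."""
    groups: dict[str, list[Path]] = {}
    for f in files:
        groups.setdefault(md5sum(f), []).append(f)
    return {h: fs for h, fs in groups.items() if len(fs) > 1}

def check_object_count(n_files: int, n_declared: int) -> None:
    """Flag a mismatch between objects found and objects declared in metadata."""
    if n_files != n_declared:
        raise ValueError(f"metadata declares {n_declared} objects, found {n_files}")
```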
GDI-EE technical quality checks
The dataset submission service (local portal) at the Estonian node (GDI-EE) assumes that variant files are pre-processed by the data-provider into searchable Beacon files (.parquet) and subsequently ingested into the GDI system. Dataset information should be provided in a metadata file (.yaml) and phenotypic information about samples/subjects in a separate tabular file (.csv). Data-providers can use a GDI-developed tool to perform the required pre-processing, which results in a single encrypted data package per dataset that can be securely transferred to the GDI system either via the user interface or via shared S3 storage. Both the dataset preparation tool and the dataset ingestion pipeline perform basic validation to ensure that the submitted data is compatible with the GDI software system requirements. Crucially, the GDI system does not perform biological validation; the responsibility for the biology behind the data and for the use of appropriate laboratory techniques for generating the data lies with the data-provider.
Validation
Validation of dataset metadata and data files includes:
- Checks for the existence and properties of the files referred to in the metadata
- Checks that at least one VCF file is referenced in the metadata
- Either verification or computation of the MD5 checksums
- Checks during ingestion (manifest.yaml): that each dataset file location is unique, that file size and MD5 checksum are present (their accuracy is not validated), and that dataset.internalId is unique per data-provider (see the sketch after this list)
- Validation during ingestion of data files (variants-chr{N}.{C}.parquet, allele-freq-chr{N}.{C}.parquet) for their columns and content
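The manifest-level checks could look roughly like the following sketch. The manifest.yaml layout assumed here (a `files` list and a `dataset.internalId` key) is a guess for illustration; the real GDI-EE schema may differ.

```python
import yaml

REQUIRED_FILE_KEYS = {"location", "size", "md5"}  # presence only; accuracy not validated

def validate_manifest(path: str, known_internal_ids: set[str]) -> None:
    with open(path) as fh:
        manifest = yaml.safe_load(fh)
    files = manifest["files"]                      # hypothetical manifest layout
    locations = [f["location"] for f in files]
    if len(locations) != len(set(locations)):
        raise ValueError("dataset file locations must be unique")
    for f in files:
        if missing := REQUIRED_FILE_KEYS - f.keys():
            raise ValueError(f"{f.get('location')}: missing {sorted(missing)}")
    if not any(loc.endswith((".vcf", ".vcf.gz")) for loc in locations):
        raise ValueError("metadata must reference at least one VCF")
    if manifest["dataset"]["internalId"] in known_internal_ids:
        raise ValueError("dataset.internalId must be unique per data-provider")
```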
Validation of subject-level metadata includes (a sketch of these checks follows the list):
- Checks that the sample codes match between the VCF and the samples file
- Checks for the required headers in the samples file: SAMPLE, INDIVIDUAL, SOURCE, DATE_COLLECTED, AGE_COLLECTED, SEX, ETHNICITY, GEOGRAPHIC_ORIGIN
- Checks for the content of the samples file: no empty values in the required columns; no spaces in SAMPLE or INDIVIDUAL; SAMPLE values must be unique; DATE_COLLECTED and AGE_COLLECTED must conform to the ISO 8601 standard
- Validation of the samples file during ingestion (samples.parquet) for its columns and content
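A sketch of the samples-file checks using only the Python standard library. The CSV dialect and the exact ISO 8601 profiles (a calendar date for DATE_COLLECTED, a duration such as 'P45Y' for AGE_COLLECTED) are assumptions.

```python
import csv
import re
from datetime import date

REQUIRED = ["SAMPLE", "INDIVIDUAL", "SOURCE", "DATE_COLLECTED",
            "AGE_COLLECTED", "SEX", "ETHNICITY", "GEOGRAPHIC_ORIGIN"]
AGE_RE = re.compile(r"P(?=\d)(\d+Y)?(\d+M)?(\d+D)?")  # ISO 8601 duration (assumed profile)

def validate_samples(path: str) -> None:
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        if missing := set(REQUIRED) - set(reader.fieldnames or []):
            raise ValueError(f"missing required headers: {sorted(missing)}")
        seen: set[str] = set()
        for lineno, row in enumerate(reader, start=2):
            for col in REQUIRED:
                if not row[col].strip():
                    raise ValueError(f"line {lineno}: empty value in {col}")
            for col in ("SAMPLE", "INDIVIDUAL"):
                if " " in row[col]:
                    raise ValueError(f"line {lineno}: spaces not allowed in {col}")
            if row["SAMPLE"] in seen:
                raise ValueError(f"line {lineno}: duplicate SAMPLE {row['SAMPLE']}")
            seen.add(row["SAMPLE"])
            date.fromisoformat(row["DATE_COLLECTED"])   # raises on non-ISO dates
            if not AGE_RE.fullmatch(row["AGE_COLLECTED"]):
                raise ValueError(f"line {lineno}: AGE_COLLECTED not an ISO 8601 duration")
```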
Versioning
The GDI-EE software system does not provide an internal versioning function at dataset level: every change in dataset content (change, removal or addition of samples, allele frequencies, phenotypic or ELSI (Ethical, Legal, Societal Issues) metadata, etc.) requires that the dataset is registered anew. It is therefore required that the data-provider maintains its own versioning system outside GDI to track changes to datasets submitted to GDI. However, a data-provider can reference previous instances (versions) of a dataset in its metadata; these references are included in the dataset metadata published in the catalogue. Each published dataset is given a persistent identifier (e.g. ‘GOE-CC-ORG-NNN…’ or ‘GDI-CC-ORG-NNN…’, where CC = two-letter country code, ORG = institute abbreviation, NNN… = digits) that is visible in the User Portal.
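The quoted identifier pattern can be checked mechanically; a minimal regex sketch follows. The exact ORG alphabet and digit count are not specified above, so they are assumptions here, and the example identifiers are hypothetical.

```python
import re

# GOE/GDI - two-letter country code - institute abbreviation - digits
PID_RE = re.compile(r"^(GOE|GDI)-[A-Z]{2}-[A-Z0-9]+-\d+$")

assert PID_RE.match("GDI-EE-UT-001")        # hypothetical identifier
assert not PID_RE.match("GDI-EST-UT-001")   # country code must be two letters
```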
Genome of Europe (GoE)
Data Stewardship and Quality
The data management working group of the GoE project has defined specific focus areas that will engage experts to establish metadata standards and quality checks, define the required ELSI checks, provide data-flow guidelines for data stewards, and define universal terminology to ensure common understanding across projects. The priority of these activities is to ensure that GoE data stewards can make data available within the GDI. This work will continue until March 2028 (the end of the GoE project), when we expect the best practices to be finalised and extended across 1+MG.
Pilot Use case
Due to legal constraints and the delayed implementation of central governance for secondary use of genomic data during the GDI project, only primary use of GoE data could be supported by GDI. The pilot use case builds on a GoE reference database of individual variants whose allele frequencies can be looked up in the User Portal via the Allele-Frequency Browser. All data processing, including allele-frequency calculations, needs to be performed prior to submission to GDI.
Allele-Frequency dataset
The GDI pilot use case proposal suggests sharing a single summary-level GoE dataset per node through an aggregated Beacon instance. Aggregated cohorts or subpopulations are formed by combining phenotypic attributes defined by the GoE project. Variant-level statistics, including allele frequencies, are then pre-calculated for these subpopulations and written to the VCF file. Both the cohort definition and its coding scheme are still being discussed and agreed upon. Once this is done, it can be determined how the User Portal displays the cohort options and formulates a query that is consistent across nodes and the Beacon Network.
Allele-Frequency calculator
Allele frequency (AF) refers to the rate at which a specific gene variant, or allele, occurs within a given population. The accuracy of the final AFs depends on the integrity of the input data and adherence to quality control principles. These principles have been gathered into the AF calculator (AF-pipeline) shared across the GDI and GoE projects. The pipeline first filters the input VCF file to remove low-quality data and related individuals, and then performs a stratified calculation of AFs.
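In its simplest form, for each stratum AF = AC / AN, where AC is the alternate-allele count and AN the number of called alleles. A toy sketch of the stratified step at a single variant site follows; the genotype encoding and stratum labels are illustrative assumptions, not the AF-pipeline's actual interface, and it assumes clean diploid genotypes after filtering.

```python
from collections import defaultdict

def stratified_af(genotypes: dict[str, tuple[int, int]],
                  strata: dict[str, str]) -> dict[str, float]:
    """genotypes: sample -> diploid genotype as 0/1 allele codes (no missing calls).
    strata: sample -> subpopulation label, e.g. 'EE_F'."""
    ac: dict[str, int] = defaultdict(int)
    an: dict[str, int] = defaultdict(int)
    for sample, gt in genotypes.items():
        label = strata[sample]
        ac[label] += sum(gt)   # alternate-allele count
        an[label] += len(gt)   # called-allele count
    return {label: ac[label] / an[label] for label in an}

print(stratified_af({"s1": (0, 1), "s2": (1, 1), "s3": (0, 0)},
                    {"s1": "EE_F", "s2": "EE_F", "s3": "EE_M"}))
# -> {'EE_F': 0.75, 'EE_M': 0.0}
```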
Allele-Frequency data file
The AF data file is a VCF (v4.2 or later) with extended ‘INFO’ meta-information that includes the pre-calculated variant-level allele frequencies for the GoE-agreed subpopulations. According to the current proposal, the AF coding scheme is AF_[country of birth: 2-letter ISO 3166 code]_[sex: HL7 ‘BirthSex’].
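A sketch of how the proposed INFO field identifiers and the corresponding VCF meta-information lines could be generated; the country and sex codes below are illustrative examples, not the agreed GoE list.

```python
# Build AF_<country>_<sex> INFO IDs per the proposed coding scheme.
COUNTRIES = ["EE", "FI", "DE"]    # ISO 3166-1 alpha-2 (examples)
BIRTH_SEX = ["F", "M", "UNK"]     # HL7 'BirthSex' value set codes

for country in COUNTRIES:
    for sex in BIRTH_SEX:
        info_id = f"AF_{country}_{sex}"
        # VCF v4.2+ meta-information line; Number=A means one value per ALT allele.
        print(f'##INFO=<ID={info_id},Number=A,Type=Float,'
              f'Description="Allele frequency, {country}/{sex} subpopulation">')
```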
Allele-Frequency dataset metadata
The GDI dataset metadata structure is based on the HealthDCAT-AP standard (revision 5), combined with a set of GoE requirements. The supported metadata domains/classes are described in detail in the GDI GitHub repository. Currently, ensuring metadata quality is a decentralised process in which individual nodes validate the structural integrity of their metadata using specific SHACL shapes within their own FAIR Data Points.
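As an illustration of such a structural check, here is a minimal sketch using rdflib and pyshacl; the library choice and the file names are assumptions, not the tooling mandated by the nodes.

```python
from rdflib import Graph
from pyshacl import validate

data = Graph().parse("dataset-metadata.ttl")   # hypothetical HealthDCAT-AP metadata record
shapes = Graph().parse("gdi-shapes.ttl")       # hypothetical node-specific SHACL shapes

# Returns (conforms, report graph, human-readable report text).
conforms, _report_graph, report_text = validate(data, shacl_graph=shapes)
print("OK" if conforms else report_text)
```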