Overview of GDI data operations
- Version: 0.5
- Status: DRAFT
- Created: 2026-01-20
- Updated: 2026-01-20
- To be consulted:
- Kersti Jääger (kersti.jaager@ut.ee)
- Erik Jaaniso (erik.jaaniso@ut.ee)
Introduction
As part of the European GDI project, the Estonian GDI (GDI-EE) team at the Institute of Computer Science, University of Tartu, has developed an IT infrastructure that exposes national genomic dataset information to the central European User Portal for harmonised search of genomic data. The team has developed data pipelines that prepare and ingest data and its associated metadata in a structured format and convert them into accessible yet protected data tables within the secure storage system. A user-friendly graphical web interface (GUI) was designed so that data-providers can easily submit and manage their datasets. Dataset management includes the possibility to change dataset status and thereby control dataset visibility in the User Portal. The GUI also lets data-provider organisations control their accounts and associated permissions, a role management feature that is not detailed in this document. This document describes the data operations in the GDI-EE system for the node-holder organisation.
This overview describes the processes that cover the GDI-EE dataset lifecycle, from data ingestion to data access via the User Portal. Dataset transfer to SAPU, a sensitive data analysis platform at UT HPC, and subsequent analyses in SAPU are listed only briefly for completeness, because these processes will be managed entirely by UT HPC and will involve contractual arrangements between the involved parties. Together with the initial risk mapping (Risk Mapping), this information is intended to help the node-holder organisation understand the logic behind the GDI-EE data flow, identify potential data privacy risks, and ensure compliance with data protection regulations and organisational policies.
Dataset processing lifecycle
The GDI-EE infrastructure architecture processes datasets along sequential steps of the dataset lifecycle. Together, the data pipelines and the preparation and management steps in GDI-EE are designed to support:
- The use of Variant Call Format (VCF) as input data, which is converted to a variant discovery file that can be queried via the User Portal.
- The ingestion of metadata in YAML format, structured according to the HealthDCAT-AP standard, to enable dataset discovery via the User Portal.
- Dataset registration and state management by data-providers in the local node portal.
For a structured description of the data operations, we divided them into six core steps: 1) Dataset package preparation, 2) Dataset ingestion, 3) Dataset management, 4) Data search and access, 5) Data transfer to SAPU, 6) Data analysis in SAPU (Diagram 1), each detailed below. At the time of handing over the GDI solution to the future node-holder (2026), however, only operations 1-3 have been fully specified by GDI-EE; operation 4 depends heavily on User Portal developments that are still underway; and operations 5-6 will be specified by the node-holder and the involved parties after the GDI project.

Diagram 1. The dataset processing lifecycle of GDI-EE starts with the data-provider creating a dataset package (1), which is ingested and registered in the GDI system (2). Dataset management (3) includes the operations that data-providers can perform on registered datasets. Dataset properties are exposed to the User Portal, where a researcher can find data and request access to it (4). Data transfer to and analysis in SAPU (5, 6) remain to be coordinated by the data-provider and the node-holder organisation to support secure environment setup and the obtaining of research results.
Dataset package preparation (1.0)
Genomic variant data and the associated metadata for the dataset and, optionally, its samples need to be packaged by the data-provider before they can be entered into the GDI system. Packaging involves basic sanity checks of the content and of the match between the provided files (VCF, package.yaml, samples.csv), conversion of the original files into Parquet files, and validation of the metadata into a manifest file (manifest.yaml). The samples file should be omitted for summary-level datasets. Table 1 lists the differences between summary- and individual-level datasets in terms of content and sensitivity. The resulting output is an encrypted dataset package file (.tar.c4gh) that the data-provider can deposit in the Shared storage (MinIO S3). See gdi-dataset-tool and compiling-a-dataset-package for technical details.
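The file-set part of these sanity checks can be pictured with a minimal Python sketch. It is not the logic of gdi-dataset-tool: the function name, the accepted VCF extensions and the rule that individual-level packages must include samples.csv are assumptions made for illustration, based on the file names and Table 1 in this overview.

```python
from pathlib import Path


def check_package_inputs(paths, individual_level):
    """Return a list of problems found in the input file set (empty if none).

    Hypothetical helper: only checks which files are present, not their content.
    """
    names = [Path(p).name for p in paths]
    problems = []

    # Assumption: variant data arrives as plain or compressed VCF.
    if not any(n.endswith((".vcf", ".vcf.gz")) for n in names):
        problems.append("no VCF file found")

    if "package.yaml" not in names:
        problems.append("package.yaml is missing")

    has_samples = "samples.csv" in names
    if individual_level and not has_samples:
        problems.append("samples.csv is required for individual-level datasets")
    if not individual_level and has_samples:
        problems.append("samples.csv should be omitted for summary-level datasets")

    return problems
```

A summary-level package with one VCF and package.yaml yields an empty list, whereas adding samples.csv to the same package is reported as a problem.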
Dataset ingestion (2.0)
A dataset package can either be uploaded in the local portal or placed in the data-provider's S3 storage for ingestion into GDI-EE. The ingestion pipeline performs several processing steps, including a) adding a new database record for an unregistered dataset, b) validating the file contents of the package, c) updating database records (see Database tables in Table 2), d) encrypting Parquet files using keys from Vault, and e) saving files in node S3 storage. Upon successful validation, the package files are removed from the data-provider's storage and the dataset status is updated to REGISTERED in the local portal. The requirements that the package files must meet to pass validation are described here: manifest file (dataset-manifest), variant files (beacon-variants, beacon-allele-frequencies), samples file (dataset-samples). At this stage, dataset details are only accessible to the data-provider.
Dataset management (3.0)
Each dataset registered in GDI-EE can be tracked and managed by the user via the GUI of the local portal. To control the status of a dataset, the user can move it through the REGISTERED -> VISIBLE -> HIDDEN -> ARCHIVED statuses of the dataset lifecycle (datasets). To publish a dataset in the User Portal, its status must be set to VISIBLE. The dataset is then listed in the local FAIR Data Point catalogue, from which it is retrieved by the User Portal (fair-dp), and its indexed data files are exposed to Beacon search operations. A VISIBLE dataset and its files are given persistent GDI-EE identifiers. Local dataset management also allows a dataset to be made invisible to the User Portal (HIDDEN) or moved to the ARCHIVED state, which removes the Beacon files from storage (keeping only the dataset metadata).
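The status lifecycle can be read as a small state machine. The Python sketch below is illustrative only: the status names come from this overview, but the transition table is an assumption (in particular, whether a HIDDEN dataset can become VISIBLE again, and whether ARCHIVED is final, are not stated here), and the names `ALLOWED`, `change_status` and `is_published` are hypothetical.

```python
from enum import Enum


class Status(Enum):
    UNREGISTERED = "UNREGISTERED"
    REGISTERED = "REGISTERED"
    VISIBLE = "VISIBLE"
    HIDDEN = "HIDDEN"
    ARCHIVED = "ARCHIVED"


# Assumed transition table; the real rules are defined by the local portal.
ALLOWED = {
    Status.UNREGISTERED: {Status.REGISTERED},   # set by the ingestion pipeline
    Status.REGISTERED: {Status.VISIBLE},
    Status.VISIBLE: {Status.HIDDEN},
    Status.HIDDEN: {Status.VISIBLE, Status.ARCHIVED},  # HIDDEN -> VISIBLE is an assumption
    Status.ARCHIVED: set(),                     # assumed to be a final state
}


def change_status(current, new):
    """Return the new status, or raise ValueError if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move dataset from {current.value} to {new.value}")
    return new


def is_published(status):
    """Only VISIBLE datasets are listed in the catalogue and exposed to Beacon search."""
    return status is Status.VISIBLE
```

Keeping the transitions in a single table makes it easy to adjust the sketch once the actual rules of the local portal are confirmed.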
Table 1. Differences between summary- and individual-level data discovery services deployed by GDI nodes.
| | Summary-level data | Individual-level data |
|---|---|---|
| EU User Portal | Allele Frequency Browser | Data Catalogue |
| Beacon service | Allele Frequency Beacon | Sensitive (Regular) Beacon |
| Beacon instance | Aggregated Beacon | Subject-level Beacon |
| Access level | Public | Non-public* |
| Data sensitivity | Non-sensitive | Sensitive |
| Samples file in dataset package | No | Yes |
| Use case type | Genome of Europe variant data including pre-computed cohort-based allele frequencies (real) | Variant data supplemented with samples' phenotypic data (synthetic) |
| Data structure | Specific for GoE project | Aligns with 1+MG HMD |
| Cohort | Population average | Patients |
| Beacon response | Record | Count |
| Authentication | Optional | Required |
| GDI Maximal Attainable Product (October 2026) | Required | Optional |
| Readiness | Production | Proof-of-concept |
*Implemented as Public at the moment using synthetic data (April 2026)
Data search and access (4.0)
EU-level dataset search is described by two distinct use cases, each supported by its own Beacon: 1) the Aggregated Beacon, via the User Portal Allele Frequency Browser, for querying Genome of Europe allele-frequency data; 2) the Subject-level Beacon, via the User Portal Data Catalogue, for querying synthetic variant data. The requests are similar for both Beacons, whereas the minimal Beacon API response is either a record (1) or a count (2). Search filters for datasets are implemented according to the HealthDCAT-AP v5 specification, while phenotypic filters are defined by the Genome of Europe pilot use case, including pre-defined cohorts (metadata-of-genomic-datasets). Data search is possible for datasets that are exposed in the data catalogue.
Because the Genome EDIC has not yet been finalised, no central access governance procedures have been defined or implemented. The researcher's path from finding a dataset of interest, through filing an access application, to receiving permission for data analysis is therefore still handled case by case via manual processes (see rems–the-permission-service for explanation). For the same reason, no authentication (using Life-Science AAI) has been deployed in the User Portal to enable controlled access at this stage of the GDI project. All searchable data is classified as PUBLIC. Although the Subject-level Beacon is by nature designed for sensitive data (Table 1), the current implementation does not support controlled-access Beacon requests.
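To illustrate how two requests can be alike while their responses differ in granularity, the Python sketch below builds a request body in the general shape of a Beacon v2 variant query. The function name, the example coordinates and the `GRCh38` assembly are assumptions for illustration; the parameters actually accepted by the GDI-EE Beacons are not specified in this overview.

```python
def build_variant_query(reference_name, start, ref, alt, granularity):
    """Build a Beacon v2-style request body for a single-variant query.

    Hypothetical helper; only the requested granularity differs between
    the two use cases sketched here.
    """
    if granularity not in ("boolean", "count", "record"):
        raise ValueError(f"unsupported granularity: {granularity}")
    return {
        "meta": {"apiVersion": "2.0"},
        "query": {
            "requestParameters": {
                "assemblyId": "GRCh38",  # assumed reference assembly
                "referenceName": reference_name,
                "start": [start],
                "referenceBases": ref,
                "alternateBases": alt,
            },
            "filters": [],
            "requestedGranularity": granularity,
        },
    }


# Use case 1: Aggregated Beacon, minimal response is a record.
aggregated = build_variant_query("1", 1000, "A", "G", "record")
# Use case 2: Subject-level Beacon, minimal response is a count.
subject_level = build_variant_query("1", 1000, "A", "G", "count")
```

In this sketch the two request bodies are identical except for the `requestedGranularity` field.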
Data transfer to SAPU (5.0)
We propose, however, that once a data-user has been granted access to genomic data, a cohort description file (Data cohort manifest) is generated that includes only the data files specified in the access permission document. These files (Data cohort files) can then be securely transferred from Shared storage to the SAPU data analysis environment. SAPU preparation and management processes are governed by the service-provider's rules.
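The proposed rule, that the cohort manifest contains only the files named in the access permission, could look roughly like the Python sketch below. Since this step has not been specified, everything here is hypothetical: the function name, the `id` and `path` fields and the shape of the manifest are illustrative assumptions.

```python
def build_cohort_manifest(dataset_files, permitted_file_ids):
    """Select only the files that the access permission explicitly lists.

    dataset_files: list of dicts with at least an "id" key (assumed structure).
    permitted_file_ids: file identifiers taken from the permission document.
    """
    available = {f["id"]: f for f in dataset_files}

    # Refuse to build a manifest if the permission names files we do not hold.
    unknown = sorted(set(permitted_file_ids) - set(available))
    if unknown:
        raise ValueError(f"permission refers to unknown files: {unknown}")

    selected = [available[file_id] for file_id in sorted(set(permitted_file_ids))]
    return {"files": selected}
```

Files that are part of the dataset but absent from the permission never enter the manifest, so they would not be transferred to SAPU.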
Data analysis in SAPU (6.0)
Again, we propose that data analysis in GDI-EE takes place in SAPU at the University of Tartu. No original data file can be downloaded from SAPU. Analysis results can be exported only after a permit has been received from a dedicated authoritative organisation, as specified in the Genome EDIC.
Data flow diagram
To visualise the connections between the described processes and sub-processes and the data stores they interact with, a data flow diagram has been used (Diagram 2). The main benefit of a data flow diagram is that it shows processes that “do” something with the data and stores that “keep” data in a certain file or location. This approach helps to show the operations performed on data on top of the GDI-EE architecture.

Diagram 2. Data flow diagram showing data stores (yellow) associated with the relevant processes or sub-processes (purple), which are connected to two external entities (bordeaux): Shared storage (data-provider interaction with the system) and EU User Portal (data-user interaction with the system). The (S) for Samples file validation denotes an optional process that applies only to subject-level data used with the Sensitive Beacon. Dashed lines indicate processes that have not been fully defined by the GDI project.
Data stores associated with data operations
Data in the node system is imported, processed and saved in diverse formats for use in specific scenarios. The following table provides a brief technical summary of the details behind data stores associated with the processes in Diagram 2.
Table 2. Technical details of the Data stores in the GDI-EE system visualised in Diagram 2.
| Data store | Data format | Contains | Encrypted | Associated process | Description |
|---|---|---|---|---|---|
| Dataset package | .tar.c4gh | Metadata, variant data, samples data (optional) | crypt4gh | Dataset package preparation | Prepared by data-provider and added in the GDI project bucket |
| Variant files | parquet | validated genomic variant data | crypt4gh | Dataset ingestion; Variant files validation; Data search | Uploaded to node S3 upon successful validation; a pre-requisite for dataset registration |
| Dataset manifest file | yaml | validated dataset metadata | crypt4gh | Dataset ingestion; dataset verification | Uploaded to node S3 upon successful validation; a pre-requisite for dataset registration |
| Samples file (optional) | parquet | validated samples phenotypic data | crypt4gh | Dataset ingestion; samples file validation (optional) | Uploaded to node S3 upon successful validation; a pre-requisite only for subject-level dataset registration |
| Database tables | database record | dataset title, description, internal ID, catalog, dataset files; samples count, individuals count, min/max age; record count | - | Dataset registration | Upon successful validation, Dataset is updated from UNREGISTERED to REGISTERED |
| Dataset status field | database record | dataset status | - | Dataset management; Catalogue data preparation; Data search | Possible statuses are: UNREGISTERED, REGISTERED, VISIBLE, HIDDEN, ARCHIVED; VISIBLE datasets' properties will be published in a data catalogue at FAIR Data Point service |
| Access application | text document | research project proposal | - | Data access management | Managed according to national and institutional agreements; Remains to be specified at central level across Europe |
| Data cohort manifest | yaml | research cohort metadata | crypt4gh | Data transfer to SAPU | Prepared for dataset that has been granted access and permission for analysis in SAPU |
| Data cohort files | vcf, fastq, cram, bam, other | research cohort genomic data | crypt4gh | Data transfer to SAPU | Original genomic data files in data-provider storage that have a permission to be transferred to SAPU |
| Analysis results | results file | analysed data (image, table, statistics) | - | Data analysis in SAPU | Generated by the researcher during data analysis; stay in SAPU until permission for export is granted |
Data protection aspects
The data-protection-by-design approach of the GDI-EE system includes the following main features: 1) use of pseudonymised data, with the de-pseudonymisation keys stored outside the GDI system; 2) transfer of data in an encrypted format, with decryption keys kept separately by secret management software; 3) file checksum computation to protect and verify the integrity of genomic data within the system.
The component of the GDI architecture most vulnerable to data leakage and privacy risk is the Beacon data, which must be exposed to requests via the User Portal. However, this privacy risk does not concern queries made via the Allele Frequency Browser, which is connected to the Aggregated Beacon and delivers only summary-level statistics (no individual data is registered in the system). In fact, the first implementation of GDI is limited to queries of aggregated Genome of Europe data only and restricts access to personal data to the national SPE, or SAPU in our case. This is due to legal aspects that await decisions by the European Commission.
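The checksum feature (3) can be illustrated with a short Python sketch based on the standard `hashlib` module. The choice of SHA-256 and the function names are assumptions; this overview does not state which algorithm GDI-EE uses.

```python
import hashlib


def file_checksum(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks.

    Chunked reading keeps memory use flat even for large genomic files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected):
    """Return True if the file still matches the checksum recorded earlier."""
    return file_checksum(path) == expected.lower()
```

A checksum recorded when a file enters the system can later be compared with a freshly computed one; any mismatch indicates that the file was altered or corrupted in transit or storage.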