Introduction And Goals

The European Genomic Data Infrastructure (GDI) project (2022–2026) is connecting national GDI nodes to a central User Portal where researchers can discover and filter datasets of fully sequenced human genomes across all GDI nodes. Researchers can also apply for data access and, if approved, analyse the data in the secure processing environments (SPEs) of the GDI nodes. The research follows the “bring your code to the data” principle, as sensitive data is not permitted to leave a node.

The General GDI Process

As part of the general solution, the GDI will establish a central node providing the User Portal and Beacon Network services, and will designate the Life Science authentication and authorization infrastructure (AAI) as the main user-login service. The central service also periodically harvests data catalogues from the GDI nodes to build a local database of available datasets.

The User Portal provides the main user interface, which exposes the available datasets and enables users to perform genome-level dataset filtering. Researchers can thus discover potentially useful datasets together with the count of matching individuals. Genomic queries are served by a Beacon Network instance in which the Beacon APIs of all GDI nodes are registered. The User Portal matches the dataset references in Beacon responses to those in the harvested data catalogue, and presents a filtered list of datasets to the user.
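
The matching step can be illustrated with a minimal Python sketch. It assumes that the Beacon responses have already been reduced to a mapping from dataset reference to the count of matching individuals. The `CatalogueEntry` class, its fields, and the `match_beacon_counts` function are hypothetical names used for illustration only; they are not part of the Beacon specification or of the User Portal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueEntry:
    """One harvested catalogue record (hypothetical shape)."""
    dataset_id: str
    title: str
    node: str


def match_beacon_counts(catalogue, beacon_counts):
    """Return (entry, count) pairs for catalogue datasets that have matches.

    beacon_counts maps a dataset reference to the number of matching
    individuals. References unknown to the catalogue and datasets with
    zero matches are dropped.
    """
    by_id = {entry.dataset_id: entry for entry in catalogue}
    matched = []
    for dataset_id, count in beacon_counts.items():
        entry = by_id.get(dataset_id)
        if entry is not None and count > 0:
            matched.append((entry, count))
    # Show the datasets with the most matching individuals first.
    matched.sort(key=lambda pair: pair[1], reverse=True)
    return matched
```

Dropping references that are missing from the catalogue is a design assumption of this sketch: a dataset that a Beacon reports but that has not yet been harvested cannot be presented with its metadata.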

Researchers can create a data-access request directly within the User Portal by selecting the desired (filtered) datasets. This step creates a research cohort consisting of the selected datasets and the Beacon query (taken from the filter) that defines the matching samples within the datasets. The researcher should also specify the desired file formats and data ranges (chromosomes, positions, reads) that they wish to analyse. The User Portal registers the research cohort (essentially a dataset) and assigns it a unique ID.
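
A research cohort, as described here, could be modelled roughly as follows. This is a sketch under stated assumptions: the class and field names are hypothetical, the Beacon query is kept as an opaque dictionary, and a random UUID stands in for whatever ID scheme the User Portal actually uses.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataRange:
    """A genomic region of interest (hypothetical shape)."""
    chromosome: str
    start: int
    end: int


@dataclass(frozen=True)
class ResearchCohort:
    """Selected datasets plus the filter that defines matching samples."""
    dataset_ids: tuple
    beacon_query: dict
    file_formats: tuple
    data_ranges: tuple
    # Assumption: a random UUID is an acceptable unique cohort ID.
    cohort_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def register_cohort(dataset_ids, beacon_query, file_formats, data_ranges=()):
    """Validate the selection and return a cohort with a fresh ID."""
    if not dataset_ids:
        raise ValueError("a research cohort needs at least one dataset")
    if not file_formats:
        raise ValueError("at least one file format must be requested")
    return ResearchCohort(
        dataset_ids=tuple(dataset_ids),
        beacon_query=dict(beacon_query),
        file_formats=tuple(file_formats),
        data_ranges=tuple(data_ranges),
    )
```

Keeping the cohort immutable reflects the idea that a registered cohort is a fixed reference for the subsequent data-access application.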

To actually obtain access to the research cohort, the researcher (the principal investigator, PI) also needs to create and submit a data-access application in the User Portal. The application must contain information about the targeted research cohort, a description of the planned research, and the people involved in the research (as they also need access). The processing of an application is too lengthy to cover here; eventually, however, the application receives either the approved or the denied status.

Once an application is approved, the GDI nodes (where the datasets are located) can locally prepare the research data, which covers the desired file formats, data ranges, and individuals (those who have not prohibited or opted out of the use of their data). The GDI nodes may prepare the data in advance (caching) or on demand at execution time. However, it is important that researchers can at least view the list of available files within their research cohort, because they need to reference the files (paths) in their data-analysis scripts.
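
The selection of individuals and the file listing could look like the following sketch. The function names and the `/data/<cohort_id>/` path layout are assumptions made for illustration; the actual layout inside an SPE is node-specific.

```python
def eligible_individuals(matching, excluded):
    """Matching individuals minus those who prohibited or opted out."""
    return sorted(set(matching) - set(excluded))


def list_cohort_files(cohort_id, individuals, file_formats):
    """List the file paths a researcher could reference in scripts.

    Assumption: one file per individual and format, stored under
    /data/<cohort_id>/. The files need not exist yet, which allows the
    node to prepare them on demand.
    """
    return [
        f"/data/{cohort_id}/{individual}.{fmt}"
        for individual in individuals
        for fmt in file_formats
    ]
```

Because the listing is computed from metadata alone, it can be shown to researchers before any data has actually been prepared.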

The exact secure processing environment (SPE) description varies by node, as the requirements for SPEs are quite general: secure two-factor authentication, no network connection within the SPE (during research), auditable tracing of user activity, and the possibility to request a data export that is reviewed by the data provider. At this stage, however, it seems that the User Portal is not involved in the interaction between the user and the allocated SPE.

Before a research project finally terminates, its members may wish to download the results. This process may require approval from the local data providers to make sure that sensitive data is not exposed. The GDI nodes may preserve the files of terminated research projects; the metadata (research cohort, researcher’s data-access application), however, should be kept for historical purposes.

The Local GDI Process

From the perspective of local GDI nodes, the main actors are the local data providers, who register their datasets and maintain the metadata. They also take part in the data-access process: they are notified of applications targeting their data and can veto non-qualifying applications. Before data-analysis results are exported, data providers have the option to review them to avoid leaking sensitive data.

A GDI node, as a platform, also needs some management (help-desk) to grant system access to the data providers and define their permissions. The data providers (clinical or research organisations) can also manage their own members and permissions.

All data and metadata of a data provider are attributed to the data provider’s organisation, not to a particular user. The genomic data (files) remain in the data provider’s storage. The GDI node system is granted permission to access the storage and to read and share the files according to the approved data-access flow. The GDI node system itself only collects dataset attributes for exposing the national data catalogue to the User Portal and, in addition, genomic data properties for responding to Beacon Network requests.

Genomic data (full human genomes) is defined through one or more datasets that can be updated and extended over time. However, the datasets must be fixed (frozen) periodically; these snapshots are defined as versions of the dataset. The stored personal metadata is kept minimal: a pseudonym for the individual, the bio-sample ID, the person’s age at sample collection, sex, and ethnicity. In the future, it should be possible to combine the genomic data with health data from other registries. The system also collects metadata about the sequencing and data-derivation process.
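
The relation between an evolving dataset and its fixed versions can be sketched as below. The class names, field names, and 1-based version numbering are assumptions for illustration; the record carries only the minimal personal metadata listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndividualRecord:
    """Minimal personal metadata for one bio-sample (hypothetical shape)."""
    pseudonym: str
    biosample_id: str
    age_at_collection: Optional[int]
    sex: Optional[str]
    ethnicity: Optional[str]


class Dataset:
    """A dataset that keeps changing, with immutable version snapshots."""

    def __init__(self, dataset_id):
        self.dataset_id = dataset_id
        self._working = {}   # biosample_id -> IndividualRecord
        self._versions = []  # immutable snapshots, oldest first

    def upsert(self, record):
        """Add a record or replace the one with the same bio-sample ID."""
        self._working[record.biosample_id] = record

    def freeze(self):
        """Snapshot the current records; return the 1-based version number."""
        self._versions.append(tuple(self._working.values()))
        return len(self._versions)

    def version(self, number):
        """Return the records of a frozen version, which never change."""
        if not 1 <= number <= len(self._versions):
            raise KeyError(f"unknown version {number}")
        return self._versions[number - 1]
```

A research cohort that refers to a specific version keeps seeing the same records even after the data provider extends the dataset.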

It is important to note that bioinformatics uses various data formats, each with a different precision and purpose. Therefore, a GDI node should be tolerant of different file formats. However, the system is expected to have special support for BAM, CRAM, VCF, and BCF files so that specific reads or variants can be filtered. The system should accept files with multiple individuals, as long as it can filter an individual’s data from the file.
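
One way to express this tolerance is to classify files by the kind of filtering the node can offer and to treat everything else as opaque. The sketch below decides by file extension only, which is a simplifying assumption; the `support_level` function and the `"opaque"` label are hypothetical.

```python
from pathlib import PurePosixPath

# Formats with special filtering support, and what can be filtered in them.
FILTERABLE = {"bam": "reads", "cram": "reads", "vcf": "variants", "bcf": "variants"}
# Compression suffixes that may follow the format suffix (assumed list).
COMPRESSION = {"gz", "bgz"}


def support_level(filename):
    """Return 'reads', 'variants', or 'opaque' for a file name."""
    suffixes = [s.lstrip(".").lower() for s in PurePosixPath(filename).suffixes]
    # Ignore trailing compression suffixes such as ".gz".
    while suffixes and suffixes[-1] in COMPRESSION:
        suffixes.pop()
    fmt = suffixes[-1] if suffixes else ""
    # Unknown formats are still accepted, only without filtering support.
    return FILTERABLE.get(fmt, "opaque")
```

Files classified as opaque would still be accepted and shared as a whole, while the four supported formats could additionally be restricted to the requested ranges and individuals.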

The GDI nodes should also respect the right of individuals to opt out of data sharing, or to opt in again. This process is to be supported by the data providers, who can make the changes in the system. However, such a change only affects future research cohorts, not the ones whose data has already been shared with researchers.
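
The rule that a change applies only to future cohorts can be sketched with time-stamped consent events: a cohort includes an individual if that individual was shareable at the moment the cohort data was prepared. The `ConsentRegistry` class is a hypothetical illustration, and it assumes that individuals are shareable until an opt-out is recorded.

```python
from datetime import datetime


class ConsentRegistry:
    """Time-stamped opt-out and opt-in events per pseudonym."""

    def __init__(self):
        self._events = {}  # pseudonym -> list of (timestamp, opted_out)

    def record(self, pseudonym: str, when: datetime, opted_out: bool) -> None:
        """Store an opt-out (True) or opt-in (False) made by a data provider."""
        events = self._events.setdefault(pseudonym, [])
        events.append((when, opted_out))
        events.sort(key=lambda event: event[0])

    def is_shareable(self, pseudonym: str, when: datetime) -> bool:
        """Was the individual's data shareable at the given moment?"""
        shareable = True  # Assumption: shareable unless an opt-out exists.
        for timestamp, opted_out in self._events.get(pseudonym, []):
            if timestamp <= when:
                shareable = not opted_out
        return shareable
```

Evaluating consent at the cohort's preparation time means that a later opt-out leaves already shared cohorts unchanged, while any cohort prepared afterwards excludes the individual.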

To summarise, local GDI nodes have several roles that support the general GDI process at the local level:

  1. Help-desk: manage organisations, members, and permissions.
  2. Data Providers:
    • manage their organisation, members, and permissions,
    • register and manage the visibility of their genomic datasets,
    • can veto data-access requests,
    • can approve or reject SPE data export requests.

On the technical side, local GDI nodes must

  1. expose their data catalogue to the User Portal,
  2. expose their Beacon API to the Beacon Network,
  3. enable secure dataset management for data providers,
  4. provide a secure processing environment (SPE) for approved research.

Local GDI nodes may include their national trusted authentication services.

This architecture document explains the technical solution that supports the local GDI process described above. Note that the local GDI process itself is designed to support the general GDI process.

Quality Goals

The main requirements for the system are

  • security:
    • sensitive data must be protected (encryption, access-control, SPE);
    • access must be authorised (two-factor authentication);
    • actions must be logged for auditing;
  • cost-effectiveness:
    • minimal costs for maintaining the system while the usage is low;
    • researchers may have to cover their SPE resource costs;
  • flexibility:
    • support custom data-formats;
    • research data can be manually prepared for SPE;
    • permit inclusion of external databases and health data registries;
  • usability:
    • simple enough to learn and navigate quickly;
    • the services can be accessed through common web browsers.

Stakeholders

Currently, the following stakeholders have been identified:

  1. University of Tartu – data processor of the Estonian Biobank, also developing and operating the GDI node software.
    1. Institute of Genomics as one data-provider and a node manager
    2. High-Performance Computing Center (HPC) as the IT-service provider
    3. Institute of Computer Science as the system designer and developer
  2. Ministry of Social Affairs in Estonia – the main funder, also specifies the legal framework for the use of genomic and health data.
  3. TEHIK – IT service provider for national health-care information systems and clinical health-data.