Constraints

This section describes important constraints that have been identified and should be considered in the design of the architecture.

Existing infrastructure

The Estonian Biobank (research institution) uses the High-Performance Computing Center (HPC) of the University of Tartu for large-scale data storage and data processing.
TEHIK hosts the national health data (but not yet genomic data), which is protected by law, contracts, and technical means.
Tartu University Hospital has its own technical infrastructure for sequencing and storing full genomes (clinical case).
Kubernetes is currently the most common deployment environment, and MinIO is the most likely storage service. Both are already in place and in active use.

The GDI node needs to be able to incorporate full genomic data from both clinical and research institutions.
Consistent pseudonyms need to be used for linking data to individuals.
Data providers maintain their data, but they can also use it within the node.
Data providers can update their data (more samples, new reference genomes).
VCF is the preferred data format for research (and for extracting genomic variations for data discovery).
Other preferred file formats: BAM, CRAM.
The system should limit the size of files based on the limits of the storage system.
Internally, the system must store its data files encrypted.
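
One common way to obtain consistent pseudonyms is keyed hashing: the same personal identifier always maps to the same pseudonym, while the mapping cannot be reproduced without the key. The following is a minimal sketch under that assumption, not a description of the method the node actually uses; the function name, the key, and the example identifier are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(personal_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a personal identifier using HMAC-SHA256."""
    digest = hmac.new(secret_key, personal_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical key and identifier, for illustration only.
key = b"node-wide-secret-kept-outside-the-data-store"

# The same person arriving from a clinical and a research data provider
# receives the same pseudonym, so their records can be linked.
clinical = pseudonymise("EXAMPLE-ID-001", key)
research = pseudonymise("EXAMPLE-ID-001", key)
```

With this approach, consistency across data providers depends on all of them using the same key and the same identifier format.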

The GDI node must provide a web-based user-interface: website with documentation, contacts, and administration portal.
- User Portal functionality at GDI node level is initially out of scope.
The GDI node must support user authentication through the national authentication service (e.g. TARA in Estonia).
The GDI node must provide the following interfaces to the central User Portal:
- Data Catalogue
- Beacon v2 API
The GDI node must provide an SPE (e.g. SAPU in HPC) for approved research.
The GDI node should interface with existing storage solutions (S3).
At this stage, no other interfaces are planned, but the design should leave room for them (e.g. TEHIK, hospitals).

The system must support more than one organisation, each of which can manage its own users.
The system must provide a set of permissions for managing permitted actions per user.
The system and user activity must be auditable.
Background processes need to be visible in the system (may require a permission).
The help-desk needs elevated management permissions.
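
The organisation, permission, and audit constraints above can be sketched together as follows. This is only an illustration under simple assumptions: the `User` type, the permission names, and the in-memory `audit_log` are hypothetical and do not describe the node's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class User:
    username: str
    organisation: str
    permissions: set[str] = field(default_factory=set)


# Hypothetical in-memory audit trail; a real system would persist this.
audit_log: list[dict] = []


def is_allowed(user: User, action: str, organisation: str) -> bool:
    """Permit an action only inside the user's own organisation and only if granted."""
    return user.organisation == organisation and action in user.permissions


def perform(user: User, action: str, organisation: str) -> bool:
    """Check the permission and record the attempt, whether it was allowed or not."""
    allowed = is_allowed(user, action, organisation)
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user.username,
        "organisation": organisation,
        "action": action,
        "allowed": allowed,
    })
    return allowed


# Hypothetical user and permission names.
alice = User("alice", "biobank", {"dataset:publish", "jobs:view"})
perform(alice, "dataset:publish", "biobank")   # allowed
perform(alice, "dataset:publish", "hospital")  # denied: different organisation
```

Recording denied attempts as well as allowed ones keeps the audit trail useful for investigating misuse.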

Data providers manage their data as datasets, which consist of genomic data and metadata.
Genomic data is stored by the data provider in its own storage, to which the GDI node system is given read access.
Datasets are formed by dataset manifests that reference the dataset files by path.
Datasets support versioning (through a naming convention).
Visibility and accessibility of dataset-versions (but not content) must be user-controlled through the GDI node system.
The system may support data-processing pipelines for various needs. For example: generating Beacon-compatible data from a VCF.
The GDI node system must enable schema management for dataset manifests, as there can be more than one valid schema depending on the data use case.
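
To illustrate how a manifest that references files by path might be checked against one of several schemas, here is a minimal sketch. The manifest layout, the schema names, and the field names are all hypothetical; the source does not define them. Only the file formats (VCF, BAM, CRAM) come from the constraints above.

```python
# Hypothetical schemas: each lists the fields a manifest of that kind must have.
SCHEMAS = {
    "research-vcf": {"schema", "dataset", "files"},
    "clinical-wgs": {"schema", "dataset", "files", "reference_genome"},
}

# File extensions for the formats named in this section.
ALLOWED_SUFFIXES = (".vcf", ".vcf.gz", ".bam", ".cram")


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest is valid."""
    schema_name = manifest.get("schema")
    if schema_name not in SCHEMAS:
        return [f"unknown schema: {schema_name!r}"]
    missing = sorted(SCHEMAS[schema_name] - set(manifest))
    problems = [f"missing field: {name}" for name in missing]
    for path in manifest.get("files", []):
        if not path.endswith(ALLOWED_SUFFIXES):
            problems.append(f"unsupported file format: {path}")
    return problems


# Hypothetical manifest: files are referenced by path in the provider's storage.
manifest = {
    "schema": "research-vcf",
    "dataset": "example-cohort",
    "files": ["example-cohort/chr1.vcf.gz", "example-cohort/notes.txt"],
}
problems = validate_manifest(manifest)
```

Selecting the schema by a field inside the manifest is one possible design; the schema could equally be chosen by the data provider or by the dataset type.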

Before requesting access, a data-cohort must be defined in the system.
The data providers whose data is included in the cohort must be notified of the data-access request.
The data providers may veto the use of their data (within a predefined time-frame) before approved access becomes effective.
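
The veto rule can be read as a simple time-window check. The sketch below assumes that a veto removes only the vetoing provider's data from the approved access and that nothing is accessible until the window has passed; both are interpretations, and the 14-day window is a hypothetical value, since no duration is stated here.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical veto window; the actual time-frame is not defined in this section.
VETO_WINDOW = timedelta(days=14)


def effective_providers(
    providers: set[str],
    vetoes: set[str],
    approved_at: datetime,
    now: datetime,
) -> set[str]:
    """Return the providers whose data is accessible at the given moment."""
    if now < approved_at + VETO_WINDOW:
        return set()  # approval is not yet effective for anyone
    return providers - vetoes


approved = datetime(2024, 1, 1, tzinfo=timezone.utc)
cohort_providers = {"biobank", "hospital"}

during_window = effective_providers(
    cohort_providers, {"hospital"}, approved, approved + timedelta(days=3)
)
after_window = effective_providers(
    cohort_providers, {"hospital"}, approved, approved + timedelta(days=15)
)
```
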
Data providers may use user-permissions to permit data processing within the GDI node for their members.
In the case of external researchers, the system must check the GA4GH Passport and Visa information in the JWT to verify the user’s permission to read data.
The genomic data within the GDI node storage cannot be exported from the system.
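
The Passport and Visa check for external researchers could look roughly like the sketch below. It relies on the GA4GH Passport claim names `ga4gh_passport_v1` (a list of Visa JWTs) and `ga4gh_visa_v1` (the Visa object with `type` and `value`), and on the `ControlledAccessGrants` Visa type. The function names and the dataset identifier format are hypothetical. Note the important simplification: this sketch decodes tokens without verifying signatures or issuers, which a real implementation must do before trusting any claim.

```python
import base64
import json


def decode_payload(jwt: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload_b64 = jwt.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def has_dataset_grant(passport_claims: dict, dataset_id: str, now: float) -> bool:
    """Return True if any unexpired Visa grants controlled access to the dataset."""
    for visa_jwt in passport_claims.get("ga4gh_passport_v1", []):
        visa = decode_payload(visa_jwt)
        grant = visa.get("ga4gh_visa_v1", {})
        if (
            grant.get("type") == "ControlledAccessGrants"
            and grant.get("value") == dataset_id
            and visa.get("exp", 0) > now  # Visa must not be expired
        ):
            return True
    return False
```

A caller would pass the decoded Passport claims, the identifier of the requested dataset, and the current Unix time, for example `has_dataset_grant(claims, dataset_id, time.time())`.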

Setting up an SPE for data analysis is a manual interactive process between the researcher and the node help-desk. It includes signing a legal contract.
The GDI node may charge the researcher for consumed resources (e.g. data storage, CPU, and memory).
The data provider may require a review of analysis results before the researcher can view them.
The data provider has read-only access to the research project, output files, and logs.