Constraints
This section describes important constraints that have been identified and should be considered in the design of the architecture.
Existing Infrastructure
- The Estonian Biobank (a research institution) uses the High-Performance Computing Center (HPC) of the University of Tartu for large-scale data storage and data processing.
- TEHIK hosts the national health data (but not yet genomic data), which is protected by law, contracts, and technical means.
- Tartu University Hospital has its own technical infrastructure for sequencing and storing full genomes (the clinical use case).
- Kubernetes is currently the most common deployment environment. MinIO is the most likely storage service. Both are already in place and in active use.
Data Scope
- The GDI node needs to be able to incorporate full genomic data from both clinical and research institutions.
- Consistent pseudonyms need to be used for linking data to individuals.
- Data providers maintain their data, but they can also use it within the node.
- Data providers can update their data (more samples, new reference genomes).
- VCF is the preferred data format for research (and for extracting genomic variations for data discovery).
- Other preferred file formats: BAM, CRAM.
- The system should limit the size of files based on the storage system.
- Internally, the system must store its data files encrypted.
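
The requirement for consistent pseudonyms can be illustrated with a keyed hash. This is only a sketch under assumptions: the source does not say how pseudonyms are generated, so the use of HMAC-SHA256, the function name `pseudonymize`, the key, and the identifiers below are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize(personal_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a personal identifier.

    Assumption: a keyed hash (HMAC-SHA256) is acceptable. The same input
    always yields the same pseudonym, and the mapping cannot be recomputed
    without the key.
    """
    digest = hmac.new(secret_key, personal_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical key and identifiers, for illustration only.
key = b"node-level-secret"
clinical_record = pseudonymize("EXAMPLE-ID-001", key)
research_record = pseudonymize("EXAMPLE-ID-001", key)
other_person = pseudonymize("EXAMPLE-ID-002", key)

# The same person gets the same pseudonym regardless of which provider
# submitted the data; a different person gets a different pseudonym.
print(clinical_record == research_record)  # True
print(clinical_record == other_person)     # False
```

With such a scheme, clinical and research data about the same individual could be linked inside the node without storing the original identifier, provided that every provider's data is pseudonymised with the same key.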
Interfaces
- The GDI node must provide a web-based user interface: a website with documentation, contacts, and an administration portal.
- User Portal functionality at the GDI node level is initially out of scope.
- The GDI node must support user authentication through the national authentication service (e.g. TARA in Estonia).
- The GDI node must provide the following interfaces to the central User Portal:
  - Data Catalogue
  - Beacon v2 API
- The GDI node must provide an SPE (e.g. SAPU in HPC) for approved research.
- The GDI node should interface with existing storage solutions (S3).
- At this stage, other interfaces (e.g. to TEHIK and hospitals) are not yet planned, but the design should leave room for them.
System Management
- The system must support multiple organisations, each of which can manage its own users.
- The system must provide a set of permissions for controlling the actions permitted to each user.
- The system and user activity must be auditable.
- Background processes need to be visible in the system (viewing them may require a permission).
- The help-desk needs elevated management permissions.
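
The combination of per-user permissions and auditable activity could look roughly like the sketch below. All names here (`Permission`, `User`, `AuditLog`, `authorize`) and the specific permission values are hypothetical and are not taken from the GDI node design. The sketch only shows that every authorization decision, whether granted or denied, leaves an audit entry.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Permission(Enum):
    # Hypothetical permission names, for illustration only.
    MANAGE_USERS = "manage_users"
    MANAGE_DATASETS = "manage_datasets"
    VIEW_BACKGROUND_JOBS = "view_background_jobs"


@dataclass
class User:
    username: str
    organisation: str
    permissions: set = field(default_factory=set)


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, user: User, action: Permission, allowed: bool) -> None:
        # Store who attempted what, when, and whether it was allowed.
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user.username,
            "organisation": user.organisation,
            "action": action.value,
            "allowed": allowed,
        })


def authorize(user: User, action: Permission, audit: AuditLog) -> bool:
    """Check a permission and record the decision in the audit log."""
    allowed = action in user.permissions
    audit.record(user, action, allowed)
    return allowed


audit = AuditLog()
alice = User("alice", "org-a", {Permission.MANAGE_USERS})

print(authorize(alice, Permission.MANAGE_USERS, audit))     # True
print(authorize(alice, Permission.MANAGE_DATASETS, audit))  # False
print(len(audit.entries))                                   # 2
```

Under this model, a help-desk account would simply be a user holding a broader permission set, and visibility of background processes would be guarded by a permission such as the hypothetical `VIEW_BACKGROUND_JOBS`.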
Data Management
- Data providers manage their data as datasets, which consist of genomic data and metadata.
- Genomic data is stored by the data provider in its own storage, to which the GDI node system is given read access.
- Datasets are formed by dataset manifests that reference the dataset files by path.
- Datasets support versioning (through a naming convention).
- Visibility and accessibility of dataset-versions (but not content) must be user-controlled through the GDI node system.
- The system may support data-processing pipelines for various needs. For example: generating Beacon-compatible data from a VCF.
- The GDI node system must enable schema management for dataset manifests, as there can be more than one valid schema depending on the data use case.
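
As an illustration of the pipeline example mentioned above (generating Beacon-compatible data from a VCF), the following sketch extracts one record per alternate allele from VCF data lines. It is a simplified assumption of what such a step might do: the function name `extract_variants` is hypothetical, the output keys only loosely follow the Beacon v2 variant query parameters, and a real pipeline would more likely use a dedicated VCF library and the full Beacon data model.

```python
def extract_variants(vcf_lines):
    """Yield one simplified variant record per ALT allele.

    VCF data lines are tab-separated; the first five columns are
    CHROM, POS, ID, REF and ALT. Header lines start with '#'.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):
            if allele in (".", "*"):
                continue  # no alternate allele, or an overlapping deletion
            yield {
                "referenceName": chrom,
                # VCF positions are 1-based; this sketch converts them to
                # the 0-based start coordinate used by Beacon v2 queries.
                "start": int(pos) - 1,
                "referenceBases": ref,
                "alternateBases": allele,
            }


# A tiny made-up VCF fragment, for illustration only.
sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG,T\t50\tPASS\t.",
    "2\t200\trs1\tC\t.\t.\t.\t.",
]

variants = list(extract_variants(sample))
for variant in variants:
    print(variant)
```

The multi-allelic line on chromosome 1 yields two records, while the line without an alternate allele yields none.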
Data Access
- Before requesting access, a data-cohort must be defined in the system.
- Data providers whose data is included in the cohort must be notified of the data-access request.
- The data providers may veto the use of their data (within a predefined time-frame) before approved access becomes effective.
- Data providers may use user-permissions to permit data processing within the GDI node for their members.
- For external researchers, the system must check the GA4GH Passport and Visa information in the JWT to verify the user’s permission to read data.
- The genomic data within the GDI node storage cannot be exported from the system.
Data Analysis
- Setting up an SPE for data analysis is a manual interactive process between the researcher and the node help-desk. It includes signing a legal contract.
- The GDI node may charge the researcher for consumed resources (e.g. data storage, CPU, and memory).
- A data provider may require a review of the analysis results before the researcher can view them.
- The data provider has read-only access to the research project, output files, and logs.