Solution Strategy

The system is complex: it must support different workflows and integrations, and it must be secure. The chosen strategies are explained below; in summary:

  1. Decomposition of components by business task
  2. Advanced storage solution: encrypt local files, support file minimisation, push and pull
  3. Flexible metadata management
  4. Internal data-processing pipelines
  5. Enable data-processing in the system (SPE) over APIs
  6. Pro-active system monitoring and visibility of events

Decomposition

The overview sections covered the many tasks that a GDI node must fulfil. In order to manage this complexity, this architecture proposes splitting the business tasks into microservices while presenting the functionality in a user-friendly web-based user-interface.

The GDI node backend is decomposed into the following microservices:

  • node-session – user authentication and establishing a local session;
  • node-manager – managing organisations, users, and organisational permissions; exposing node-level monitoring info;
  • data-manager – registration of genomic datasets; visibility management;
  • data-pipeline – managing and running registered pipeline scripts;

At this stage, the architecture provisions the four microservices listed above. (The list may change before and after version 1 of this architecture document.)

To serve the end-users, a GDI node needs to provide a single harmonised web-based user-interface consisting of

  1. portal/introduction – a website for helping users (including local data-providers) to understand the GDI and start using it;
  2. documentation – user-supporting material for performing specific tasks;
  3. contact-information – where to write or call to get human support;
  4. administration – depending on the user’s permissions, helps to manage the node-level information;
  5. data-management – depending on the user’s permissions, helps to ingest and manage genomic data and metadata

Once again, this UI has to look and feel like a single UI, where the availability of features depends on the user’s authentication and permissions. If a user reaches a non-accessible part of the UI, error messages have to state clearly why access is not allowed.

Development Tools

Besides the requirement that the software must be deployed in Docker containers, here is the list of used technologies (a minimal backend sketch follows the list):

  1. Backend microservices:
    • programming language: Python 3 (latest stable version)
    • REST+JSON API development: FastAPI
    • data-models: Pydantic
    • rely on the technology toolkit (logging, config, etc.) of the Python standard library
    • configuration via environment variables, YAML files, and Vault secret management
  2. Frontend user-interface:
    • content-management: Hugo static website generator
    • dynamic pages: Vue.js JavaScript framework
    • background communication with the GDI node API over Fetch API and WebSocket
  3. Continuous integration:
    • building and publishing Docker images
    • code quality checks (e.g. Black and ESLint)
    • running various tests (at build, or scheduled)
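
To illustrate the chosen backend stack, here is a minimal sketch of a FastAPI endpoint with a Pydantic data-model. All names below (Dataset, /datasets, GDI_NODE_NAME) are illustrative assumptions, not part of the architecture:

    # Sketch: a GDI node backend endpoint (FastAPI + Pydantic).
    # Service, model, and variable names are illustrative assumptions.
    import os

    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    # Configuration via environment variables, as per the strategy above.
    NODE_NAME = os.environ.get("GDI_NODE_NAME", "example-node")

    app = FastAPI(title=f"data-manager ({NODE_NAME})")

    class Dataset(BaseModel):
        # Data-model validated by Pydantic before the handler runs.
        id: str
        title: str
        visibility: str = "private"

    DATASETS: dict[str, Dataset] = {}  # in-memory stand-in for a real store

    @app.post("/datasets", status_code=201)
    def register_dataset(dataset: Dataset) -> Dataset:
        if dataset.id in DATASETS:
            # Clear error messages, as required by the UI strategy above.
            raise HTTPException(status_code=409, detail="dataset id already registered")
        DATASETS[dataset.id] = dataset
        return dataset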

Storage

GDI nodes need to respect the existence of storage systems already holding rich data, and should not attempt to replace them. In addition, data-providers prefer to work in one storage system (their own, which is closer to them and, therefore, also faster). To avoid data duplication (the files are very big – gigabytes to terabytes), the GDI node system should simply integrate with the storage systems of data providers without migrating the data. In case a pipeline or SPE needs the data (a specific file), it is streamed through the GDI node system, which adds another layer of permission checks.

Due to the data size and deployment environment (Kubernetes), the preferred data storage solution is MinIO. Stored files need to be encrypted, preferably using crypt4gh. The properties of the file and the encryption need to be stored separately.
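
As a minimal sketch (assuming the crypt4gh and minio Python packages; the key paths, endpoint, and bucket name below are hypothetical), a file could be encrypted and pushed like this:

    # Sketch: encrypt a file with crypt4gh, then push it to MinIO object storage.
    # Key paths, endpoint, and bucket name are illustrative assumptions.
    import os

    from crypt4gh.keys import get_private_key, get_public_key
    from crypt4gh.lib import encrypt
    from minio import Minio

    # Load the node's private key and the recipient's public key.
    seckey = get_private_key("node.sec", callback=lambda: os.environ["C4GH_PASSPHRASE"])
    recipient = get_public_key("recipient.pub")

    with open("sample.vcf", "rb") as infile, open("sample.vcf.c4gh", "wb") as outfile:
        # keys: list of (method, sender_private_key, recipient_public_key); method 0 = X25519.
        encrypt([(0, seckey, recipient)], infile, outfile)

    # Push the encrypted file; file properties and keys are stored separately.
    client = Minio(
        "minio.gdi-node.local",
        access_key=os.environ["MINIO_ACCESS_KEY"],
        secret_key=os.environ["MINIO_SECRET_KEY"],
    )
    client.fput_object("genomic-data", "sample.vcf.c4gh", "sample.vcf.c4gh")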

Metadata

Metadata management is complex, as the metadata model will almost certainly evolve over time. The system must be flexible enough to adopt new fields or upgrade existing ones. When the system needs to export data in a specific format, the internal data-structure does not have to match the exported one; it only has to be adaptable (convertible) to the target format.

Often the metadata needs predefined ontology values for specific fields. The system must enforce that users provide correct values. For this purpose, the GDI node system is going to define its own metadata model that is easy for data providers to fill in, and easy to convert into the targeted destination model/format.
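
A minimal sketch of this idea, assuming Pydantic (the field names, ontology terms, and target format below are hypothetical):

    # Sketch: internal metadata model with ontology validation and export conversion.
    # Field names, ontology terms, and the target format are illustrative assumptions.
    from pydantic import BaseModel, field_validator

    ALLOWED_ORGANISMS = {"NCBITaxon:9606", "NCBITaxon:10090"}  # human, mouse

    class InternalDataset(BaseModel):
        # Extra fields are tolerated so the model can evolve over time.
        model_config = {"extra": "allow"}

        accession: str
        organism: str

        @field_validator("organism")
        @classmethod
        def organism_must_be_known(cls, value: str) -> str:
            # Enforce that users provide a known ontology term.
            if value not in ALLOWED_ORGANISMS:
                raise ValueError(f"unknown ontology term: {value}")
            return value

    def to_target_format(dataset: InternalDataset) -> dict:
        # Convert the internal structure to a (hypothetical) export format.
        return {"id": dataset.accession, "taxon_id": dataset.organism}

    record = InternalDataset(accession="GDI-0001", organism="NCBITaxon:9606")
    print(to_target_format(record))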

Pipelines

As is very common in the field of bioinformatics, data-processing pipelines are essential for working with big data. These pipelines are needed by researchers, but GDI nodes can also take advantage of them for their internal data processing.

As the GDI node software targets Kubernetes-based deployment, the pipeline tasks will be run as Kubernetes jobs. The system is expected to record the executions and store the final result. Pipelines may access storage data on behalf of the user that triggered the pipeline. Initially, the pipelines will be designed for extracting and preparing the data catalogue (including Beacon).
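
A minimal sketch of launching such a pipeline task, assuming the official kubernetes Python client (the image, namespace, and job name are hypothetical):

    # Sketch: run a registered pipeline script as a Kubernetes Job.
    # Image, namespace, and job name are illustrative assumptions.
    from kubernetes import client, config

    config.load_incluster_config()  # use load_kube_config() outside the cluster

    container = client.V1Container(
        name="pipeline",
        image="gdi-node/pipeline-catalogue:latest",
        args=["--dataset", "GDI-0001"],
    )
    pod_spec = client.V1PodSpec(restart_policy="Never", containers=[container])
    template = client.V1PodTemplateSpec(spec=pod_spec)
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="catalogue-extract-gdi-0001"),
        spec=client.V1JobSpec(template=template, backoff_limit=0),
    )

    batch = client.BatchV1Api()
    batch.create_namespaced_job(namespace="gdi-pipelines", body=job)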

Secure Processing Environment

Services in the GDI node that permit analysis of the permitted data need to conform to the requirements of a “Secure Processing Environment” (SPE). Until these requirements are officially defined, here are the baseline assumptions (an isolation sketch follows the list):

  1. Execution must work in isolation: no network, no shared file-system access.
  2. Script may have external dependencies (files) to be downloaded before script execution (on behalf of the local help-desk).
  3. User activity in the environment must be logged for future auditing.
  4. Script logs and output-files may require review by the data provider before the researcher can access them.
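
Building on the Kubernetes Job sketch above, the isolation could, under these assumptions, be approximated at the pod level. The labels below are hypothetical, and a matching deny-all NetworkPolicy is assumed to exist in the namespace:

    # Sketch: pod-level isolation for an SPE job (no network, no shared file-system).
    # Assumes a deny-all NetworkPolicy selecting the "spe-isolated" label already exists.
    from kubernetes import client

    container = client.V1Container(
        name="spe-script",
        image="gdi-node/spe-runner:latest",
        # Only a private, ephemeral scratch volume; no shared file-systems are mounted.
        volume_mounts=[client.V1VolumeMount(name="scratch", mount_path="/scratch")],
    )
    pod_spec = client.V1PodSpec(
        restart_policy="Never",
        containers=[container],
        automount_service_account_token=False,  # no Kubernetes API access
        volumes=[client.V1Volume(name="scratch", empty_dir=client.V1EmptyDirVolumeSource())],
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"network": "spe-isolated"}),
        spec=pod_spec,
    )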

Note that GDI nodes may provide additional “secure processing environment” options, for example a graphical Linux environment running in a VM. How this option is introduced to the user, its setup, and its access management are out of the scope of this document.

Auditing Events

The GDI node must record critical interaction events (who, what, when, where), including system activities.

For viewing the recorded events:

  1. users can view their events;
  2. special permission enables the user to view all logs (by user/system, time-range, component, and/or priority).

Events must be stored separately from (meta/genomic) data, and also regularly backed up and archived.
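
A minimal sketch of such an event record, assuming Pydantic (the fields mirror the who/what/when/where requirement; class and field names are hypothetical):

    # Sketch: an audit event record (who, what, when, where).
    # Class and field names are illustrative assumptions.
    from datetime import datetime, timezone

    from pydantic import BaseModel

    class AuditEvent(BaseModel):
        who: str        # user or system component identifier
        what: str       # action performed, e.g. "dataset.register"
        when: datetime  # timestamp in UTC
        where: str      # component / endpoint where the action happened
        priority: str = "info"

    event = AuditEvent(
        who="alice@example.org",
        what="dataset.register",
        when=datetime.now(timezone.utc),
        where="data-manager:/datasets",
    )
    # Events are stored separately from (meta/genomic) data, e.g. appended to an event store.
    print(event.model_dump_json())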

System Notifications

Based on the assumption that users do not visit the system regularly, the system must notify users of asynchronous, time-consuming action-points. Notifications need to be stored in the system and be accessible through the user-interface. In addition, notifications need to be delivered via e-mail.
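
A minimal sketch of the e-mail delivery side, using only the Python standard library (the SMTP host and addresses are hypothetical; storing the notification is assumed to happen before sending):

    # Sketch: deliver a stored notification via e-mail (standard library only).
    # SMTP host and addresses are illustrative assumptions.
    import smtplib
    from email.message import EmailMessage

    def send_notification(recipient: str, subject: str, body: str) -> None:
        msg = EmailMessage()
        msg["From"] = "noreply@gdi-node.example.org"
        msg["To"] = recipient
        msg["Subject"] = subject
        msg.set_content(body)
        with smtplib.SMTP("smtp.gdi-node.example.org") as smtp:
            smtp.send_message(msg)

    send_notification(
        "alice@example.org",
        "Pipeline finished",
        "Your pipeline 'catalogue-extract' has completed. See the UI for results.",
    )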