Architectural Decisions
Important architecture-related decisions are noted here. The convention for documenting decisions:
- choose the most appropriate section
- explain the context of the decision
- explain the considered alternatives
- add “Since” when the architecture document version is greater than 1.0.
Development Tools
Python as the main programming language
Context: we prefer to use a single programming language for all development, as it is easier for developers to switch between components. This preference does not extend to externally developed components (e.g. Funnel in Go, Molgenis in Java). The team chose Python (latest stable version) as they have the most experience with it. It is a very popular open-source programming language and therefore has good community support. The team is confident that Python can meet the demands of big data, as the language is widely used in data science.
Alternatives: Go and Rust were considered, as they are compiled languages that are also commonly used for developing API services. However, the development team does not have strong experience with them.
Poetry as a dependency management tool
Context: the Python code has dependent libraries that need to be managed by version number. There are many alternatives available, each with their own style. The team has good experience with Poetry, so they decided to continue with that. The tool has low impact and can be switched easily, should it become stale. Pyenv is the preferred tool for having a virtual Python environment locally (not used on deployment).
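As an illustration, a minimal Poetry-managed pyproject.toml might look like the following sketch; the project name, version, and dependency pins are hypothetical, not the actual project configuration:

```toml
[tool.poetry]
name = "gdi-node"          # hypothetical project name
version = "0.1.0"
description = "Example service packaged with Poetry"

[tool.poetry.dependencies]
python = "^3.12"           # latest stable Python at the time of writing

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

Dependencies are then resolved and locked with `poetry lock` and installed with `poetry install`, keeping versions reproducible across developer machines and deployments.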
Alternatives: pip, conda.
Docker as the main container builder and runner
Context: since the software is delivered as Docker images, development needs to target container-based deployment early on. This includes figuring out the best configuration and storage methods. Although the target environment is Kubernetes (linux/amd64 architecture), developers also need to write Docker Compose scripts for running the containerised software locally.
Running the containers under non-privileged operating system roles is an important but sometimes difficult challenge. Nevertheless, no container running on Kubernetes may require privileged (root user) access.
Alternatives: Podman and Singularity also run Docker images, and it is possible that the Kubernetes environment may use one of these tools instead of Docker. All of these tools support Docker images, as the format is standardised. The benefit of Podman and Singularity is that they execute containers in non-privileged mode by default.
Storages
Minio as the default file-storage solution
Context: the system needs a solid file-storage solution for a multi-node deployment, which would be suitable for the Kubernetes-based deployment. Minio is a popular solution that provides an Amazon S3 compatible, HTTP-based API, as well as access management. A Minio storage service can serve multiple deployments. The benefit of choosing Minio is that it is already deployed on many existing infrastructures (including the HPC centre at the University of Tartu).
Alternatives: Longhorn is a popular storage solution for Kubernetes; however, in that case Kubernetes is more involved in the data management (persistent volumes and persistent volume claims). The advantage of Minio is that the application can communicate directly with the storage service: the software uses the Minio API for interacting with the content, not a file-system API. In addition, Longhorn is more bound by the limits of the chosen file-system, which might be an issue with very large files (Minio supports files up to 50 TiB). Traditional remote storage solutions (NFS, FTP) are considered too cumbersome to manage. The network file-system GPFS, although already actively used, is too similar to a traditional file-system and, as a technical choice, not as convenient to deploy as Minio.
PostgreSQL as the relational database management software
Context: the software must store its data in a scalable and easily retrievable manner. As a common practice, a relational database is the default option, as administrators already have experience with it. In addition, relational databases provide good access-permission and user-management options. In this architecture, PostgreSQL was chosen to provide the relational database system, as it is well-known and the experience so far has been positive. PostgreSQL is open-source software with a strong community and a good track record.
Alternatives: there are many alternatives (e.g. MySQL, or NoSQL databases like MongoDB); however, their advanced features do not provide additional value for managing the system data.
Security
Encryption of genomic data with crypt4gh
Context: although file-systems (including the Minio service) can provide data encryption when content is stored on disk, the data is still exposed if the user credentials should leak. Therefore, the file content should be additionally encrypted, with the keys stored separately from the file-storage. An attacker would have to gain access to the encryption keys, the file-storage credentials, and the file-storage itself to decipher the content.
Crypt4gh, a standard from GA4GH, was chosen by the GDI project as the recommended choice for encrypting the content. Its strength is handling encryption of very large files while sharing access to them securely and efficiently. Crypt4gh encrypts the content key for each recipient in the file header. It supports modifying the file header (especially when streaming) to add or remove recipients based on their provided public keys. This flexibility enables the software to conveniently serve the data encrypted per user.
Alternatives: were not considered for this architecture as the suggestion was agreed on the general level of GDI.
File-checksums using MD5 and SHA-256
Context: the system needs to protect the integrity of the genomic data, and the best way to do that is with checksum computation. Even if a single bit in the content changes, so does the checksum, which signals that the content is not the same. The MD5 checksum algorithm is so common that it is hard to avoid; however, SHA-256 is more collision-resistant. At least one (or, even better, both) of these checksums must be provided.
Although the checksum cannot be used for restoring the content, it can signal the content-change, so that the content could be recovered from a backup.
Checksum computation should also cover the decrypted content (i.e., the plaintext within an encrypted file). The system should also store the file size, as comparing this number is faster than computing a checksum.
This information should be exposed by storage solutions. The system should keep a copy of the properties to enable content verification.
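As a sketch, both checksums and the file size can be computed in a single streaming pass, so even very large files never need to fit in memory; the function name and chunk size below are illustrative choices, not part of the system:

```python
import hashlib

def file_fingerprint(stream, chunk_size=1 << 20):
    """Compute MD5, SHA-256, and total size in one pass over a binary stream.

    Reading in chunks (1 MiB here) keeps memory use constant regardless
    of file size, which matters for multi-TiB genomic files.
    """
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    size = 0
    while chunk := stream.read(chunk_size):
        md5.update(chunk)
        sha256.update(chunk)
        size += len(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest(), "size": size}
```

In practice the stream would be an open file or a Minio object body; comparing the stored size first allows skipping the more expensive checksum comparison when the sizes already differ.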
Alternatives: there are many hash-functions available for checksum calculation to choose from. However, MD5 and SHA-256 are most commonly used.
Vault as the secrets management service
Context: software components use sensitive parameters to connect to external services (e.g. database, authentication service, private keys). The parameters are often provided through environment variables, which might leak if an attacker is able to obtain their values. To harden access to secrets, this architecture suggests HashiCorp Vault for protecting the parameters. The software components would need access to Vault (over HTTPS) for reading their configuration.
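As a sketch of how a component could read its configuration, the example below targets Vault's KV version 2 HTTP API (secrets read from `/v1/secret/data/<path>` with an `X-Vault-Token` header) using only the standard library; the address, token, and secret path in the example call are hypothetical placeholders:

```python
import json
import urllib.request

def extract_kv2_secret(payload):
    """Pull the key/value pairs out of a KV v2 read response.

    The KV v2 engine wraps the stored values in payload["data"]["data"],
    with versioning metadata alongside in payload["data"]["metadata"].
    """
    return payload["data"]["data"]

def read_vault_secret(addr, token, path):
    """Read a secret over HTTPS from Vault's KV v2 engine mounted at 'secret/'."""
    req = urllib.request.Request(
        f"{addr}/v1/secret/data/{path}",
        headers={"X-Vault-Token": token},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_kv2_secret(json.load(resp))

# Hypothetical usage:
# db_config = read_vault_secret("https://vault.example.org:8200", token, "gdi/db")
```

A production component would obtain the token through an auth method (e.g. Kubernetes service-account authentication) rather than from an environment variable, otherwise the original leak risk simply moves to the token.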
Alternatives: develop a custom secrets-management service, though this approach carries more risks, such as the required time and effort.
User Interface
Supported browsers: Chrome, Firefox, Edge, Safari
Context: since GDI is a platform service for multiple organisations, the best way to serve the users is through HTML-based user interfaces. HTML is rendered by web browsers, which have many implementations (including deployed versions). Supporting every browser is clearly quite demanding, while the benefit is questionable. Therefore, the architecture assumes the use of the most common web browsers on the market: Google Chrome, Mozilla Firefox, Microsoft Edge, and Apple Safari. Note that many other browsers, including Edge, are based on the same browser engine as Chrome.
Alternatives: exclude even more browsers by supporting only Chrome; however, this may result in user complaints. On the other hand, supporting additional browsers is questionable, since it takes more time, and there is no metric yet on the user agents used for visiting the website.
Supported devices: desktops, tablets, smartphones
Context: presumably most users access the GDI website and management UIs from their laptop or desktop computers. However, nowadays it is very common that people also visit websites from their hand-held devices (smartphones and tablets). Therefore, while the large-screen desktop appearance of the web-based user interface is critical, the UI should also adapt to small screens, as users will definitely notice issues with that. The recommended approach is to design the website for both desktop and mobile at the same time, considering more than two view breakpoints.
Alternatives: wide-screen displays, low-resolution displays, terminal browsers, legacy mobile phones with really small screens. These are not common cases.
User interface must be served over HTTPS only
Context: for secure and reliable communication with the user, the HTML- and API-based user interface must be served over HTTPS only. This includes referenced resources (CSS, JavaScript, images, documents). The content must not be accessible over HTTP without TLS. In addition, the website must use Content-Security-Policy (CSP) HTTP headers to restrict the sources from which scripts may be loaded and executed.
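One way to apply such headers consistently is sketched below as a minimal WSGI middleware; the specific header values are illustrative defaults, not a prescribed policy for the system:

```python
# Illustrative security headers: enforce HTTPS for a year and restrict
# scripts and other resources to the site's own origin.
SECURITY_HEADERS = [
    ("Strict-Transport-Security", "max-age=31536000; includeSubDomains"),
    ("Content-Security-Policy", "default-src 'self'; script-src 'self'"),
]

def with_security_headers(app):
    """Wrap a WSGI application so every response carries SECURITY_HEADERS."""
    def wrapped(environ, start_response):
        def start(status, headers, exc_info=None):
            # Append the security headers to whatever the app already set.
            return start_response(status, list(headers) + SECURITY_HEADERS, exc_info)
        return app(environ, start)
    return wrapped
```

In a real deployment these headers are often set at the reverse proxy or ingress instead, which keeps the policy in one place for all components.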
Alternatives: none.
Single Page Application for UI
Context: the part of the UI that enables user and data administration is highly dependent on the user session and interface engineering, so traditional static web-page serving is not sufficient. To simplify UI management, the non-static part should be delivered to the user as a JavaScript application that runs right in the user’s browser and handles backend requests itself. This enables a smooth and unified user experience, and reduces complexity on the server side. This frontend application uses the customised API (over the browser’s Fetch API) to handle user-triggered tasks. In addition, the frontend can observe server-side notifications through the WebSocket API (supported by both the browser and the GDI node server).
In the GDI node application, the frontend logic is developed using the VueJS framework. It is easy to learn and a widely used solution that is also well supported by the community. The framework has shown stability over time and is therefore a less risky choice in terms of possible future changes.
Alternatives: there are many frontend framework solutions, from larger and well-known to smaller ones. Among the well-known ones are React, Angular, and Svelte. The team chose VueJS due to its easy-to-learn style, its simplicity, and the team’s familiarity with it.
Deployment
Kubernetes as the production deployment environment
Context: the deployment environment needs to support scalability of hardware and also container-based deployment. Kubernetes as a deployment service can be used to build a cluster consisting of multiple nodes running software in containers. It is a widely used solution that is supported by the HPC centre at the University of Tartu, and also by TEHIK (responsible for the national health-care systems in Estonia). The preferred deployment method is kubectl with application-specific Kubernetes manifests. Deployment may use restricted resources (including Docker images), which may require authorised access.
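Since kubectl accepts JSON manifests as well as YAML, a minimal manifest can be sketched with the standard library alone; the application name, image registry, and replica count below are hypothetical examples:

```python
import json

def deployment_manifest(name, image, replicas=2):
    """Build a minimal Kubernetes Deployment manifest as a JSON-serialisable dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template's labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }

# Written to a file, this can be applied with: kubectl apply -f deployment.json
manifest = json.dumps(deployment_manifest("gdi-node", "registry.example.org/gdi-node:0.1.0"))
```

In practice the manifests are maintained as version-controlled files rather than generated; the sketch only illustrates the structure a Deployment must have.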
Alternatives: Docker Swarm is technically also supported thanks to Docker Compose, but it has not been tested. Kubernetes is generally considered the more advanced deployment platform.
HPC of University of Tartu as the service provider in Estonia
Context: there are many Kubernetes-based infrastructure providers on the market. HPC was considered the most desirable for the following properties:
- Performance – HPC has the best technical resources in Estonia for running demanding data analysis on big data.
- Existing data – Estonian Biobank is already based on HPC, so it’s much easier to integrate the GDI node software with existing bioinformatics practices.
- Helpdesk – HPC has staff capable of managing both technical and system operations (including for Estonian Biobank, so their communication channels are already in place).
- Technical profile – HPC is already providing expected technical services for the GDI node software: Kubernetes, Docker image registry, Minio (S3) storage, and Vault secret management.
Alternatives: deployment in Estonian Biobank (Docker Swarm) or TEHIK (X-Road integration; non-health-care data; data duplication needed) have issues that make the solution more complex and increase the cost.
Database structure migrations as a built-in feature
Context: the database needs extensive configuration (users, schemas, tables, permissions, etc.) and this process needs to be automated. On the other hand, this configuration must also be easy to read, debug, and update. It is possible that the configuration would have to accept some parameters externally (e.g. credentials).
To manage this task, there needs to be a component that automatically runs database scripts as an administrative user and verifies that the database schema matches expectations (i.e., database users can access their tables by the expected column names). The preferred tool for this task is Flyway, which accepts a structured set of SQL files and performs upgrades or downgrades as defined in the scripts.
Database migrations are executed automatically before the new version gets deployed. The changes between two versions need to be backward compatible: while a new version is being deployed, instances of the previous version (expecting the old schema) may keep running until the new version is ready to accept connections.
It is good to have a single database changeset per application version. Pre-1.0 changesets should be combined into a single version 1.0 changeset.
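Flyway's versioned-migration naming convention (`V<version>__<description>.sql`, with version parts separated by dots or underscores) determines execution order. The sketch below shows how such filenames sort numerically rather than lexically; the file names in the usage comment are hypothetical:

```python
import re

# Flyway versioned migrations look like V1__init.sql or V1.1__fix_index.sql.
MIGRATION_NAME = re.compile(r"^V(?P<version>\d+(?:[._]\d+)*)__(?P<description>.+)\.sql$")

def migration_order(filenames):
    """Sort Flyway-style migration filenames by their numeric version parts.

    Plain string sorting would place V10 before V2; comparing tuples of
    integers gives the order Flyway actually applies.
    """
    def key(name):
        match = MIGRATION_NAME.match(name)
        if match is None:
            raise ValueError(f"not a versioned migration: {name}")
        return tuple(int(part) for part in re.split(r"[._]", match.group("version")))
    return sorted(filenames, key=key)
```

Keeping one changeset per application version, as suggested above, means this ordering also mirrors the application's release history.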
Alternatives: Liquibase.