Containerization: The New Standard for Reproducible Scientific Computing

Cloud Computing | 3 Min Read

Yogesh Phulke
Senior Cloud Engineer
April 13, 2021

Containers resolve deployment and reproducibility issues in Life Science computing.

Bioinformatics software and scientific computing applications are crucial parts of the Life Science workflow. Researchers increasingly depend on third-party software to generate insights and advance their research goals.

These third-party applications are updated frequently. While updates may improve functionality, they can also change how an analysis behaves or break an existing pipeline, which impedes scientific progress.

Research pipelines that rely on computationally intensive methodologies are often not easily reproducible. This is a significant challenge for scientific advancement in the Life Sciences, where replicating experimental results – and the insights gleaned from analyzing those results – is key to scientific progress.

The Reproducibility Problem Explained

For Life Science researchers, reproducibility falls into four major categories:

Direct Replication is the effort to reproduce a previously observed result using the same experimental conditions and design as an earlier study.

Analytic Replication aims to reproduce scientific findings by subjecting an earlier data set to new analysis.

Systemic Replication attempts to reproduce a published scientific finding under different experimental conditions.

Conceptual Replication evaluates the validity of an experimental phenomenon using a different set of experimental conditions.

Researchers face greater challenges in some of these categories than in others. Improved training and policy can help make direct and analytic replication more accessible. Systemic and conceptual replication are significantly harder to address effectively.

These challenges are not new. They have been impacting research efficiency for years. In 2016, Nature published a survey of more than 1,500 researchers in which more than 70% reported that they had tried and failed to reproduce another scientist’s experiments.

There are multiple factors responsible for the ongoing “reproducibility crisis” facing the life sciences. One of the most important challenges scientists need to overcome is the inability to easily assemble software tools and their associated libraries into research pipelines.

This problem doesn’t fall neatly into one of the categories above, but it impacts each one of them differently. Computational reproducibility forms the foundation that direct, analytic, systemic, and conceptual replication techniques all rely on.

Challenges to Computational Reproducibility

Advances in computational technology have enabled scientists to generate large, complex data sets during research. Analyzing and interpreting this data often depends heavily on specific software tools, libraries, and computational workflows.

It is not enough to reproduce a biotech experiment on its own. Researchers must also reproduce the original analysis, using the computational techniques that previous researchers used, and do so in the same computing environment. Every step of the research pipeline has to conform with the original study in order to truly test whether a result is reproducible or not.
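As a rough illustration of what "the same computing environment" means in practice, the sketch below records the Python version, operating system, and installed package versions, then lists the differences between two such snapshots. It covers only a Python-based analysis, and the function names `capture_environment` and `diff_environments` are hypothetical names chosen for this example rather than part of any established tool.

```python
import platform
from importlib import metadata


def capture_environment():
    """Return a dictionary describing the current Python computing environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name.lower()] = dist.version
    return {
        "python": platform.python_version(),
        "os": platform.system(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


def diff_environments(original, current):
    """List human-readable differences between two environment snapshots."""
    differences = []
    # Compare the interpreter and platform fields first.
    for key in ("python", "os", "machine"):
        if original.get(key) != current.get(key):
            differences.append(f"{key}: {original.get(key)} -> {current.get(key)}")
    # Then compare every package that appears in either snapshot.
    old_pkgs = original.get("packages", {})
    new_pkgs = current.get("packages", {})
    for name in sorted(set(old_pkgs) | set(new_pkgs)):
        if old_pkgs.get(name) != new_pkgs.get(name):
            differences.append(f"{name}: {old_pkgs.get(name)} -> {new_pkgs.get(name)}")
    return differences
```

A snapshot saved alongside a published analysis could later be compared with the environment of a lab attempting replication. Any non-empty list from `diff_environments` marks a place where the two pipelines may diverge.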

This is where advances in bioinformatic technology present a bottleneck to scientific reproducibility. Researchers cannot always assume they will have access to (or expertise in) the technologies used by the scientists whose work they wish to reproduce. As a result, achieving computational reproducibility turns into a difficult, expensive, and time-consuming experience – if it’s feasible at all.

How Containerization Enables Reproducibility

Put simply, a container consists of an entire runtime environment: an application plus the dependencies, libraries, other binaries, and configuration files needed to run it, all bundled into one package. Containerizing the application platform and its dependencies abstracts away differences in OS distributions and underlying infrastructure.
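To make the "bundled into one package" idea concrete, here is a minimal sketch that writes out a container recipe in Dockerfile syntax. This article does not name a container tool, so Docker is an assumption here. The base image tag, the package pins, and the script name `analyze.py` are placeholders for illustration, and `render_dockerfile` is a hypothetical helper.

```python
def render_dockerfile(base_image, pinned_packages, script_name):
    """Build the text of a Dockerfile that pins the runtime and its libraries."""
    lines = [
        f"FROM {base_image}",   # fixed base image: OS layer plus interpreter
        "WORKDIR /analysis",
    ]
    if pinned_packages:
        # Exact version pins keep later upstream releases out of the image.
        pins = " ".join(
            f"{name}=={version}" for name, version in sorted(pinned_packages.items())
        )
        lines.append(f"RUN pip install --no-cache-dir {pins}")
    lines.append(f"COPY {script_name} .")
    lines.append(f'ENTRYPOINT ["python", "{script_name}"]')
    return "\n".join(lines) + "\n"


# Illustrative values only; a real analysis would list its own dependencies.
recipe = render_dockerfile("python:3.11-slim", {"numpy": "1.26.4"}, "analyze.py")
print(recipe)
```

The point of the sketch is that every input to the environment is written down explicitly: the base image, each library version, and the entry point. An image built from such a recipe carries those choices with it instead of inheriting them from whatever machine runs the analysis.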

If a researcher publishes experimental results and provides a containerized copy of the application used to analyze those results, other scientists can immediately reproduce those results with the same data. Likewise, future generations of scientists will be able to do the same regardless of upcoming changes to computing infrastructure.
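Reproducing a result "with the same data" also requires confirming that the data really is the same. One common approach, sketched below as an assumption rather than a practice this article prescribes, is to publish a SHA-256 checksum with the data set and check it before running the containerized analysis. The helper names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for very large data sets.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_same_data(path, published_checksum):
    """Return True when the local file matches the published checksum."""
    return sha256_of_file(path) == published_checksum.strip().lower()
```

If `is_same_data` returns False, any difference in results may come from the input rather than from the software environment, so the comparison is worth running before drawing conclusions about reproducibility.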

Containerized experimental analyses enable life scientists to benefit from the work of their peers and contribute their own in a meaningful way. Packaging complex computational methodologies into a single, reproducible container helps ensure that any scientist can achieve the same results with the same data.

Bringing Containerization to the Life Science Research Workflow

Life Science researchers will only enjoy the true benefits of containerization once the process itself is automatic and straightforward. Biotech and pharmaceutical research organizations cannot expect their researchers to manage software dependencies, isolate analyses away from local computational environments, and virtualize entire scientific processes for portability while also doing cutting-edge scientific research.

Scientists need to be able to focus on the research they do best while resting assured that their discoveries and insights will be recorded in a reproducible way. Choosing the right technology stack for reproducibility is a job for an experienced biotech IT consultant with expertise in developing R&D workflows for the biotech and pharmaceutical industries.

RCH Solutions helps Life Science researchers develop and implement container strategies that enable scalable reproducibility. If you’re interested in exploring how a container strategy can support your lab’s ability to grow, contact our team to learn more.

RCH Solutions is a global provider of computational science expertise, helping Life Sciences and Healthcare firms of all sizes clear the path to discovery for nearly 30 years. If you’re interested in learning how RCH can support your goals, get in touch with us here.


