Containerization: The New Standard for Reproducible Scientific Computing

Containers resolve deployment and reproducibility issues in life science computing.

By Yogesh Phulke

Senior Cloud Engineer, RCH Solutions

Bioinformatics software and scientific computing applications are crucial parts of the life science workflow. Researchers increasingly depend on third-party software to generate insights and advance their research goals.

These third-party software applications typically undergo frequent changes and updates. While these updates may improve functionality, they can also impede scientific progress by making earlier analyses harder to rerun.

Research pipelines that rely on computationally intensive methodologies are often not easily reproducible. This is a significant challenge for scientific advancement in the life sciences, where replicating experimental results – and the insights gleaned from analyzing those results – is key to scientific progress.

The Reproducibility Problem Explained

For life science researchers, reproducibility falls into four major categories:

  • Direct Replication is the effort to reproduce a previously observed result using the same experimental conditions and design as an earlier study.
  • Analytic Replication aims to reproduce scientific findings by subjecting an earlier data set to new analysis.
  • Systemic Replication attempts to reproduce a published scientific finding under different experimental conditions.
  • Conceptual Replication evaluates the validity of an experimental phenomenon using a different set of experimental conditions.

Researchers face greater challenges in some of these categories than in others. Improved training and policy can help make direct and analytic replication more accessible. Systemic and conceptual replication are significantly harder to address effectively.

These challenges are not new. They have been impacting research efficiency for years. In 2016, Nature published a survey of roughly 1,500 researchers in which more than 70% reported that they had tried and failed to reproduce another scientist’s experiments.

There are multiple factors responsible for the ongoing “reproducibility crisis” facing the life sciences. One of the most important challenges scientists need to overcome is the inability to easily assemble software tools and their associated libraries into research pipelines.

This problem doesn’t fall neatly into one of the categories above, but it impacts each one of them differently. Computational reproducibility forms the foundation that direct, analytic, systemic, and conceptual replication techniques all rely on.

Challenges to Computational Reproducibility

Advances in computational technology have enabled scientists to generate large, complex data sets during research. Analyzing and interpreting this data often depends heavily on specific software tools, libraries, and computational workflows.

It is not enough to reproduce a biotech experiment on its own. Researchers must also reproduce the original analysis, using the same computational techniques that the previous researchers used, and do so in the same computing environment. Every step of the research pipeline has to conform to the original study in order to truly test whether a result is reproducible.
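As a rough illustration of what "the same computing environment" can mean in practice, the sketch below records the interpreter version, operating system, and installed package versions, then reports any differences between two such records. It is a minimal sketch that assumes a Python-based analysis; the function names environment_fingerprint and diff_environments are hypothetical and not part of any tool discussed in this article.

```python
import platform
from importlib import metadata


def environment_fingerprint():
    """Collect interpreter, OS, and installed-package versions as a plain dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip malformed distributions that carry no name
            packages[name.lower()] = dist.version
    return {
        "python": platform.python_version(),
        "system": platform.system(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


def diff_environments(original, current):
    """Return {key: (original_value, current_value)} for every mismatch."""
    mismatches = {}
    for key in ("python", "system", "machine"):
        if original.get(key) != current.get(key):
            mismatches[key] = (original.get(key), current.get(key))
    orig_pkgs = original.get("packages", {})
    curr_pkgs = current.get("packages", {})
    # A package missing on one side shows up as None in the tuple.
    for name in sorted(set(orig_pkgs) | set(curr_pkgs)):
        if orig_pkgs.get(name) != curr_pkgs.get(name):
            mismatches["package:" + name] = (orig_pkgs.get(name), curr_pkgs.get(name))
    return mismatches
```

A fingerprint saved alongside published results gives later researchers something concrete to compare against. It only detects drift, though; it does not recreate the original environment, which is the gap containers are meant to close.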

This is where advances in bioinformatic technology present a bottleneck to scientific reproducibility. Researchers cannot always assume they will have access to (or expertise in) the technologies used by the scientists whose work they wish to reproduce. As a result, achieving computational reproducibility turns into a difficult, expensive, and time-consuming experience – if it’s feasible at all.

How Containerization Enables Reproducibility

Put simply, a container consists of an entire runtime environment: an application plus the dependencies, libraries, other binaries, and configuration files needed to run it, all bundled into one package. Containerizing the application platform and its dependencies abstracts away differences in OS distributions and underlying infrastructure.
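To make the idea of bundling an application with its dependencies more concrete, the sketch below generates the text of a container build recipe from a list of exactly pinned packages. It assumes Docker as the container tool, which this article does not specify, and every name in it is illustrative: render_dockerfile is a hypothetical helper, and the base image, package versions, and analyze.py script in the usage note are placeholders.

```python
def render_dockerfile(base_image, pinned_packages, entry_script):
    """Return Dockerfile text bundling one script with exact-version dependencies."""
    for name, version in pinned_packages.items():
        if not version:
            # An unpinned dependency would let the environment drift over time.
            raise ValueError(f"{name} has no pinned version")
    lines = [f"FROM {base_image}", "WORKDIR /app"]
    if pinned_packages:
        requirements = " ".join(
            f"{name}=={version}" for name, version in sorted(pinned_packages.items())
        )
        lines.append(f"RUN pip install --no-cache-dir {requirements}")
    lines.append(f"COPY {entry_script} /app/{entry_script}")
    lines.append(f'ENTRYPOINT ["python", "/app/{entry_script}"]')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_dockerfile(
        "python:3.11-slim",
        {"numpy": "1.26.4", "pandas": "2.2.2"},
        "analyze.py",
    ))
```

The point of the design is that nothing is left implicit: the base image, every dependency version, and the entry point are all written down in one place, so the resulting image does not depend on whatever happens to be installed on the host.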

If a researcher publishes experimental results and provides a containerized copy of the application used to analyze those results, other scientists can immediately reproduce those results with the same data. Likewise, future generations of scientists will be able to do the same regardless of upcoming changes to computing infrastructure. 

Containerized experimental analyses enable life scientists to benefit from the work of their peers and contribute their own in a meaningful way. Packaging complex computational methodologies into a single, reproducible container ensures that any scientist can achieve the same results with the same data.
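One simple way to check the claim of "same results with the same data" is to compare checksums of the output files from the original run and the rerun. The sketch below is a minimal example under that assumption; the helper names sha256_of, build_manifest, and compare_manifests are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(output_dir):
    """Map each file's path (relative to output_dir) to its SHA-256 digest."""
    root = Path(output_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(published, reproduced):
    """Return sorted relative paths that differ or exist on one side only."""
    return sorted(
        name
        for name in set(published) | set(reproduced)
        if published.get(name) != reproduced.get(name)
    )
```

An empty list from compare_manifests means the rerun produced byte-identical files. A mismatch does not always mean the science differs: outputs that embed timestamps or depend on unseeded random numbers can vary between runs even inside the same container, so those cases may need a looser, domain-specific comparison.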

Bringing Containerization to the Life Science Research Workflow

Life science researchers will only enjoy the true benefits of containerization once the process itself is automatic and straightforward. Biotech and pharmaceutical research organizations cannot expect their researchers to manage software dependencies, isolate analyses away from local computational environments, and virtualize entire scientific processes for portability while also doing cutting-edge scientific research.

Scientists need to be able to focus on the research they do best while resting assured that their discoveries and insights will be recorded in a reproducible way. Choosing the right technology stack for reproducibility is a job for an experienced biotech IT consultant with expertise in developing R&D workflows for the biotech and pharmaceutical industries.

RCH Solutions helps life science researchers develop and implement container strategies that enable scalable reproducibility. If you’re interested in exploring how a container strategy can support your lab’s ability to grow, contact our team to learn more.

©Copyright 2020 RCH Solutions