How Big Data Is Powering Precision Medicine

Data science has earned a prominent place on the front lines of precision medicine – the ability to target treatments to the specific physiological makeup of an individual’s disease. As cloud computing services and open-source big data have accelerated the digital transformation, small, agile research labs all over the world can engage in the development of new drug therapies and other innovations. Previously, the necessary open-source databases and high-throughput sequencing technologies were accessible only to large research centers with sufficient processing power. In the evolving big data landscape, startup and emerging biopharma organizations have a unique opportunity to make valuable discoveries in this space.

The drive for real-world data

Through big data, researchers can connect with previously untold volumes of biological data. They can harness the processing power needed to manage and analyze this information, detect disease markers, and understand how to develop treatments targeted to the individual patient. Genomic data alone will likely exceed 40 exabytes by 2025, according to 2015 projections published in the journal PLOS Biology. As data volumes grow and the cost of big data technologies falls, this data becomes more accessible to emerging researchers.

A recent report from Accenture highlights the importance of big data in downstream medicine, specifically oncology. Among surveyed oncologists, 65% said they want to work with pharmaceutical reps who can fluently discuss real-world data, while 51% said they expect they will need to do so in the future. 

The application of artificial intelligence in precision medicine relies on massive databases the software can process and analyze to predict future occurrences. With AI, your teams can quickly assess the validity of data and connect with decision support software that can guide the next research phase. You can find links and trends in voluminous data sets that wouldn’t necessarily be evident in smaller studies. 

Applications of precision medicine

Among the oncologists Accenture surveyed, the most common applications for precision medicine included matching drug therapies to patients’ gene alterations, gene sequencing, liquid biopsy, and clinical decision support. In one example of the power of big data for personalized care, the Cleveland Clinic Brain Study is reviewing two decades of brain data from 200,000 healthy individuals to look for biomarkers that could potentially aid in prevention and treatment. 

AI is also used to create new designs for clinical trials. These programs can identify possible study participants who have a specific gene mutation or meet other granular criteria much faster than a team of researchers could determine this information and gather a group of the necessary size. 

A study published in the journal Cancer Treatment and Research Communications illustrates the impact of big data on cancer treatment modalities. The research team used AI to mine National Cancer Institute medical records and find commonalities that may influence treatment outcomes. They determined that taking certain antidepressant medications correlated with longer survival among the patients included in the dataset, opening the door for targeted research on those drugs as potential lung cancer therapies.

Other common precision medicine applications of big data include:

  • New population-level interventions based on socioeconomic, geographic, and demographic factors that influence health status and disease risk
  • Delivery of enhanced care value by providing targeted diagnoses and treatments to the appropriate patients
  • Flagging adverse reactions to treatments
  • Detection of the underlying cause of illness through data mining
  • Human genomics decoding with technologies such as genome-wide association studies and next-generation sequencing software programs

These examples only scratch the surface of the endless research and development possibilities big data unlocks for start-ups in the biopharma sector. Consult with the team at RCH Solutions to explore custom AI applications and other innovations for your lab, including scalable cloud services for growing biotech and pharma research organizations.

Why You Need an R Expert on Your Team

R enables researchers to leverage reproducible data science environments.

Life science research increasingly depends on robust data science and statistical analysis to generate insight. Today’s discoveries do not owe their existence to single “eureka” moments but the steady analysis of experimental data in reproducible environments over time.

The shift towards data-driven research models depends on new ways to gather and analyze data, particularly in very large datasets. Often, researchers don’t know beforehand whether those datasets will be structured or unstructured or what kind of statistical analysis they need to perform in order to reach research goals.

R has become one of the most popular programming languages in the world of data science because it answers these needs. It provides a clear framework for handling and interpreting large amounts of data. As a result, life science research teams are increasingly investing in R expertise in order to meet ambitious research goals.

How R Supports Life Science Research and Development

R is a programming language and environment designed for statistical computing. It’s often compared to Python because the two share several high-profile characteristics. They are both open-source programming languages that excel at data analysis. 

The key difference is that Python is a general-purpose language, while R was designed specifically for statistical computing and data analysis. It offers researchers a complete ecosystem for data analysis and comes with an impressive variety of packages and libraries built for this purpose. Python’s popularity rests on its being relatively straightforward and easy to learn. R can be more challenging to master, but it offers particularly strong tooling for data visualization and statistical analysis. R has earned its place as one of the best languages for scientific computing because it is interpreted, vector-based, and statistical:

  • As an interpreted language, R runs without a separate compilation step. Researchers can run code directly, line by line, which makes exploring and interpreting data faster and easier.
  • As a vector language, R lets users apply a function to every element of a vector without writing an explicit loop. This makes R code more concise, and often faster, than equivalent loop-based code.
  • As a statistical language, R offers a wide range of data science and visualization libraries ideal for biology, genetics, and other scientific applications.
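
To make the "vector language" point concrete, here is a minimal sketch in base R. The expression values are made-up numbers chosen for illustration, not data from any study; the sketch compares an explicit loop with the equivalent vectorized call.

```r
# Hypothetical expression values for five samples (illustrative numbers only).
expression <- c(12.5, 48.0, 3.2, 150.7, 0.9)

# Loop version: transform one element at a time.
log_loop <- numeric(length(expression))
for (i in seq_along(expression)) {
  log_loop[i] <- log2(expression[i] + 1)
}

# Vectorized version: the same transformation applied to the whole vector.
log_vec <- log2(expression + 1)

# Both approaches produce the same values.
stopifnot(isTRUE(all.equal(log_loop, log_vec)))

# Comparison and subsetting are vectorized as well: keep values above 10.
high <- expression[expression > 10]
print(high)
```

The vectorized form is shorter and leaves less room for indexing mistakes, which is usually the more practical benefit in day-to-day analysis.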

While the concept behind R is simple enough for some users to get results by learning on the fly, many of its most valuable functions are also its most complex. Life science research teams that employ R experts are well-positioned to address major pain points associated with using R while maximizing the research benefits it provides.

Challenges Life Science Research Teams Face When Implementing R

A large number of life science research teams already use R to some degree. However, fully optimized R implementation is rare in the life science industry. Many teams face steep challenges when obtaining data-driven research results using R:

  1. Maintaining Multiple Versions of R Packages Can Be Complex

Reproducibility is the defining component of scientific research and software development. Source code control systems make it easy for developers to track and manage different versions of their software, fix bugs, and add new features. However, distributed versioning is much more challenging when third-party libraries and components are involved. 

Any R application or script will draw from R’s rich package ecosystem. These packages do not always follow any formal management system. Some come with extensive documentation, and others simply don’t. As developers update their R packages, they may inadvertently break dependencies that certain users rely on. Researchers who try to reproduce results using updated packages may get inaccurate outcomes.

Several high-profile R developers have engineered solutions to this problem. RStudio’s Packrat is a dependency management system for R that lets users isolate and reproduce environments, allowing for easy version control and collaboration between users.

Installing a dependency management system like Packrat can help life science researchers improve their ability to manage R package versions and ensure reproducibility across multiple environments. Life science research teams that employ R experts can make use of this and many other management tools that guarantee smooth, uninterrupted data science workflows.
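
Even before adopting a dependency manager, a script can check that the packages it relies on meet the versions it was validated against. The sketch below uses only base R; the package names and minimum versions are hypothetical placeholders, and the helper check_versions() is an illustrative name, not part of any library.

```r
# Hypothetical minimum versions this analysis was validated against.
expected <- c(stats = "3.6.0", utils = "3.6.0")

# Compare installed versions with the expected minimums.
check_versions <- function(expected) {
  installed <- vapply(
    names(expected),
    function(pkg) as.character(packageVersion(pkg)),
    character(1)
  )
  ok <- mapply(
    function(have, want) package_version(have) >= package_version(want),
    installed, expected
  )
  data.frame(
    package = names(expected),
    installed = unname(installed),
    expected = unname(expected),
    ok = unname(ok),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

report <- check_versions(expected)
print(report)
```

A check like this only detects drift; it does not restore the original versions, which is the gap a tool like Packrat is meant to close.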

  2. Managing and Administrating R Environments Can Be Difficult

There is often a tradeoff between the amount of time spent setting up an R environment and its overall reproducibility. It’s relatively easy to create a programming environment optimized for a particular research task in R with minimal upfront setup time. However, that environment may not end up being easily manageable or reproducible as a result.

It is possible for developers to go back and improve the reproducibility of an ad-hoc project after the fact. This is a common part of the R workflow in many life science research organizations and a critical part of production analysis. However, it’s a suboptimal use of research time and resources that could be better spent on generating new discoveries and insights.

Optimizing the time users spend creating R environments requires considering the eventual reproducibility needs of each environment on a case-by-case basis: 

  • An ad-hoc exploration may not need any upfront setup since reproduction is unlikely. 
  • If an exploration begins to stabilize, users can establish a minimally reproducible environment by recording session information with a utility such as session_info (for example, sessioninfo::session_info() or base R’s sessionInfo()). It will still take some effort for a future user to rebuild the dependency tree from here.
  • For environments that are likely candidates for reproduction, bringing in a dependency management solution like Packrat from the very beginning ensures a high degree of reproducibility.
  • For maximum reproducibility, configuring and deploying containers using a solution like Docker guarantees all dependencies are tracked and saved from the start. This requires a significant amount of upfront setup time but ensures a perfectly reproducible, collaboration-friendly environment in R.
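
The two middle tiers above can be sketched in a few lines of R. This is an illustration under stated assumptions rather than a recommended setup: "session-info.txt" is a hypothetical file name, start_packrat() is an illustrative helper, and the Packrat step is wrapped in a function so that nothing is installed or modified until it is called.

```r
# Minimal reproducibility: record the R version, platform, and loaded
# package versions. sessionInfo() ships with base R.
info <- sessionInfo()
writeLines(capture.output(print(info)), "session-info.txt")

# Higher reproducibility: project-level dependency management with Packrat.
start_packrat <- function(project = ".") {
  if (!requireNamespace("packrat", quietly = TRUE)) {
    stop("Packrat is not installed; run install.packages('packrat') first.")
  }
  # Creates a private package library for the project and an initial snapshot.
  packrat::init(project = project)
}

# Later, after adding or updating packages:   packrat::snapshot()
# On another machine, to rebuild the library:  packrat::restore()
```

The container tier is configured outside R (for example, in a Dockerfile), so it is not shown here.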

Identifying the degree of reproducibility each R environment should have requires a great degree of experience working within R’s framework. Expert scientific computing consultants can play a vital role in helping researchers identify the optimal solution for establishing R environments.

  3. Some Packages Are Complicated and Time-Consuming to Install

R packages are getting larger and more complex, which significantly impacts installation time. Many research teams put considerable effort into minimizing the amount of time and effort spent on installing new R packages. 

This can become a major pain point for organizations that rely on continuous integration (CI) services like Travis CI or GitLab CI. The longer it takes to get feedback from your CI pipeline, the slower your overall development process runs. Optimized CI pipelines can help researchers spend less time waiting for packages to install and more time doing research.

Combined with version management problems, package installation issues can significantly drag down productivity. Researchers may need to install and test multiple different versions of the same package before arriving at the expected result. Even if a single installation takes ten minutes, that time quickly adds up.

There are several ways to optimize R package installation processes. Research organizations that frequently install packages directly from source code may be able to use a cache utility to reduce compiling time. Advanced versioning and package management solutions can reduce package installation times even further.
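
As one small example of this kind of optimization, a script can skip packages that are already installed and let R build the remaining source packages in parallel. This is a sketch, not the approach the article has in mind: the helper names missing_packages() and install_missing() are illustrative, while the Ncpus argument is a standard option of install.packages().

```r
# Return the subset of `pkgs` not yet present in the active library paths.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Install only what is missing; Ncpus allows parallel installation of
# more than one source package.
install_missing <- function(pkgs, cores = 2L) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, Ncpus = cores)
  }
  invisible(to_install)
}

# Packages that ship with R are already installed, so nothing is missing here.
print(missing_packages(c("stats", "utils")))
```

In a CI pipeline, the same check pairs naturally with a cached package library, so unchanged dependencies are not rebuilt on every run.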

  4. Troubleshooting R Takes Up Valuable Research Time

While R is simple enough for scientific personnel to learn and use quickly, advanced scientific use cases can become incredibly complex. When this happens, the likelihood of generating errors is high. Troubleshooting errors in R can be a difficult and time-consuming task and is one of the most easily preventable pain points that come with using R.

Scientific research teams that choose to contract scientific computing specialists with experience in R can bypass many of these errors. Having an R expert on board and ready to answer your questions can mean the difference between spending hours resolving a frustrating error code and simply establishing a successful workflow from the start.

R has a dynamic and highly active community, but complex life science research errors may be well outside the scope of R troubleshooting. In environments with strict compliance and cybersecurity rules in place, you may not be able to simply post your session_info data on a public forum and ask for help.

Life science research organizations need to employ R experts to help solve difficult problems, optimize data science workflows, and improve research outcomes. Reducing the amount of time researchers spend attempting to resolve error codes is key to maximizing their scientific output.

RCH Solutions Provides Central Management for Scientific Workflows in R

Life science research firms that rely on scientific computing partners like RCH Solutions can free up valuable research resources while gaining access to R expertise they would not otherwise have. A specialized team of scientific computing experts with experience using R can help life science teams alleviate the pain points described above.

Life science researchers bring a wide range of incredibly valuable scientific expertise to their organizations. This expertise may be grounded in biology, genetics, chemistry, or many other disciplines, but it does not necessarily come with a great deal of experience in performing data science in R. Scientists can do research without a great deal of R knowledge – if they have a reliable scientific computing partner.

RCH Solutions allows life science researchers to centrally manage R packages and libraries. This enables research workflows to make efficient use of data science techniques without costing valuable researcher time or resources. 

Without central management, researchers are likely to spend a great deal of time trying to install redundant packages. Having multiple users spend time installing large, complex R packages on the same devices is an inefficient use of valuable research resources. Central management prevents users from having to reinvent the wheel every time they want to create a new environment in R.

Optimize Your Life Science Research Workflow with RCH Solutions

Contracting a scientific computing partner like RCH Solutions means your life science research workflow will always adhere to the latest and most efficient data practices for working in R. Centralized management of R packages and libraries ensures the right infrastructure and tools are in place when researchers need to create R environments and run data analyses.

Find out how RCH Solutions can help you build and implement the appropriate management solution for your life science research applications and optimize deployments in R. We can aid you in ensuring reproducibility in data science applications. Talk to our specialists about your data visualization and analytics needs today.