EMC Isilon & RCH Solutions
This case study describes how Sanofi R&D (Sanofi) addressed its scientific computing challenges by building out a next generation High Performance Computing (HPC) hub for its R&D activities. The case study also highlights how RCH Solutions implemented EMC Isilon NAS to address the next generation HPC hub requirements.
Copyright © 2014 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.
The information in this publication is provided “as is.” EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
All other trademarks used herein are the property of their respective owners.
Part Number H12929
New drugs introduced today in a therapeutic category such as cardiovasculars maintain their competitive advantage for less than two years, compared with more than 10 years during the 1970s. In addition to the competitive landscape, the pharmaceutical industry is under increased fiscal pressure from a range of issues, including decreased revenues from patent losses, changes in healthcare policy, and increased regulatory requirements.
To successfully compete in this challenging business climate, pharmaceutical and biotechnology companies are executing strategies that balance research innovation against the need to contain Research & Development (R&D) costs. To spur innovation and disrupt the competitive advantage status quo, IT organizations are re-inventing high performance computing in the pharmaceutical R&D environment.
Blocked By Dispersed Data And Computing
Many pharmaceutical companies today are made up of many independent R&D departments. At each of these departments, research teams no longer work at the lab bench but work in silico generating massive amounts of data from a variety of laboratory instruments designed for next generation sequencing, liquid chromatography – mass spectrometry (LC-MS), x-ray crystallography, electron microscopy and high-content screening workflows. The data generated from these workflows is measured in terabytes to petabytes. It is also not uncommon for each of these research departments to host and maintain their own high performance computing (HPC) and storage environments designed to support the primary workflows of that department.
Although operating independently was once viewed as a competitive advantage, now the silos of data and compute are a disadvantage. To derive the maximum value from research data, today’s data mining and analysis requires sharing it between departments and external collaborators, combining it with data sourced from public resources like The Cancer Genome Atlas (TCGA), and moving selected data sets to available computing resources best matched for the desired analysis.
Unfortunately, dispersed data and compute pose operational challenges. Networking between sites may be insufficient, or very expensive, for moving terabyte-sized data sets in a reasonable amount of time. Because each research department has “unique” needs, additional staff is required to maintain and support the heterogeneous computing hardware and software environments. Extra resources are also needed to manage technology vendors and product support contracts. It is also not uncommon for research groups to compete for priority access to computing resources, which can lead to underutilized processors or oversubscribed services. Competing for access also leads teams to set up computing and storage resources independent of IT (aka “Shadow IT”), which may lead to governance, risk and compliance challenges for the organization at large.
Sanofi R&D (Sanofi) is one company that successfully re-invented its scientific computing capabilities to foster research innovation. Sanofi addressed its R&D data and compute challenges by creating a central, shared computing and storage platform that could scale capacity and performance with predictable cost. The design and implementation of this next generation computing resource at Sanofi included the services of an experienced scientific computing partner, RCH Solutions (RCH).
Before building this shared computing resource, Sanofi and RCH initiated an internal program to examine the existing High Performance Computing (HPC) capabilities at Sanofi. The examination used a holistic approach that evaluated requirements for Symmetric Multi-Processing (SMP), high-performance / graphical / 3D workstations, data storage, and networking.
Looking Beyond Today
RCH has been providing managed services supporting research computing at Sanofi since 2006 and was able to provide detailed information, inventories, and historical trends for HPC deployments in the Americas and Europe. But the scope of the HPC assessment at Sanofi was far broader than a typical technology refresh. The assessment factored in objectives outlined in longer-term R&D plans, interviews with the majority of in silico researchers and their management, a review of applications in use, and evaluations of Cloud-based technologies. The scope of the examination also included a focus on emerging and disruptive technologies that were likely to become part of a next generation, in silico Research Computing platform.
The results of this comprehensive examination were funneled directly into an initiative to design and implement a central, next generation HPC platform for Research Computing at Sanofi.
Tackling Data Location
The key factor in the design, along with Sanofi’s desire to achieve scale via a shared platform, was data management and movement. Even with a commitment to build out a high-performing Metro-Area Network (MAN), moving terabytes of data between repositories would consume precious time in the fast-moving R&D world and result in duplication of large datasets. To address this challenge, the assessment proposed using scale-out Network Attached Storage (NAS) technology from EMC Isilon.
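The cost of moving terabytes between repositories can be illustrated with simple arithmetic. The sketch below is not taken from the Sanofi assessment: the function name, the 10 TB example size, and the 70% link-efficiency factor are assumptions chosen only to show the shape of the calculation.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to move size_tb terabytes over one network link.

    size_tb    -- data set size in decimal terabytes (1 TB = 1e12 bytes)
    link_gbps  -- nominal link speed in gigabits per second
    efficiency -- assumed fraction of nominal bandwidth actually achieved
                  (hypothetical value, not a measured figure)
    """
    if size_tb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link speed > 0, efficiency in (0, 1]")
    bits_to_move = size_tb * 1e12 * 8              # terabytes -> bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_second / 3600


if __name__ == "__main__":
    # Compare a 1 Gb/s and a 10 Gb/s link for a hypothetical 10 TB data set.
    for gbps in (1, 10):
        print(f"10 TB over {gbps} Gb/s: {transfer_hours(10, gbps):.1f} h")
```

Under these assumed numbers, a 10 TB data set needs roughly 32 hours on a 1 Gb/s link and roughly 3 hours on a 10 Gb/s link, which is why keeping data in one shared repository next to the compute is attractive.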
RCH first introduced Isilon NAS technology at a Sanofi research facility in Cambridge, MA in 2009. The initial storage cluster was only several terabytes, but it eventually grew to over 5 petabytes. The Isilon storage cluster not only scaled out in capacity, but also scaled out the performance required by HPC jobs. The Isilon cluster also provided a single name space managed by 0.25 Full Time Equivalent (FTE). With the success of Isilon technology supporting Research Computing at Sanofi (Cambridge), the design decision was made to again use the Isilon technology for what would become the Research Computing platform for Sanofi in the US: the “Boston Hub.” To support the centralization of EMC Isilon NAS storage and to achieve the desired levels of performance, the Boston Hub required the build-out of a fault-tolerant 10-gigabit Ethernet network.
Servicing A Spectrum Of HPC Workloads
It was also very clear that the Boston Hub would need to provide computing that could address a spectrum of HPC workloads, ranging from embarrassingly parallel to fine-grained jobs.
The design and implementation included server technology from Hewlett-Packard (HP), an enterprise-wide compute standard for a number of RCH customers. HP compute technology was configured as a Beowulf cluster (core count in excess of 1,000). There were additional servers adjacent to the cluster for application-specific needs, all with 10-gigabit Ethernet connectivity to the EMC Isilon NAS. Servers in the Beowulf cluster were configured for grids managed by Open Grid Scheduler (OGS) as the Distributed Resource Manager (DRM). Platform Computing’s Load Sharing Facility (LSF) is also available where niche requirements mandate that LSF act as the DRM.
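An embarrassingly parallel workload is one whose pieces can run independently, which maps naturally onto a Grid Engine array job. The following is a minimal sketch of how one task of such a job might pick its share of the work. It assumes the job was submitted as an array job, so that the scheduler sets the `SGE_TASK_ID` and `SGE_TASK_LAST` environment variables. The sample file names and the helper names are hypothetical and do not come from the Sanofi deployment.

```python
import os


def slice_for_task(items, task_id, n_tasks):
    """Return the items handled by 1-based task_id out of n_tasks (round-robin)."""
    if not 1 <= task_id <= n_tasks:
        raise ValueError("task_id must be between 1 and n_tasks")
    return items[task_id - 1::n_tasks]


def env_int(name, default):
    """Read an integer environment variable, falling back to default.

    Grid Engine sets the task variables to the string 'undefined' for
    non-array jobs, so anything non-numeric is treated as missing.
    """
    value = os.environ.get(name, "")
    return int(value) if value.isdigit() else default


if __name__ == "__main__":
    # Hypothetical input list; a real job would list files on the shared NAS.
    samples = [f"sample_{i:03d}.dat" for i in range(10)]
    task_id = env_int("SGE_TASK_ID", 1)
    n_tasks = env_int("SGE_TASK_LAST", 1)
    for sample in slice_for_task(samples, task_id, n_tasks):
        print(f"task {task_id}/{n_tasks} would process {sample}")
```

Because each task only needs its own slice, no communication between tasks is required. Fine-grained jobs, by contrast, exchange data while they run and depend far more on interconnect latency, so this pattern does not apply to them.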
R&D is a dynamic environment and teams are constantly iterating on new approaches to answer questions in silico. These approaches depend on the ability to access, retrieve, manage and move data using a variety of protocols like SMB and NFS. During the assessment, Hadoop was starting to emerge as a new option for accessing and computing on data. Although there was no immediate use for Hadoop, support for it was a functional requirement for the Hub. Because the Isilon NAS provides native support for multiple protocols, including the Hadoop Distributed File System (HDFS), the use of Isilon in the Hub’s design ensured that it could support the rapid introduction of Hadoop at a later date without re-architecting the platform.
Finally, a separate dedicated Internet connection was specified and provisioned for Research Computing in the Boston Hub to facilitate data exchange with outside data sources and collaborators. Included with this was the evaluation of network accelerator technologies (Riverbed, Aspera, etc.).
The centralized shared platform to support Sanofi’s US Research Computing was brought on-line for early adopters in December 2012. Research teams and projects incrementally moved to the new shared platform during the first half of 2013. By its first birthday, the Boston Hub had become a multi-tenant compute and storage cluster that has run in excess of five million jobs. Its scale provides vast improvements in throughput compared to the dispersed pockets of legacy HPC deployments, and it is enabling the creation of compute workflows not previously possible.
For Sanofi, the Boston Hub is now the “center of gravity” for Research Computing, and Isilon NAS with HPC compute are its foundation. The multi-petabyte NAS employs a combination of Isilon X and NL nodes to support data performance and density requirements. With the success of this new platform and the Managed Services provided by RCH, Sanofi is provisioning more Isilon nodes and compute to support the organic growth of the Hub.
RCH, leveraging EMC Isilon scale-out NAS technology, helped Sanofi achieve its goal of establishing a centralized Research Computing center. The new functionality meets today’s research requirements and, at the same time, is ready to scale and adapt to tomorrow’s systems, storage, network, and application needs using only a handful of administrators.
Data management concerns are eased by managing data in a large single name space, where data is automatically tiered between higher performing components and more dense components. This process is based on policies, and data is readily available to compute resources via high-speed networks. This new platform will not be a limiting factor for Sanofi in achieving research innovation.
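Policy-based tiering of this kind can be pictured as a rule that maps each file to a tier. The sketch below is illustrative only. It is not Isilon policy syntax, and the source does not state which criteria Sanofi's policies use. The last-access criterion, the 90-day threshold, and the tier names "performance" and "capacity" are all assumptions.

```python
def choose_tier(days_since_access: int, threshold_days: int = 90) -> str:
    """Pick a tier from file age: recently used data stays on the faster tier."""
    if days_since_access < 0:
        raise ValueError("days_since_access must be non-negative")
    return "performance" if days_since_access < threshold_days else "capacity"


def plan_moves(files: dict, threshold_days: int = 90) -> dict:
    """Return {path: target_tier} for files whose current tier differs from policy.

    files maps a path to a (current_tier, days_since_access) tuple.
    """
    moves = {}
    for path, (current_tier, days_since_access) in files.items():
        target = choose_tier(days_since_access, threshold_days)
        if target != current_tier:
            moves[path] = target
    return moves


if __name__ == "__main__":
    # Hypothetical paths inside one name space; only the tier changes, not the path.
    inventory = {
        "/research/run_001/reads.dat": ("performance", 200),
        "/research/run_002/reads.dat": ("performance", 3),
        "/research/archive/old.dat": ("capacity", 400),
    }
    for path, tier in plan_moves(inventory).items():
        print(f"{path} -> {tier}")
```

The point of the design is that paths stay the same when data moves between tiers, so researchers and compute jobs keep using one name space while the policy decides where the data physically lives.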
About RCH Solutions
RCH Solutions is a Managed Services Provider and a Systems Integrator with a specific focus and expertise in Research Computing. Since 1991, RCH has provided solutions that include hardware, software, and services for Life Sciences and Healthcare companies. RCH has helped our customers achieve measurable performance results and cost savings by acting as experienced liaisons between the research teams and the information technology groups that support them. Our specific focus and experience in Research Computing has allowed RCH to become the trusted partner and advisor of organizations that see value in Managed Services and System Integration expertise. As subject matter experts, we understand the entire research computing stack while ensuring comprehensive ownership, accountability and completeness. We believe the most effective and efficient model to support Research Computing is nimble, transparent, responsive and adaptive to the unique needs of the scientists. RCH has a thorough understanding of, and deep domain expertise in, the many challenges of Life Sciences, which allows us to help solve scientific and business problems with respect to research computing. Additional information about RCH Solutions can be found at www.rchsolutions.com
About EMC Corporation
EMC Corporation is a global leader in enabling businesses and service providers to transform their operations and deliver IT as a service. Fundamental to this transformation is cloud computing. Through innovative products and services, EMC accelerates the journey to cloud computing, helping IT departments to store, manage, protect and analyze their most valuable asset—information—in a more agile, trusted and cost-efficient way. Additional information about EMC can be found at www.EMC.com
EMC Isilon is fully committed to advances in application development, including supporting the trend to incorporate Hadoop into evolving Life Sciences applications. EMC Isilon is the only scale-out NAS platform natively integrated with the Hadoop Distributed File System (HDFS). Using HDFS as an over-the-wire protocol, you can deploy a powerful, efficient, and flexible Big Data storage and analytics ecosystem. Isilon storage and analytics solutions support multiple instances of Apache Hadoop distributions from different vendors simultaneously, including Pivotal HD, Cloudera CDH, and Hortonworks Data Platform. Our solutions also support both HDFS 1.0 and HDFS 2.0. This allows you to leverage the specific tools you need for each of your unstructured data analytics projects.
Isilon’s in-place analytics approach eliminates the need to invest in a standalone Hadoop infrastructure. Our solution also allows you to eliminate the time and resources required to replicate your data into a separate infrastructure. This means that you can initiate data analytics projects faster and get results in a matter of minutes. And when your data changes, simply rerun the job with no re-ingest requirement.