Shaping Scientific Data is Like Growing a Square Tree

 
Photo credit: www.terjeenge.com

Summary

  • The life science industry is generating more data than ever, but effective ways of accessing these data are lacking
  • Experimental data is a dynamic network of data points, much like a tree, but common tools only allow two-dimensional (row-and-column) relationships to be defined. This limitation of our tools impedes scientific discovery.
  • R&D organizations want global real-time access to their data in a format that is structured for statistical analysis.
  • Riffyn SDE was designed as a platform for global R&D organizations to capture, structure and integrate their data automatically. It allows scientists to visually design their experimental processes. Riffyn SDE then automatically generates a flexible data model to represent complex relationships between experimental data points.

The last twenty years in R&D have seen a proliferation of digital, automated, high-throughput devices that generate mountains of data. Hundreds of such devices allow R&D teams to conduct more experiments and generate more data points every year. However, this expansion of experimentation raises the question of what to do with all that data. More data often doesn’t translate into more insights and faster results. Data are only useful if they are clearly annotated, linked to all relevant data upstream and downstream, and, most importantly, accessible to globally distributed teams.

 
Experimental data is often kept in siloed spreadsheets and databases.

Data analytics software and data storage solutions for life science R&D have evolved in parallel over the past 20 years, but not necessarily in harmony. On one hand, the available data analytics tools are fantastic: JMP, Spotfire, Tableau, R, Python. The list of outstanding analytical tools is extensive and growing. On the other hand, data storage solutions, such as spreadsheets (Excel), Laboratory Information Management Systems (LIMS), Electronic Lab Notebooks (ELNs), and relational databases, have been a mixed bag. They’re well understood, and they’re good for their primary purpose: storing data. However, R&D organizations are not merely interested in storing data. What they actually need is a better way to use data to extract scientific insights. Today’s data storage solutions are ill-suited to the data-use challenges of annotation, joining, and extraction, the steps necessary for meaningful data analysis.

 

Why are today’s data storage solutions inadequate?

The problem is twofold. First, these systems are rigid in their design and aren’t easily adapted to change. A LIMS still requires months of planning and a support team of a dozen people to implement. The result is that they are designed to solve yesterday’s scientific questions and need to be constantly reconfigured to support the future. That becomes a huge barrier to scientific progress, which demands constant change and continuous improvement in process and experiment designs.

Second, they’re not very interoperable with each other, which makes the extraction, cleaning, and formatting of data sets a huge chore. For example, joining procedural data (usually stored in an ELN) with results data (usually stored in spreadsheets or a LIMS) is no easy task; annotating analytical sample data with all the relevant procedural information used to generate the samples usually involves heroic effort; and extracting data from multiple linked unit operations in an analysis-ready format is often impossible.

So where have we gone astray?

 

Data has branches like a tree

The ramifications of these limitations are significant: they can stunt scientific discovery. New drugs are discovered slowly or not at all. Novel products go undeveloped.

One oft-unaddressed challenge for legacy architectures is that lab data looks like a tree, a branching network of relationships between data points, not like the square we want it to be for machine learning and statistics. Statistical and analytical systems typically need data in defined rows and columns.

Let’s consider, for example, a standard bacterial fermentation. A 100 L fermentor may require sampling every 15 minutes to check pH, optical density, product concentration (yield), and a number of additional tests. Inlet and outlet gas compositions also need to be collected and stored (often in real time). Each fermentor is fed prepared media to sustain the bacteria. Further, the 100 L fermentor may have been inoculated from a cell culture generated by a series of other fermentations starting from a 200 μL vial, each with its own media, samples, and gas compositions.
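
To make the branching concrete, here is a minimal sketch (in Python, using the networkx library; all vessel and attribute names are hypothetical) of that fermentation lineage as a directed graph, where each vessel descends from earlier vessels and carries its own measurements:

```python
# Illustrative sketch only: the fermentation lineage above as a directed
# graph. Node and attribute names are made up for the example.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edge("vial_200uL", "seed_flask", step="inoculate")
lineage.add_edge("seed_flask", "seed_fermentor_5L", step="inoculate")
lineage.add_edge("seed_fermentor_5L", "fermentor_100L", step="inoculate")
lineage.add_edge("media_batch_42", "fermentor_100L", step="feed")

# Each vessel carries its own time series of samples.
lineage.nodes["fermentor_100L"]["samples"] = [
    {"t_min": 0, "pH": 6.8, "OD600": 0.10},
    {"t_min": 15, "pH": 6.7, "OD600": 0.21},
]

# Tracing a sample back to its origins means walking the branches.
print(nx.ancestors(lineage, "fermentor_100L"))
# {'vial_200uL', 'seed_flask', 'seed_fermentor_5L', 'media_batch_42'}
```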

To conduct a meaningful analysis of this fermentation (one that can control for things like random variance), all of these records need to be joined together into a statistical data format. That means all time points need to be aligned; sample data need to be connected to their sources; different units must be normalized (Celsius vs. Fahrenheit); data need to be annotated with consistent terms; and sample IDs need to be reconciled. Perhaps the most difficult part is simply finding the data: they reside in LIMS, spreadsheets, SOPs, batch sheets, and data historians. This process can take weeks to complete, even for procedures as common as fermentation.
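
As a rough sketch of what that joining work looks like in practice, the snippet below (Python with pandas; the file names, column names, and tolerances are all assumptions for illustration) aligns sample time points with off-gas readings, normalizes temperature units, and reconciles sample IDs:

```python
# Hypothetical example of the manual joining steps described above.
import pandas as pd

samples = pd.read_csv("fermentor_samples.csv", parse_dates=["timestamp"])
offgas = pd.read_csv("offgas_historian.csv", parse_dates=["timestamp"])

# Normalize units: convert any Fahrenheit readings to Celsius.
is_f = samples["temp_unit"] == "F"
samples.loc[is_f, "temperature"] = (samples.loc[is_f, "temperature"] - 32) * 5 / 9
samples["temp_unit"] = "C"

# Reconcile sample IDs recorded with inconsistent formatting.
samples["sample_id"] = samples["sample_id"].str.strip().str.upper()

# Align each sample with the nearest off-gas reading within 5 minutes.
flat = pd.merge_asof(
    samples.sort_values("timestamp"),
    offgas.sort_values("timestamp"),
    on="timestamp",
    direction="nearest",
    tolerance=pd.Timedelta("5min"),
)
```

And this covers only two of the many sources; every additional spreadsheet, LIMS export, or batch sheet adds another round of the same cleaning and joining.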

All this is considered normal! We’ve lived in this world for decades, so nothing feels strange about it. The underlying problem is that scientific data are multi-dimensional, with many complex relationships between samples, devices, and systems; they are more akin to a network or a social graph. Those relationships aren’t fully captured within stand-alone legacy databases and spreadsheets, which makes it incredibly difficult to restructure the data stored within them.

 

 
Joined, flattened multidimensional data generated by Riffyn. The branching relationships between samples along an experiment are reflected in the left-to-right outline view. This is a “flattening” of the tree into a form ready for statistics and machine learning.

 


Putting scientists in the driver’s seat

Scientific experiments are designed specifically to test the effect of varying conditions chosen by the experimenter. As such, the structure of a “master” data set (all the design, contextual, and results data for an experiment) is also determined by the experimenter. So for a data system to deliver the annotated, “flattened”, and joined data sets necessary for analysis, it needs a clear and continuously updated set of instructions from the scientist describing the experiment’s context. And this is the heart of the problem: current data systems don’t do this.
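
As a hypothetical illustration (not Riffyn’s format), such instructions could be as simple as a machine-readable declaration of the experiment’s context, naming the varied factors, the measured responses, and the keys on which data sources should be joined:

```python
# Hypothetical sketch: the kind of experiment context a scientist could
# declare once, so a data system can annotate and join data automatically.
experiment_context = {
    "process": "fed-batch fermentation",
    "factors": ["temperature_setpoint", "media_lot"],  # varied on purpose
    "responses": ["yield_g_per_L", "OD600"],           # measured outcomes
    "join_keys": {"samples": "sample_id", "offgas": "timestamp"},
    "units": {"temperature": "C"},
}
```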

In typical labs today, data scientists have the skills to reshape and analyze data, but they don’t have the necessary context. They have to chase down and interview many busy R&D scientists and staff to understand how data should be annotated, cleaned, and joined for analysis. This is a challenging, never-ending undertaking that takes a lot of time and a strong understanding of the underlying scientific design. Scientific activity and analyses thus tend to get bottlenecked on the few people with the needed computational skills, and lab scientists’ access to cleaned-up data and analysis tools tends to be limited as a result.

Moreover, science is incredibly dynamic. The experiments you run today may be drastically different from those you recorded three months ago (or three months from now). Methods, processes, and analyses are always changing. Maybe you change an assay procedure or use very different operating conditions. Our data systems must adapt to these changes without sacrificing your ability to work with historical data.

To solve these issues, the next generation of data systems needs to be more adaptive to the constantly changing designs of scientific experiments, and it needs to put the power of data analytics more directly into the hands of the scientists who generate the data.

 

Solutions are here

Life science companies of every kind, from biopharma to chemicals to food producers, want to do more with their data and are making significant investments to that end. Merck is partnering with Accenture to develop a Research Life Science Cloud to improve cost and efficiency across the entire industry. GSK is investing in AI to revolutionize its drug discovery business. BASF has invested heavily in the digital transformation of its research organization. There’s real money behind this transformation, and everyone is looking for the best platform on which to base their business.

Riffyn was created to serve as such a platform. We’ve developed our hypergraph data model to accurately reflect the complex structure of lab data. We’ve also incorporated data versioning so that new data can be easily combined with older versions.
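
Riffyn hasn’t published its internal schema, but the core idea of a hypergraph can be sketched simply: unlike an ordinary graph edge, a hyperedge connects many inputs to many outputs at once, and a version number keeps older records interpretable after the process design changes. A deliberately simplified illustration:

```python
# Simplified illustration of a versioned hypergraph record; this is not
# Riffyn's actual data model.
from dataclasses import dataclass

@dataclass(frozen=True)
class HyperEdge:
    """One process step connecting many inputs to many outputs at once."""
    step: str
    inputs: tuple    # e.g. ("seed_culture", "media_batch_42")
    outputs: tuple   # e.g. ("fermentor_100L_run_7",)
    version: int     # bumped when the process design changes

edges = [
    HyperEdge("inoculate", ("seed_culture", "media_batch_42"),
              ("fermentor_100L_run_7",), version=3),
]

# Runs recorded under older versions remain queryable as-is, while new
# runs are captured under the latest design.
latest = [e for e in edges if e.step == "inoculate" and e.version == 3]
```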

Data across multiple related experiments can be pulled together in a structured format, ready for statistical analysis, in as little as 30 seconds. This fully annotated and cleaned data set can then be used to ask questions such as “Which operating conditions lead to optimal results?”, “What’s the effect of material lot numbers on my yield?”, or “How do my Day 14 data compare to my Day 3 data?” Questions like these can be answered through data analysis alone, without conducting additional time-consuming experiments.
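
Continuing the hypothetical flattened `flat` table from the earlier joining sketch, a question like “Which operating conditions lead to optimal results?” reduces to a one-line aggregation (the column names here are, again, assumptions):

```python
# Once the data are flat, analysis is a group-by, not a data hunt.
# `flat` and its columns are hypothetical, carried over from the sketch above.
best = (
    flat.groupby(["temperature_setpoint", "media_lot"])["yield_g_per_L"]
        .mean()
        .sort_values(ascending=False)
        .head(5)
)
print(best)  # top five condition combinations by mean yield
```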

Further, Riffyn SDE puts the power of data directly in the hands of the scientists who generate it. The process design environment lets users design their experiment structure and material flow, while Riffyn SDE automatically generates a matching data model with no coding or UI design needed. Teams can duplicate one another’s experimental processes to share best practices, collaborate on process design, execute experiments, and see how one group’s work impacts other teams.

With Riffyn SDE, R&D organizations have an opportunity to adopt a new approach to scientific discovery that delivers flexibility, context, structure, and analysis all at the same time. Data can inform next steps.  Processes can be continuously, iteratively improved. Organizational turnover doesn’t have to represent a loss of scientific experience and insight. Lab data can finally start serving its true purpose in driving organizational effectiveness.  

In short, Riffyn delivers data you can use and reuse to drive faster discovery.

 

Want to better understand how to structure your data for analysis? Check out this related article by Loren Perelman: Data is Like Ikea Furniture, It's Best Shipped Flat.

Douglas Williams