GitHub for Science
Part 2 of Tinder for Nerds
At Riffyn, we believe very deeply in the idea of scientific source code. Our whole enterprise is founded on it. Our software embodies it. Our team delivers and teaches it. Heck, we preach it.
But we didn’t invent the idea from nothing. Like all ideas, it has parents. In this case, one parent of Riffyn’s idea for scientific source code was the napkin sketch in Figure 5 below. This sketch was shared between an industrial process development scientist and a manufacturing scientist who were trying to troubleshoot a problem at full-scale manufacturing. They were using the sketch to communicate what they were doing and to identify where the process differed between the process development lab and manufacturing.
This flow diagram, together with our years of observing how scientists work and lessons from data science and measurement systems analysis, gave rise to our idea for scientific source code.
A toolset for scientific source code
Our idea took form as a hypothesis in 2013. Here is that hypothesis, as it was originally written.
Over the months and years, that hypothesis has taken shape as process-flow objects embedded in the core of Riffyn’s Scientific Development Environment (SDE) cloud software. Examples of Riffyn SDE are shown in Figure 4 in Part 1 and in Figure 7 below, which shows Measure mode, the mode used to collect lab data.
The Riffyn SDE encodes experiments as process flow objects that capture every detail of the protocol, the reactions, the parameter settings, the sample flow, and the measured data points. These flow objects, and their associated experiment data, are unambiguous and machine readable.
Some people see this process-based representation of an experiment and say: “What the hell does this have to do with an experiment?” Others see it for the first time and say “Riffyn, where have you been all my life?”
Whether you find it instinctively right or wrong, it’s most certainly a paradigm shift. And it’s a shift that’s extraordinarily powerful. What makes it so powerful is that it can be stored, version controlled, copied, compared, reviewed, merged, and shared just like computer code. Even more, the process object provides a standardized, yet flexible, container for experimental data – it can be used to automatically contextualize and shape experiment data for analytics. In short, process objects are the source code for science, and the blueprints for machine-learning-ready data. And as source code, these process objects create a foundation for a GitHub for science. We realize that that statement is not self-evident, so let’s see this in action in a few illustrative examples below.
GitHub for science
Underlying a process flow representation of experiments in Riffyn is a machine-readable object. A fragment of that object is shown below. That object is a software-independent representation of the experiment; it can be read by any program and it fully describes the experiment. Embedded in that structure is also a complete genealogical record of the ancestor processes from which it came. That means that when one process is copied and modified to make a new process, its ancestry is tracked and included in the process information.
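To make the idea concrete, here is a minimal Python sketch of what such an object could look like. It is not Riffyn's actual schema: the names `ProcessObject`, `Step`, `parent_id`, `derive`, and `lineage`, and the example process itself, are hypothetical and chosen only for illustration. The sketch shows the two properties described above: the object serializes to plain JSON that any program can read, and a copy keeps a link to the process it came from.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Step:
    name: str
    parameters: dict = field(default_factory=dict)  # e.g. {"temperature_C": 30}


@dataclass
class ProcessObject:
    process_id: str
    version: int
    steps: list
    parent_id: Optional[str] = None  # hypothetical ancestry link


def derive(parent, new_id):
    """Copy a process and record which process it was copied from."""
    copied_steps = [Step(s.name, dict(s.parameters)) for s in parent.steps]
    return ProcessObject(new_id, parent.version + 1, copied_steps, parent.process_id)


def lineage(process_id, registry):
    """Follow parent links back to the root; return ids, oldest first."""
    chain = []
    current = registry.get(process_id)
    while current is not None:
        chain.append(current.process_id)
        current = registry.get(current.parent_id) if current.parent_id else None
    return list(reversed(chain))


# Made-up example process, for illustration only.
base = ProcessObject(
    "ferment-v1",
    1,
    [Step("inoculate", {"volume_mL": 5}),
     Step("incubate", {"temperature_C": 30, "hours": 24})],
)
variant = derive(base, "ferment-v2")
variant.steps[1].parameters["temperature_C"] = 37  # modify the copy only

registry = {p.process_id: p for p in (base, variant)}
serialized = json.dumps(asdict(variant), indent=2)  # plain, tool-independent JSON
print(lineage("ferment-v2", registry))
print(serialized)
```

Because the ancestry travels with the object itself, any tool that can parse the JSON can rebuild the family tree without access to the system that created it.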
This machine-readable experiment object makes possible three critical capabilities for sharing and breeding scientific ideas, capabilities that are virtually impossible with traditional approaches to scientific data sharing:
1. Data integration. The process object provides an instruction set to automatically contextualize, integrate, and reshape measurement data across experimental flows, scientific disciplines, collaborating teams, and time. This provides extraordinary power and speed for applying machine learning to everyday science. See example in this video.
2. Process and experiment genealogy. It is possible to construct the entire family tree of all process and experiment design variants that led to a current experiment or process design or were derived from it. Moreover, all the experimental data from the entire family tree can be combined into one data structure for analysis. See Figure 9 below and the example in this video. This deep traceability can be critical in a product development setting when transferring technologies to manufacturing or preparing data for regulatory submission.
3. Diff and merge of experiment designs. In other words, redlining experiment designs. It is possible to automatically identify every variable, parameter, or other element that is changed between two versions of a process. See Figure 9 below and the example in this video. Once identified, these differences can be accepted, rejected, and merged into a new version of the process.
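As a rough illustration of the third capability, the sketch below diffs two versions of a process and merges only the accepted changes. It assumes a deliberately simplified representation, a dictionary that maps step names to parameter dictionaries, and the functions `flatten`, `diff`, and `merge` are hypothetical helpers written for this example, not part of Riffyn SDE. The parameter values are made up.

```python
def flatten(process):
    """Map (step name, parameter name) -> value for one process version."""
    return {
        (step, param): value
        for step, params in process.items()
        for param, value in params.items()
    }


def diff(old, new):
    """Return the parameters added, removed, and changed between two versions."""
    a, b = flatten(old), flatten(new)
    return {
        "added": {k: b[k] for k in b.keys() - a.keys()},
        "removed": {k: a[k] for k in a.keys() - b.keys()},
        "changed": {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]},
    }


def merge(old, changes, accepted):
    """Apply only the accepted changes to a copy of the old version."""
    merged = {step: dict(params) for step, params in old.items()}
    for key in accepted:
        step, param = key
        if key in changes["added"]:
            merged.setdefault(step, {})[param] = changes["added"][key]
        elif key in changes["removed"]:
            merged[step].pop(param)
        elif key in changes["changed"]:
            merged[step][param] = changes["changed"][key][1]  # take the new value
    return merged


# Two made-up versions of the same process.
v1 = {"incubate": {"temperature_C": 30, "hours": 24}, "harvest": {"rpm": 4000}}
v2 = {"incubate": {"temperature_C": 37, "hours": 24, "shaking_rpm": 200},
      "harvest": {"rpm": 5000}}

changes = diff(v1, v2)
# Accept the incubation changes, reject the new centrifuge speed.
v3 = merge(v1, changes,
           accepted=[("incubate", "temperature_C"), ("incubate", "shaking_rpm")])
print(changes)
print(v3)
```

Here `v3` takes the new incubation settings but keeps the original harvest speed, which is the accept-or-reject redlining described in item 3. A real process object would carry far more than parameters, so treat this as the shape of the operation rather than a full implementation.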
Although they are just the beginning of many more great things to come, these three capabilities form the foundation needed to construct a GitHub for Science.
It is worth noting that all of the examples above were executed using open-source code running on software platforms outside of Riffyn SDE. The role of Riffyn SDE is to provide the system-agnostic source code and data for the requested experiments. That data can be read, used, and modified by any other software. Thus, this source code for science is free to live and evolve outside of the boundaries of the system that initially generated it – just as true source code should be.
From GitHub to manufacturing hub
In software engineering, GitHub is not the final resting place for source code. To be useful, the code has to be transferred to a production environment. That transfer is accomplished by automated systems that deploy the code to the servers and clients that run it. Similarly, in the design and manufacturing of hardware, a CAD file can be transferred to a CAM machine or 3D printer for automated or semi-automated fabrication.
Scientific source code makes the same kind of automated transfer possible between R&D and manufacturing. For example, as analytical chemistry methods or bioprocesses evolve in R&D, they approach manufacturing-ready form. The process objects (source code) that capture methodological and process details can then be digitally transferred to manufacturing execution systems to operate production facilities (Figure 10).
Not only would this reduce errors and accelerate timelines, it would also provide a means to map manufacturing data directly to development data. Thus, the performance of development-scale and manufacturing-scale processes can be compared to quickly diagnose deviations and find opportunities for process improvement. Ultimately this will lead to better product quality, more predictable product development, and lower costs for drug, chemical, material, and biotech product development.
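A small sketch shows why a shared process definition makes that mapping straightforward. It assumes that development and manufacturing runs record their measurements under the same (step, measurement) keys because both were generated from the same process object. The functions `summarize` and `compare_scales`, the 10% tolerance, and all the numbers are hypothetical and made up for illustration.

```python
from statistics import mean


def summarize(runs):
    """Average each (step, measurement) across a list of run records."""
    grouped = {}
    for run in runs:
        for key, value in run.items():
            grouped.setdefault(key, []).append(value)
    return {key: mean(values) for key, values in grouped.items()}


def compare_scales(dev_runs, mfg_runs, tolerance=0.10):
    """Flag measurements whose manufacturing mean differs from the
    development mean by more than `tolerance` (relative difference)."""
    dev, mfg = summarize(dev_runs), summarize(mfg_runs)
    flagged = {}
    for key in dev.keys() & mfg.keys():  # only keys both scales share
        if dev[key] == 0:
            continue  # relative difference is undefined
        relative = (mfg[key] - dev[key]) / dev[key]
        if abs(relative) > tolerance:
            flagged[key] = relative
    return flagged


# Made-up measurements keyed by (step, measurement).
dev_runs = [
    {("ferment", "titer_g_per_L"): 10.0, ("ferment", "final_pH"): 7.0},
    {("ferment", "titer_g_per_L"): 12.0, ("ferment", "final_pH"): 7.0},
]
mfg_runs = [
    {("ferment", "titer_g_per_L"): 8.0, ("ferment", "final_pH"): 7.1},
    {("ferment", "titer_g_per_L"): 8.5, ("ferment", "final_pH"): 6.9},
]

flagged = compare_scales(dev_runs, mfg_runs)
print(flagged)
```

In this toy data set only the titer is flagged, since the manufacturing mean sits well below the development mean while the pH agrees. The point is that no manual relabeling or spreadsheet alignment is needed once both scales share the same keys.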
A new era in scientific experimentation
We owe a great societal debt to the wonders of science and the profound power of the scientific method that underlies it. That method, and the traditional means of recording and reporting results, have delivered a high quality of life for people. But as our society has grown larger and more complex, and the pace and pressures of society have increased, so have our demands on science to deliver faster and more profound innovation. The 400-year-old toolset of scientific communication needs an upgrade.
The foundation for that upgrade is source code for science, and it’s already here. It’s been used to transform the operations of Riffyn’s customers. It has already taken biotech products to market faster than ever before. Yet this source code-based approach has only taken its tentative first steps. We can’t wait to see what it will deliver when it hits full stride.