Tinder for Nerds

 
 

We need a Tinder for nerds. But not for nerds themselves. We need it for their nerdly ideas. 

Ideas are not formed from nothing; they are bred. Ideas romance each other. They mate. They mix their DNA. They have children. They are born. And they have parents.

But sadly, the breeding of scientific ideas is being shortchanged. There is no Tinder for the ideas of science nerds. And we’re all suffering directly or indirectly because of it.

In today’s distributed world, the breeding of ideas is heavily assisted by software and cloud collaboration tools like GitHub, Office 365, and Google Docs. These tools have dramatically reshaped the means and speed with which ideas are bred in our personal and work lives. Try collaborating on a document without redlining, without comments. Try it without real-time sharing. It’s laborious, it’s painful, it’s inhibitory.

One of these tools, GitHub, is a particularly significant role model for what we might want for science. GitHub is a cloud-based system for source-code development and collaboration. It’s a kind of Tinder for computer code. It allows engineers to find and duplicate code, make changes to it in their own sandbox, offer that code back to the source, easily see (redline) changes in the code (Figure 1), decide whether to accept the changes, and then merge the changes into the shared code base. You can see the history (Figure 2) of these code dating activities as a network of parallel development activities (the branches), reviews (the nodes), and merges (the branches recombining). Software engineering would be light-years behind where it is now were it not for tools like this.

 
Figure 1: Code comparison (redlining) in GitHub. Red shading highlights lines of code in a prior version that are missing from the current version. Green shading highlights code that has been added to the current version.


 
Figure 2: "Breeding" history of source code. The diagram shows evolving branches of code. The dots represent milestones where code was reviewed and saved to the history. The different color lines represent parallel branches of development that can be merged back together into the main code base (black) from time to time.
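To make the redlining idea concrete, here is a minimal sketch using Python's standard difflib module. It is not how GitHub itself computes or displays its comparisons; the two tiny code versions are invented for illustration. Lines marked as removed correspond to the red shading in Figure 1, and lines marked as added correspond to the green.

```python
import difflib

# Two hypothetical versions of the same small program (invented for this example).
prior = [
    "def area(r):",
    "    return 3.14 * r * r",
]
current = [
    "import math",
    "",
    "def area(r):",
    "    return math.pi * r ** 2",
]

# unified_diff yields two file-header lines, then hunks in which
# "-" marks a line only in the prior version and "+" a line only in the current one.
diff = list(difflib.unified_diff(prior, current,
                                 fromfile="prior", tofile="current", lineterm=""))

body = diff[2:]  # skip the "--- prior" / "+++ current" header lines
removed = [line[1:] for line in body if line.startswith("-")]  # the "red" lines
added = [line[1:] for line in body if line.startswith("+")]    # the "green" lines

print("\n".join(diff))
```

The same comparison only works because code is stored as structured, line-by-line text. That is the property a source code for science would need to share.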


In science, there is no equivalent to document redlining or GitHub. The primary tools for breeding ideas are still the whiteboard, PowerPoint presentations in group meetings, and free-form written reports or papers. Of course, these tools are essential and valuable, but they are not nearly enough.

Science needs its own kind of source code. And it needs its own Tinder or GitHub – tools for developing, comparing, and breeding that scientific source code along with the ideas it represents.

This post outlines a perspective on what could be the source code for science, how it could work, and how it could reshape science.

So, let’s get it on!

What the hell is “source code” for science?

If we want to have a source code for science, we need to resolve a couple of fundamental questions first. The first question is, “What do scientists make?” The answer will tell us what the source code will represent. Do scientists make knowledge, or theories, or data, or experiments? Or are some of those just means to an end?

Let’s answer this via analogy to other fields. A software engineer makes code that effects actions on data or machines. An architect makes blueprints (CAD drawings) that are used to build and operate a structure. A film writer makes a script that is used to produce a movie.  And a scientist makes experiment designs that are used to generate data and conclusions.

Hopefully you see the pattern here: the creative product is the code, the blueprint, the script, the experiment design. These are completely within the control of the creator(s). And then there are the outcomes that are generated from, operated with, or interpreted using the creative product. Outcomes are based on the creative product but are not entirely controlled by the creator.

Creative products are essential to generating and understanding the outcomes. An auto engineer could not repair a car without access to the design documents (or at least a help guide that is derived from them). A software engineer could not understand or improve a program without evolving the code. A writer could not refine a film, or get actors to sign up, without sharing and iterating on the script. Likewise, a scientist cannot properly interpret any generated data without understanding the experiment design that generated it.

The role of source code (or of blueprints or scripts) is to capture these creative products in tangible form and to provide a systematic means for the creator(s) to share and evolve them. Thus, we have answered the first question – the thing that scientists make is the experiment design. And thus the source code for science needs to unambiguously capture that.

Scientists make experiments. Duh. So, what’s the big deal?

We have arrived at our second fundamental question – what does source code for science look like? That may seem obvious – it’s what scientists have done for 400 years. They write down a list of materials, methods to be performed, and observations to be recorded.

But this is insufficient. Anyone who has spent any time at all in any lab on the planet knows how completely inadequate these ad hoc materials and methods are. They know that spreadsheets and paper or electronic notebooks usually contain uninterpretable, incomplete, or ambiguous statements of what was actually done in an experiment. They are the equivalent of trying to describe a car part with paragraphs and spreadsheets – that approach doesn’t work for cars, and it doesn’t work for science.

But we have a small obstacle. While software can be embodied line by line in text (a structure that is prescribed by the sequential processing behavior of a machine), experiments are more multidimensional and potentially unconstrained. How do we encode them in a form that gives us all of the transparency and breed-ability of computer source code?

Fortunately, every scientific experiment shares something fundamental which allows us to define its source code. Every experiment is a sequence of actions (a protocol) in which inputs (things the scientist can control or study) are transformed into outputs (things the scientist can make or observe) – see Figure 3. String a series of such actions together and you have an experiment. And the data generated from an experiment describes those inputs and outputs. When we draw conclusions, it is by correlating that input/output data in some way, shape, or form.

Figure 3: Every scientific experiment can be described as a sequence of actions where inputs are transformed to outputs.
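As a rough sketch of that idea, the snippet below models an experiment as an ordered list of actions, each with named inputs and outputs, and checks that every input is either a starting material or the output of an earlier action. The class name, function name, and the toy cell-culture protocol are all hypothetical, chosen only for illustration; they are not Riffyn's data model.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Action:
    """One step of a protocol: named inputs are transformed into named outputs."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def check_flow(protocol: List[Action], starting_materials: Iterable[str]) -> Set[str]:
    """Walk the actions in order and return everything available at the end.

    Raises ValueError if an action needs an input that is neither a starting
    material nor the output of an earlier action.
    """
    available = set(starting_materials)
    for step in protocol:
        missing = [item for item in step.inputs if item not in available]
        if missing:
            raise ValueError(f"{step.name!r} is missing inputs: {missing}")
        available.update(step.outputs)
    return available


# A toy protocol (invented): three actions strung together into an experiment.
protocol = [
    Action("grow culture", ("cells", "medium"), ("culture",)),
    Action("harvest", ("culture",), ("cell pellet", "supernatant")),
    Action("assay", ("supernatant",), ("absorbance reading",)),
]

final = check_flow(protocol, {"cells", "medium"})
print(sorted(final))
```

Even this small structure makes one thing explicit that a free-form methods paragraph can leave ambiguous: which output of which step feeds the next one.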


Thus, we have arrived at our answer to the second fundamental question: What does the source code for science look like? It looks like code that captures a process flow (see Figure 4) composed of a sequence of transformation steps. Source code for science provides an explicit expression of all the actions of an experimental protocol, all of their inputs and outputs, all the properties that are set or measured on those inputs and outputs, and the flow of materials and data from one action to the next across the process.

 
Figure 4: A scientific experiment can be described as a process flow diagram – a sequence of transformative steps with inputs, outputs, parameters, and measurements. This can serve as the source code for science because it is complete, unambiguous, and can be encoded in machine-readable form for sharing, comparison, analysis, and reuse.
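One way to picture "machine-readable" is sketched below: a process flow written as plain data that survives a round trip through JSON, plus a small function that redlines two versions of a design by reporting which parameters changed. Every field name, unit, and value here is an assumption made up for the example, not an established format.

```python
import json

# A hypothetical two-step experiment design, encoded as plain data.
design_v1 = {
    "steps": [
        {
            "name": "grow culture",
            "inputs": ["cells", "medium"],
            "outputs": ["culture"],
            "parameters": {"temperature_C": 30, "duration_h": 24},
            "measurements": ["optical density"],
        },
        {
            "name": "assay",
            "inputs": ["culture"],
            "outputs": ["absorbance reading"],
            "parameters": {"wavelength_nm": 600},
            "measurements": ["absorbance"],
        },
    ]
}

# A revised design: copy via a JSON round trip, then change one parameter.
design_v2 = json.loads(json.dumps(design_v1))
design_v2["steps"][0]["parameters"]["temperature_C"] = 37


def parameter_changes(old, new):
    """Return {(step name, parameter): (old value, new value)} for every difference."""
    def flatten(design):
        return {(step["name"], key): value
                for step in design["steps"]
                for key, value in step["parameters"].items()}

    old_params, new_params = flatten(old), flatten(new)
    changes = {}
    for key in sorted(old_params.keys() | new_params.keys()):
        before, after = old_params.get(key), new_params.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


changes = parameter_changes(design_v1, design_v2)
print(changes)  # {('grow culture', 'temperature_C'): (30, 37)}
```

Once a design is data like this, it can be stored, shared, and compared in much the same way source code is.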


 

At Riffyn, we believe very deeply in the idea of scientific source code. Our whole enterprise is founded on it. But we didn’t invent the idea from nothing. Like all ideas, it has parents. In part 2 of this post (to be published October 3, 2019), we will uncover the roots of this idea and how it could lead us to something profound — a GitHub (or Tinder of sorts) for science where ideas can be bred, and a transformation of the lifecycle of scientific product development.

Timothy Gardner