Be your own data scientist — it’s easier than you think

Originally published on SynBioBeta.

At my first job interview after my post-doc, I told the interviewer that I was a biochemist. He scoffed and said, “Do people really call themselves biochemists anymore?” He was referring to the new, sexier terms “chemical biologist” or even “biological chemist” describing the same field. I laughed externally while internally I rolled my eyes and uttered, “Semantics.”

Isn’t every scientist a data scientist?

When job postings for “data scientist” started cropping up at most major biotech and pharma companies, I assumed the same thing - that “data scientist” was just the new, sexier term for regular old scientist. After all, isn’t data analysis at the very heart of what we do as scientists? We run experiments, we analyze the data, we learn from it, then we iterate. Right?

Not exactly. As datasets are getting larger and more complex, more and more companies are turning to data scientists to do the analysis, meaning the workflow of a scientist is starting to look like this: run experiments, give data to data scientist to analyze, look at what data scientist did, assume they know more about “this stuff” than you, take their findings as truth, then iterate.

The problem with that model is two-fold. First, even though pipetting small amounts of liquid from one place to another while rocking out in lab to your favorite tunes can be cathartic at times, I don’t want my whole existence as a scientist to be relegated to “assay monkey”. Data analysis is the fun part! I don’t want to outsource that. And second, more importantly, we are creating more silos when we hand our data over to data scientists. You, as the scientist, are the BEST person to analyze the data. You know more about your experiments than the data scientist ever will. You know things like, “Oh maybe I see an increased signal here because I always run that sample in well A1, and I might be seeing edge effects,” or “Jurkat cells have more of my receptor on their surface than HEK293 cells, so cell count is not a good proxy for receptor numbers in this assay”. These are things the data scientist is unlikely to know, and may lead them to make erroneous conclusions from your datasets.

Okay, I see your point, but data scientists know scary statistical terms like “variance inflation factor” and “false discovery rate” that I’ll never understand

It’s true. Data scientists use terms like “multivariate regression” in casual conversation, which might make you squirm. Don’t fret, though. These things are actually not as hard as they seem. If you understand a few basic concepts of statistics — I’m talking theories that were worked out 100 years ago — then you can do most of the required data analysis yourself. Sure, there will be complex projects that require the help of a data scientist, but by and large, you can do the basic analyses yourself, significantly increasing the speed of the design → measure → analyze → iterate cycle.

Wait, I don’t know how to code, is that going to be a problem?

No, not anymore. One of the major barriers preventing “regular” scientists from doing their own data mining and analysis used to be writing code to pull together datasets. While this can be done by hand, it’s definitely not recommended, and is one of the main reasons organizations turn to data scientists. This is changing. Software companies such as Riffyn are breaking down this barrier by creating software that automatically joins related datasets across multiple experiments and across unit operations (think, ‘join my protein production data to my cell growth data to my flow cytometry analysis data’). This is possible even if different people generated the data at different times. The result? Scientists can generate a comprehensive data table ready for data mining and statistical analysis without knowing what a command line is. And these days statistical analysis doesn’t necessitate coding either. Programs such as JMP, Spotfire, and Tableau offer user-friendly interfaces with the same statistical tools as more code-centric languages like Python or R. JMP even allows you to automate your analyses with scripts that are magically built by reading your mind* as you perform analyses in their software. (*Okay, so it’s not really magic and it doesn’t really read your mind. Actually, it tracks the analyses that you did and automatically turns them into a script for you, but same difference.)
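You don’t need to write this yourself, but if you’re curious what “joining” datasets means under the hood, here is a minimal Python sketch. The table names, column names, sample IDs, and values are all made up for illustration, and this is not how Riffyn or any other product implements it. It simply lines up rows from three measurements by a shared sample ID.

```python
# All data below are invented for illustration only.
# Each table maps a sample ID to the measurements recorded for that sample.
protein_production = {
    "S1": {"titer_mg_per_l": 120.0},
    "S2": {"titer_mg_per_l": 95.0},
}
cell_growth = {
    "S1": {"od600": 1.8},
    "S2": {"od600": 1.2},
    "S3": {"od600": 0.9},  # no matching protein or flow data
}
flow_cytometry = {
    "S1": {"percent_positive": 64.0},
    "S2": {"percent_positive": 41.0},
}


def join_by_sample(*tables):
    """Inner join: keep only samples that appear in every table."""
    shared_ids = set(tables[0])
    for table in tables[1:]:
        shared_ids &= set(table)

    joined = []
    for sample_id in sorted(shared_ids):
        row = {"sample_id": sample_id}
        for table in tables:
            row.update(table[sample_id])  # add this table's columns to the row
        joined.append(row)
    return joined


combined = join_by_sample(protein_production, cell_growth, flow_cytometry)
for row in combined:
    print(row)
```

Sample S3 drops out because it has no protein or flow cytometry data. Whether to drop or keep incomplete samples is a choice you make for each analysis.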

Okay, I’m sort of intrigued...how do I get started?

Becoming your own data scientist is fairly straightforward, but it does require a shift in your thinking. In order to learn something from your data in aggregate, you have to start recording related experiments in a structured format. This will allow you to look at multiple samples or multiple experiments together later, even if they were performed in different weeks. The trends in those data will teach you things you never knew were there — about your experimental quality, and about the system you’re studying.

Yes, this means that throwing the data from each individual experiment into a new Excel spreadsheet and making bar charts for each one isn’t going to cut it anymore. Hoping that you or your colleagues will be able to decipher your random column names later isn’t going to cut it. You’ve got to be a little more forward-thinking than that. You have to think about standardizing data capture and annotation, experimental design, and statistical power.

I know what you’re thinking: “Well, I never really do the same thing twice, so how can I use a standard format for data collection? Why even bother to compare data across weeks?” I used to think this too. Until I thought about my typical workflow. As a discovery scientist, I was running cell-based screening assays, and while the first part of the experiment (samples and parameters) changed almost daily, the last part (the analysis assays) was pretty much the same every time. What if I had recorded those in a structured format? What could I have learned from my data? Well, quite simply, a whole lot.

Analytical assays—isn’t that someone else’s job?

There seems to be a pervasive attitude amongst early-stage R&D scientists that their job is to understand the biology and that it is “someone else’s” job to work out the analytical assays. When I joined my post-doc lab, someone handed me a protocol from what looked like the 1980s and said, “This is the assay we run. Just make sure to invert the reagent vial three times before you pipet it. That’s the key to getting it to work.” What the … ??? Even in industry it wasn’t much better. Our main analysis assay was an ELISA that was supposedly optimized six years earlier by someone who no longer worked for the company.

I trusted the assays people handed me out of ignorance. I didn’t know there were standard methods for performing process improvement studies on those assays. I didn’t know how to perform correlation analyses or how to calculate the contribution of different parameters to the variation in results. Consequently, I never did any characterization to understand how predictive the analytical assays I was using actually were. I never characterized the noise and never understood what parameters affected variance in the assay. As a result, we could have been interpreting false positives or missing key results that were lost in the noise. Worse, I now know that all the data we needed to understand variation in our assays was right under our noses; we just didn’t know what to do with it.

So...the data was right under your nose? Let me double click that.

Yes. It turns out that you can continuously assess your assay just using the data that you collect every day. No need to run “special” experiments. The key is recording your data in a structured format. (Think: my analytical readout—absorbance, for example—is always recorded in the same column alongside the experimental parameters used to generate the data, such as cell type, amount of compound, and duration of incubation.) Does this sound hard? It’s not. Software programs, such as Riffyn, make it as easy to capture data this way as it is to enter it into multiple Excel spreadsheets. In fact, it’s often even easier because it adds many of those parameters to your data automatically.
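To make “structured format” concrete, here is a minimal Python sketch. The column names and values are hypothetical, and this is not how Riffyn or any specific tool stores data. The idea is simply that every experiment writes rows with exactly the same columns, so that data from different weeks can be stacked into one table later.

```python
import csv
import io

# Hypothetical column names; every experiment records exactly these, in this order.
COLUMNS = [
    "experiment_id",
    "well",
    "cell_type",
    "compound_um",
    "incubation_min",
    "absorbance",
]


def make_record(**fields):
    """Build one row, refusing missing or unexpected column names."""
    missing = set(COLUMNS) - set(fields)
    extra = set(fields) - set(COLUMNS)
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return {name: fields[name] for name in COLUMNS}


def to_csv(records):
    """Serialize rows to CSV text with a fixed header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up rows from two different experiments, recorded in the same shape.
records = [
    make_record(experiment_id="EXP-001", well="A1", cell_type="Jurkat",
                compound_um=1.0, incubation_min=60, absorbance=0.42),
    make_record(experiment_id="EXP-002", well="B3", cell_type="HEK293",
                compound_um=1.0, incubation_min=75, absorbance=0.31),
]
print(to_csv(records))
```

The check in `make_record` is the important part: a row with a misspelled or one-off column name is rejected when it is recorded, not discovered months later when nobody remembers what it meant.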

If we had recorded all of our data across the hundreds of experiments we did in this way, we could have easily pulled it all together, performed correlation analyses and understood what factors contributed the most to variation in the results. With this knowledge we could have tweaked our assays to better measure the parameters we cared about, to produce more reliable, accurate results. The bottom line? Your results will stand up over time, you’ll make better decisions, and you’ll waste less time chasing down artifacts and dead-ends.
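For a rough sense of what those analyses look like, here is a small Python sketch using only the standard library. The data are invented, and the factor names (`antibody_lot`, `incubation_min`) are assumptions chosen for illustration. It computes a Pearson correlation and a simple “share of variation explained” (eta squared: the between-group sum of squares divided by the total sum of squares). A real analysis would use far more rows and a proper statistics package.

```python
from collections import defaultdict
from statistics import mean

# Invented rows in a structured format: one row per measurement.
rows = [
    {"antibody_lot": "A", "incubation_min": 60, "absorbance": 0.40},
    {"antibody_lot": "A", "incubation_min": 75, "absorbance": 0.46},
    {"antibody_lot": "B", "incubation_min": 60, "absorbance": 0.50},
    {"antibody_lot": "B", "incubation_min": 75, "absorbance": 0.56},
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two numeric columns."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def eta_squared(data, factor, response):
    """Fraction of total variation in `response` explained by group means of `factor`."""
    values = [row[response] for row in data]
    grand_mean = mean(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)

    groups = defaultdict(list)
    for row in data:
        groups[row[factor]].append(row[response])
    between_ss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    return between_ss / total_ss


r_time = pearson_r(
    [row["incubation_min"] for row in rows],
    [row["absorbance"] for row in rows],
)
share_lot = eta_squared(rows, "antibody_lot", "absorbance")
share_time = eta_squared(rows, "incubation_min", "absorbance")

print(f"correlation of absorbance with incubation time: {r_time:.2f}")
print(f"share of variation explained by antibody lot: {share_lot:.2f}")
print(f"share of variation explained by incubation time: {share_time:.2f}")
```

In this toy dataset the antibody lot accounts for more of the spread in absorbance than the incubation time does. With real, unbalanced data the shares will not add up so neatly, which is where tools like JMP or a statistician’s advice come in.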

I find myself wondering now what kind of information was buried in those hundreds of uninterpretable Excel spreadsheets we generated. Did the lot number of detection antibody have a significant effect on the results of my assay? Perhaps the assay was sensitive to incubation time, and all those times I let it sit 15 min longer while I finished up a meeting actually skewed my results. I’ll never know.

But I do know this: Given how easy it is to record data in a structured format and how straightforward it is to determine which parameters affect key outcomes in these datasets, I will never throw my data in Excel again. Because, hey, I’m a biochemist, or rather an experimental scientist, or rather a citizen data scientist after all.

Laura Jennings-Antipov