My mind on my measure, and my measure on my mind
Popular culture would have us believe that the main tools of the scientist are the microscope, or the glassware (with bubbling, colored, smoking liquids of course!), the amazing instrument on your favorite crime scene investigation television program (everyone knows that unknown chemical compound determination takes 2 hours, after all!), and so on. The reality is that data — the product of utilizing those tools effectively — is the real tool of a scientist. Data and its analysis are what provide us evidence of whether a hypothesis is correct. The instruments and tools don’t do that — they’re resources that enable the path to an answer.
And data are captured through measurement. Yet the characterization and qualification of the systems that make those measurements seem to get short shrift in the science world, almost like an afterthought to the work done to generate reagents and samples. It does take a lot of work to construct a new strain, synthesize a chemical, develop a material and so on. And yet, what is the value of all of that effort if we aren’t sure how good our measurements of those materials are?
Getting trustworthy measurements doesn’t happen by chance. And unfortunately most scientists still don’t take the necessary first step of characterizing the quality of their measurement systems before they use them for experiments.
What is the question I’m trying to answer anyway?
Measurement is all about assessing whether the performance of a process and/or product achieves some target. For example, does one material or another lead to a better outcome? To answer this, we need to understand what constitutes a meaningful difference between the tested materials, both pragmatically (relative to the business goals) and statistically (the literal capability of a measurement system to “see” a meaningful difference). It must be stressed that, in a commercial environment, the drivers behind this question are always the business goals: “good enough” can only be determined by what the targets of the business are.
Precision is money
Early in process development, we generally see large gains in performance toward such business goals: we’re forming new hypotheses (some are even correct from time to time!) or starting initial product development. We might enjoy step-changes in efficiency, performance, titer, etc. on the order of 10-30% during this phase. Gains seem easy. Everything seems to work. Precise, well-characterized measurement seems either unnecessary or to be happening all by itself.
But as our development cycle matures, we tend to run out of steam, and the scale of improvement decreases. In the more mature phases, we may observe improvements on the order of 1-3%. Even without a rigorous analysis, we can understand that it gets more difficult to observe differences as the scale of improvements decreases. Simply put, smaller differences require a more capable measurement system.
This later phase of development is also when performance improvements become most valuable, and when investment in measurement capability is most critical. The bulk of profit margins live in the incremental improvements in process efficiencies after crossing the break-even line in development — every improvement past this point is directly translated into bottom-line profit.
A measurement system’s ability to detect differences is formally known as its “capability.” Most of us think of “capability” as just the standard deviation, the spread in our measurement of a given characteristic (i.e., precision). But it’s a bit more than that: “capability” is an assessment of the variation in a system relative to the “business goal” (i.e., the measurement difference we want to detect).
System capability also brings a quantitative meaning to the vaguer notion of “reproducibility” that is commonly discussed in science today. In a scientific study, the business goal may be to detect a difference in a physiological, chemical or physical characteristic of study subjects. The reproducibility of an outcome is determined by the capability of the measurement system used to detect the difference. If the difference is smaller than what the system is capable of detecting, it will not be reproducible.
For our measurement system to support our business needs, we must continuously evaluate the system capability and drive its improvement. Improvements in capability can only come through a reduction in variability (i.e., lowering the standard deviation). We will explain how to do that in a future blog post. For now, we will illustrate the impact and value of variance reduction in system capability and reproducibility.
How much will a reduction in variation improve my system capability?
To illustrate the value of reducing our variation, we will look at several cases that are typical of how a measurement system evolves during development (provided below in Table 1). It is not uncommon for an assay to exhibit error of 5-10% relative standard deviation (RSD) early in its use. (The % RSD, alternatively called “CV” for coefficient of variation, is the measurement standard deviation divided by its mean value, expressed as a percentage.) This may drop to 3% or just below as we work on approaches to improve the precision (e.g., better sample preparation methodologies, automation, improved chromatography, etc.). And with a reasonable amount of work, we may be able to drive our assay variability down to a 1.3% RSD. Using a tool known as a power and sample size calculator, we can illustrate what happens as we lower the error this way (see Table 1).
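As a quick illustration of the % RSD defined above, the following Python sketch computes it from a handful of replicate readings. The function name `percent_rsd` and the readings are hypothetical, made up only to show the arithmetic; they are not data from any real assay.

```python
from statistics import mean, stdev


def percent_rsd(replicates):
    """Return the relative standard deviation (CV) as a percentage."""
    # Sample standard deviation (n - 1 denominator) divided by the mean.
    return 100.0 * stdev(replicates) / mean(replicates)


# Hypothetical replicate readings of one sample (arbitrary units).
readings = [98.0, 100.0, 102.0]
print(f"{percent_rsd(readings):.1f}% RSD")  # standard deviation of 2 on a mean of 100
```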
Let’s investigate Table 1. Along the left side (in yellow) are some theoretical capabilities of our measurement system, in terms of % RSD. The columns (with headers in blue) represent the sample count we’re willing to invest per condition or sample type; this is the number of replicates we’re running in a given experiment aimed at testing the mean performance of each sample. In the body of the table (in green) is the percent difference that we are able to confidently measure for a given combination of assay capability and sample count (given by the power and sample size calculator); this is the % Detectable Difference.
First, we observe that an improvement in our assay capability yields a directly proportional improvement in our ability to detect differences with the assay. If we move down a column, say for n=5, we see that a 5.8X decrease in % RSD (from 7.5% to 1.3%) translates directly to a 5.8X reduction in % Detectable Difference (from 19.6% to 3.4%). On the other hand, increases in sample replication deliver far less bang for the buck. If we move along a row, say for % RSD = 3%, we see that a 6.67X increase in replication (from 3 to 20 replicates) delivers only a 3.46X reduction in % Detectable Difference (from 12.1% to 3.5%). The reason for both of these phenomena will be explained in a future post.
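The two trends in Table 1 can be roughly reproduced with a textbook normal approximation for comparing two means. The sketch below is a stand-in under stated assumptions, not the JMP calculation behind the table: the function names are hypothetical, it assumes equal replicate counts and equal variability in both conditions, and it ignores the small-sample t-distribution correction, so it will not match the values in Table 1 exactly, particularly at low replicate counts.

```python
from math import ceil, sqrt
from statistics import NormalDist


def detectable_difference(rsd_percent, n_per_group, alpha=0.05, power=0.95):
    """Approximate smallest % difference between two means that is detectable.

    Normal approximation for a two-sided, two-sample comparison with
    n_per_group replicates in each condition (an assumption of this sketch).
    """
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return z_total * rsd_percent * sqrt(2 / n_per_group)


def replicates_needed(rsd_percent, difference_percent, alpha=0.05, power=0.95):
    """Approximate replicates per condition needed to detect a % difference."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_total * rsd_percent / difference_percent) ** 2)


# Detectable difference scales linearly with % RSD at a fixed replicate count...
for rsd in (7.5, 3.0, 1.3):
    print(f"RSD {rsd}%, n=5: ~{detectable_difference(rsd, 5):.1f}%")

# ...but only with the square root of the replicate count at a fixed % RSD.
for n in (3, 5, 10, 20):
    print(f"RSD 3.0%, n={n}: ~{detectable_difference(3.0, n):.1f}%")
```

In this approximation, halving the % RSD halves the detectable difference, whereas achieving the same halving through replication alone takes four times as many replicates.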
Reducing variation pays back twice
It is clear that the best way to improve our ability to detect differences is to lower our average assay variability, because at a fixed sample count, the improvement in capability translates directly into the percent difference we can observe. Meanwhile, an increase in sample count at fixed assay capability leads to diminishing returns (we will investigate this phenomenon further in a future post): each additional sample buys a smaller improvement in our ability to measure a difference than the one before it. Furthermore, an assay improvement represents a one-time cost of performing development work, while increasing sample count leads to a permanent increase in the cost of analysis. Each time we run the assay at a higher sample count, we are both spending more per sample (literal cost) and losing out on potentially measuring additional samples (opportunity cost). In other words, spending the capital to improve the measurement system means saving on operational expenses in the long run without sacrificing data quality.
For advice on how to reduce variability in your assays, reach out to us at firstname.lastname@example.org.
Need help structuring your data to help you find sources of variation in your assays? Check out Loren Perelman's related blog post, Data is Like Ikea Furniture, It's Best Shipped Flat.
This number (1.3% RSD) is the best-known capability that this author has observed in routine, high-throughput screening applications; perhaps there are instances of even better capability! This particular capability was achieved using an internal standard methodology to track tiny inconsistencies in injection volumes on GC-FID measurements.
The power and sample size calculator is a tool that allows us to calculate the number of samples required to detect a given difference at a given confidence level (e.g., with 95% or 99% confidence) for an assay with a given standard deviation. Values provided in Table 1 were calculated using the on-board calculator within JMP’s DOE evaluation tool set with an alpha value (false positive rate) of 0.05 and a power of 0.95.