8 Mar 2022

Cheap Data!

The LHC performs something like one billion proton-proton collisions per second. Of course, this is an instantaneous rate: sometimes the LHC is shut down. Nevertheless, Run 2 of the LHC lasted for about four years. Four years is roughly $10^{8}$ seconds, so even continuous running would give at most about $10^{17}$ collisions; allowing for downtime, the LHC performed something like $10^{16}$ collisions in that time. The total cost of the LHC, meanwhile, is on the order of $10 billion, meaning that the cost per collision is a measly $10^{-6}$ dollars. This estimate is pretty robust to quibbling over definitions. For instance, since the LHC represents a substantial fraction of the total international budget for physics research, including the budgets of non-CERN researchers in the cost estimate won’t move the needle very much.

An alternative way of quantifying things: the LHC now stores about 90 petabytes of data per year (it varies wildly by year, and has been increasing). This doesn’t contain the results of most collisions (which are determined to be “boring” almost immediately); at the same time, it’s a substantial amount of data stored for each collision that does make the initial cuts. Arguably this is a better measure of the amount of “useful” data obtained by the LHC. Rounding down to 200 petabytes (that’s $2 \times 10^{17}$ bytes) over the lifetime of the LHC so far, and dividing by the same $10 billion, we see that one dollar buys about 20 megabytes of data. The cost has been going down, and I’ve made no attempt to amortize the construction cost over the full lifetime, so I expect the price per byte to keep falling.

This is not normal. In most other fields, data is orders of magnitude more expensive. Research on humans is the pathological case: each human must be compensated at least for their time, and each human can only churn out so much data per second. We can thus compute a crude lower bound for the cost per byte. If the data is in the form of answers to a True/False questionnaire, each research subject will churn out only one or two bytes of data per minute. If subjects are compensated $1 per hour, then 100 kilobytes of this sort of data costs $1000. Of course this is an extremely conservative estimate! I’ve underestimated the cost to the participant and neglected the cost of the researcher’s time, as well as the time it takes to compile the data, the cost of renting the room, and so on. Maybe a more reasonable estimate is to say that a low-end research project might cost $10,000, while collecting 100 True/False responses from each of 1000 participants. I’m still trying to be conservative, but now the same 100 kilobytes costs nearly $100,000.
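
The questionnaire arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the inputs are the rough figures quoted in the text, one True/False answer is counted as one bit, and the function and variable names are made up for illustration.

```python
# Back-of-envelope cost of True/False questionnaire data.
# Inputs are the rough figures from the text; all names are illustrative.

BITS_PER_BYTE = 8


def dollars_per_100kb(total_cost_dollars: float, total_bits: float) -> float:
    """Cost of 100 kilobytes (10^5 bytes) given a total cost and data volume."""
    total_bytes = total_bits / BITS_PER_BYTE
    return total_cost_dollars / total_bytes * 100_000


# Crude lower bound: $1 buys one hour at one to two bytes per minute.
for bytes_per_minute in (1, 2):
    bits_per_hour = bytes_per_minute * BITS_PER_BYTE * 60
    cost = dollars_per_100kb(1.0, bits_per_hour)
    print(f"{bytes_per_minute} byte(s)/min: ${cost:,.0f} per 100 kB")

# Low-end project: $10,000 for 100 responses from each of 1000 participants.
project = dollars_per_100kb(10_000, 100 * 1000)
print(f"low-end project: ${project:,.0f} per 100 kB")
```

The lower bound comes out at roughly $800 to $1,700 per 100 kilobytes, which is where the round figure of $1000 comes from, and the low-end project lands at $80,000, hence “nearly $100,000”.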

We can do better by trying to collect more data: let’s point a camera at the research subject! This is a good thought, but the translation from raw video to actually usable data is extremely lossy, and typically involves a low-paid undergraduate performing “coding” (not programming, but marking when the research subject looked in various directions, for instance). The end result is a similar cost per usable byte.

Similar bounds hold for any research involving humans. I don’t expect any social science to come within a factor of 100 of the low cost of LHC data. Medicine can do better in very specific circumstances. A single MRI costs about $1000 while yielding around a gigabyte of data, or roughly a megabyte per dollar. As long as we don’t spend too much time worrying about how much of that data is meaningful (remember that LHC data goes through a strict cull before it’s stored), we get a per-byte cost in the same league as the LHC’s!

The availability of cheap data is one often-overlooked part of why physics (and nearby friends) is a qualitatively different enterprise than social science or medicine. Many other aspects of modern physics, which may seem to be cultural, or a result of historical quirks, are plausibly downstream of this embarrassment of riches.

- The standard of evidence for a claim is far higher. A nominally “three sigma” result (corresponding to a p-value of 0.003, in social science terms) is typically referred to as “tension”, rather than being seen as a significant result.
- Sophisticated (or at least “sophisticated”) mathematical modeling is the norm.
- Upper bounds on effect sizes are considered publishable (and are often highly cited). These are termed “null results” in other fields, and are typically not publishable at all.
- Hostility to complex statistical tests. Important results are generally accompanied by a plot on which statistical significance is obvious at a glance.
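
For readers who want to translate between “sigmas” and p-values, here is a minimal Python sketch. It assumes Gaussian statistics and a two-sided test; both are assumptions made for illustration (particle physics results are often quoted one-sided, which halves these numbers), and the function name is invented.

```python
import math


def two_sided_p_value(n_sigma: float) -> float:
    """Chance that a standard normal variable lands more than n_sigma from zero."""
    return math.erfc(n_sigma / math.sqrt(2))


for n_sigma in (2, 3):
    print(f"{n_sigma} sigma -> p = {two_sided_p_value(n_sigma):.2g}")
```

Under these assumptions two sigma corresponds to p of about 0.046, close to the familiar 0.05 threshold, while three sigma corresponds to about 0.0027, which rounds to the figure quoted in the list.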

Note that although these behaviors evolved in an evidence-rich environment, that does not mean they are not useful in other circumstances. It’s important to distinguish between traits that are “adaptive” in the sense of leading to social success (and therefore are likely to be copied), and those that are helpful for discerning the truth. In evidence-poor environments, the adaptive traits need not be the ones that are most helpful for revealing true statements.