The LHC performs something like one billion proton-proton collisions per second. Of course, this
is an instantaneous rate; sometimes the LHC is shut down. Nevertheless, Run 2
of the LHC lasted for about four years. In that time, the LHC performed
something like \(10^{18}\) collisions. The total cost of the LHC, meanwhile,
is on the order of $10 billion, meaning that the cost per collision is a
measly \(10^{-8}\) dollars. This estimate is pretty robust to quibbling over
definitions. For instance, since the LHC represents a substantial fraction of
the total international budget for physics research, including the budgets of
non-CERN researchers in the cost estimate won’t move the needle very much.
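The per-collision arithmetic is short enough to check directly. Here is a minimal sketch using only the round numbers quoted above; the function and variable names are mine, and the "doubled budget" case is a hypothetical stand-in for a more generous definition of the LHC's cost.

```python
# Round figures quoted in the text (order-of-magnitude estimates, not official numbers).
TOTAL_COST_USD = 10e9      # "on the order of $10 billion"
TOTAL_COLLISIONS = 1e18    # "something like 10^18 collisions"


def cost_per_collision(total_cost_usd: float, collisions: float) -> float:
    """Return the average cost of a single collision, in dollars."""
    return total_cost_usd / collisions


baseline = cost_per_collision(TOTAL_COST_USD, TOTAL_COLLISIONS)

# Hypothetical stress test: double the budget to mimic a broader cost definition.
doubled = cost_per_collision(2 * TOTAL_COST_USD, TOTAL_COLLISIONS)

print(f"baseline:       ${baseline:.0e} per collision")
print(f"doubled budget: ${doubled:.0e} per collision")
```

Even a factor of two in the budget leaves the answer at the same order of magnitude, which is all the estimate claims.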
An alternative way of quantifying things: the LHC now stores about 90 petabytes of
data per year (it varies wildly by year, and has been increasing). This doesn’t contain the results of most collisions (which are
determined to be “boring” almost immediately); at the same time, it’s a
substantial amount of data stored for each collision that does make the
initial cuts. Arguably this is a better measure of the amount of “useful” data
obtained by the LHC. Rounding down to 200 petabytes (that’s \(2 \times 10^{15}\)
bytes) over the lifetime of the LHC so far, we see that one dollar buys 200
kilobytes of data. The cost has been going down, and I’ve made no attempt to
amortize the construction cost over the full lifetime. I expect the price to
eventually fall to a dollar per megabyte.
This is not normal. In most other fields, data is orders of magnitude more
expensive. Research on humans is the pathological case: each human must be
compensated at least for their time, and each human can only churn out so much
data per second. We can thus compute a crude lower bound for the cost per byte.
If the data is in the form of answers to a True/False questionnaire, each
research subject will churn out only one or two bytes of data per minute. If
subjects are compensated $1 per hour, then 100 kilobytes of this sort of data
costs $1000. Of course this is an extremely conservative estimate! I’ve
underestimated the cost to the participant and neglected the cost of the
researcher’s time, as well as the time it takes to compile the data, the cost
of renting the room, and so on. Maybe a more reasonable estimate is to say that
a low-end research project might cost $10,000, while collecting 100 True/False
responses from each of 1000 participants. I’m still trying to be conservative,
but now the same 100 kilobytes costs nearly $100,000.
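Both questionnaire estimates can be reproduced in a few lines. This is a rough sketch under stated assumptions: the helper name is hypothetical, a kilobyte is taken as 1000 bytes, and in the second scenario each True/False answer is counted as a single bit.

```python
BYTES_PER_100_KB = 100_000  # 100 kilobytes, decimal convention (assumption)


def cost_of_100_kb(cost_usd: float, bytes_collected: float) -> float:
    """Scale an observed (cost, bytes) pair up to the cost of 100 kilobytes."""
    return cost_usd * BYTES_PER_100_KB / bytes_collected


# Scenario 1: a subject paid $1 per hour who produces 1 or 2 bytes per minute.
slow = cost_of_100_kb(cost_usd=1.0, bytes_collected=1 * 60)
fast = cost_of_100_kb(cost_usd=1.0, bytes_collected=2 * 60)

# Scenario 2: a $10,000 project with 1000 participants x 100 True/False answers,
# counting each answer as one bit (8 bits per byte).
project_bytes = 1000 * 100 / 8
project = cost_of_100_kb(cost_usd=10_000, bytes_collected=project_bytes)

print(f"$1/hour subject: ${fast:,.0f} to ${slow:,.0f} per 100 kB")
print(f"$10,000 project: ${project:,.0f} per 100 kB")
```

The first scenario brackets the $1000 figure, and the second comes out somewhat under $100,000, in line with "nearly".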
We can do better by trying to collect more data: let’s point a camera at the
research subject! This is a good thought, but the translation from raw video to
actually usable data is extremely lossy, and typically involves a low-paid
undergraduate performing “coding” (not programming, but marking when the
research subject looked in various directions, for instance). The end result is
similar.
Similar bounds hold for any research involving humans. I don’t expect
any social science to come within a factor of 100 of the low cost of LHC data.
Medicine can do better in very specific circumstances. A single MRI costs
about $1000 while yielding around a gigabyte of data. As long as we don’t spend
too much time worrying about how much of that data is meaningful (remember that
LHC data goes through a strict cull before it’s stored), we get a superior
per-byte cost to the LHC!
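For the curious, the MRI comparison works out as follows. This sketch takes the numbers quoted in the text at face value (200 kilobytes per dollar for the LHC, $1000 and about a gigabyte for an MRI); the variable names are mine, and none of the inputs should be read as precise.

```python
# Figures as quoted in the text; all are rough estimates.
LHC_BYTES_PER_DOLLAR = 200e3   # "one dollar buys 200 kilobytes"
MRI_COST_USD = 1000.0          # "a single MRI costs about $1000"
MRI_BYTES = 1e9                # "around a gigabyte of data"

mri_bytes_per_dollar = MRI_BYTES / MRI_COST_USD

# Ratio > 1 means the MRI delivers more raw bytes per dollar than the LHC figure.
advantage = mri_bytes_per_dollar / LHC_BYTES_PER_DOLLAR

print(f"MRI: {mri_bytes_per_dollar:,.0f} bytes per dollar")
print(f"MRI advantage over the quoted LHC figure: {advantage:.1f}x")
```

Keep in mind that this counts raw bytes, with no allowance for how much of the MRI data is meaningful.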
The availability of cheap data is one often-overlooked part of why physics (and
nearby friends) is a qualitatively different enterprise than social science or
medicine. Many other aspects of modern physics, which may seem to be cultural,
or a result of historical quirks, are plausibly downstream of this
embarrassment of riches.
- The standard of evidence for a claim is far higher. A nominally “three sigma”
  result (corresponding to a p-value of .003, in social science terms) is
  typically referred to as “tension”, rather than being seen as a significant
  result.
- Sophisticated (or at least “sophisticated”) mathematical modeling is the
  norm.
- Upper bounds on effect sizes are considered publishable (and are often highly
  cited). These are termed “null results” in other fields, and are typically
  not publishable at all.
- Hostility to complex statistical tests. Important results are generally
  accompanied by a plot on which statistical significance is obvious at a
  glance.
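The translation between "sigma" and p-values in the first item of the list can be checked with the standard library. This is a minimal sketch assuming a Gaussian test statistic and a two-sided test, which is the convention that reproduces the .003 quoted above; one-sided conventions give half these values. The function name is hypothetical.

```python
import math


def two_sided_p_value(n_sigma: float) -> float:
    """Probability that a standard normal variable lies more than n_sigma from zero."""
    # erfc(x / sqrt(2)) is the two-sided tail probability of a standard normal.
    return math.erfc(n_sigma / math.sqrt(2))


for n_sigma in (2, 3, 5):
    print(f"{n_sigma} sigma -> p = {two_sided_p_value(n_sigma):.2e}")
```

Two sigma sits close to the familiar 0.05 threshold, which gives a sense of how much stricter a three sigma "tension" already is.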
Note that although these behaviors evolved in an evidence-rich environment,
that does not mean they are not useful in other circumstances. It’s important
to distinguish between traits that are “adaptive” in the sense of leading to
social success (and therefore are likely to be copied), and those that are
helpful for discerning the truth. In evidence-poor environments, the adaptive
traits need not be the ones that are most helpful for revealing true
statements.