Ineffective Theory
Every once in a while, the topic of peer review (and, usually, why it’s Horrible™) comes up on Hacker News, X, and I guess pretty much everywhere else. Here is a not-so-recent example (it was recent when I first started drafting this post). A typical quote begins:
Peer review is completely broken nowadays […]
Obviously, since I’m writing this note, I disagree. I do think that peer review is misunderstood—sometimes (very unfortunately) by reviewers and authors, often by the informed public, and essentially always by the media. Below, I’m going to try to lay out how I think of peer review in the age of arXiv and Twitter.
First, it’s important to understand that science, like everything else, is massively heterogeneous. There are enormous differences in norms, habits, and so on even between physics and astrophysics, both very quantitative, computational, “give me plots not words” fields. There are also rather large differences between different institutions and different generations. I can’t speak with any confidence to how things work outside of my corner (nuclear theory). Take everything below with that grain of salt.
The life of a paper, once drafted, looks like this:
- First it’s sent to “friends” of the authors, or in general people who they think might be interested. This step is optional—not everybody does it—but if you read “we are grateful to so-and-so for many useful comments on an early version of this manuscript”, this is typically the step it’s referring to.
- After getting feedback (or waiting a suitable amount of time), the paper is posted to arXiv. This is the big moment, and the point at which people say “congratulations”.
- Now the authors most likely wait, perhaps a few weeks. People will write to complain about not being cited, or perhaps say something more constructive.
- The authors submit the paper to some journal. In my field this is, more likely than not, Phys Rev D. Likely, at the same time, the arXiv version of the paper is updated to take into account the changes of the last few weeks.
- After some time—perhaps one month, perhaps three—the journal sends back a referee report. Quality of referee reports varies wildly. Sometimes it’s difficult to shake the feeling that the referees didn’t actually read the paper; other times (perhaps less often) referees manage to make comments substantial enough that they might have earned co-authorship in another context. This isn’t the place for “irritating referee” stories, but everybody has them. (I think everybody also has stories of that time they were the irritating referee.)
- There may be one or two rounds of back-and-forth with the referee. After some improvements to the paper, the referee typically declares himself happy, and the editors accept the paper.
- A few weeks later, the paper is officially “published”, and will be referred to in the media as “peer-reviewed”. The few-week delay is because the journal employs some underpaid lackeys to edit the paper for typos and grammar, and also to “improve” formatting. Frequently enough, this introduces substantial errors into the paper which the authors do not manage to catch. Partly for this reason:
- Almost anybody who wants to read the paper, reads it on arXiv instead of from the journal.
Most of the meaningful “peer review” happens immediately before and after the arXiv posting. The one or two referees are largely incidental, and are less likely to give high-quality feedback than the hand-picked colleagues to whom the paper was originally sent.
There are many papers that just aren’t read that carefully before peer review. Maybe the authors don’t have friends in the right niche. Similarly, there are papers which are read by a few of the authors’ friends, but not by anyone outside of some tight-knit community. Formal peer review serves as a pretty strong push to incentivize both authors and potential readers to transfer information across subfield boundaries. That’s quite valuable.
It’s not clear if this situation is stable. Why not just skip journal submission altogether? There’s a weak incentive to get your papers into journals: it looks weird, when applying for jobs (and funding), if you don’t. Younger folks care less about this than older faculty, though, so it’s possible that the incentive to shepherd papers through the review process will get weaker over time, diluting one of the few formal mechanisms for accountability. Personally I rather like peer review as an institution (even if I frequently can’t stand dealing with referee reports), and hope that something much like it sticks around; if it doesn’t, I can’t imagine what would replace it.
Lastly note that, in all of the above, nobody had the job of “check that the
paper is correct”. The authors do their best, of course, but can’t be
meaningfully said to “check” anything. The recipients of an early draft of the
paper are likely best-positioned to comment on correctness, and in my
experience they often do, but they can hardly be said to be an uninterested
third party. The reviewers assigned by the journal are not checking for
correctness, but rather for obvious errors, and evaluating for relevance. As a
result papers are “published” without any dedicated check specifically designed
to weed out incorrect results (with the exception of those measures put in
place by the authors themselves).
This is as it should be! Published papers are the primary mechanism for
physicists to communicate with each other. They’re published because open
communication is preferable to just sending private notes between a small group
of friends. To the extent that formal publication (as opposed to just posting
to arXiv) accomplishes anything, it raises the profile of the paper to other
researchers.
This is critical: the publication of a paper is intended as a signal to other
researchers (perhaps including those somewhat outside the original field). It
should not be taken as a signal to the media or to the general public, and the
institution of formal peer-review is completely unsuited to that task.
I think this raises an important question: what institution is responsible for
signaling to the broader public that a result is correct? I don’t think we
really have one. There are practices that appear to attempt to fill this role:
press releases, science journalists, review articles, and white papers. Press
releases are clearly untrustworthy. Science journalism would be the most
trustworthy source if the journalists were good, but this requires both deep
technical knowledge and all the standard journalistic practice, and I think
that’s quite rare. Review articles are extremely valuable, but often written by
interested parties, and incomprehensible to people too far from research. White
papers do better, but they are uncommon, I think, and still often not neutral.
As far as I can tell, creating a trusted and trustworthy institution for
transferring information from researchers to the public remains an open
problem.
Under many models of voting behavior, elections disproportionately (relative to
present-day demographics) represent the interests of the elderly. This happens
organically, without needing to consider anything like the accumulation of
wealth, social capital, or other relatively “un-democratic” structural factors.
We need only assume that people’s interests are correctly represented in policy
(it’s sufficient for the median voter theorem to hold), and some degree of
gerontocracy often follows.
Minimal model: Everybody votes for their future interests, with no discounting. So, a \(40\)-year-old places equal weight on the interests of somebody aged \(42\) and somebody aged \(79\). Since this is meant to be illustrative, let’s assume everybody lives exactly \(80\) years (and everybody knows this fact). Then the electoral weight assigned to the interests of someone of age \(Y < 80\) is
\[
W(Y) = \int_0^{Y} \!dT\, P(T)
\text.
\]
Here \(P(T)\) is the current population at age \(T\). Note that I’ve assumed that everyone can vote from birth!
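For example, a flat pyramid \(P(T) = P_0\) (the first case considered below) gives
\[
W(Y) = \int_0^{Y} \!dT\, P_0 = P_0\, Y
\text,
\]
so the interests of the old carry more electoral weight than those of the young, simply because more voters still have those years ahead of them.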
Policy is set by the median voter, but does not reflect the interests of that voter at his current age. Rather it reflects his interests averaged between his current age and his death at \(80\). Perhaps more intuitively, policy is set by the age \(Y_*\) of the median bit of electoral weight:
\[
\int_0^{Y_*} W(T)\, dT = \int_{Y_*}^{80} W(T) \,dT
\]
To give some intuition, consider two dramatically different population pyramids.
- A steady-state society with no growth (and, again, deterministic death at 80) has a flat pyramid \(P(T) = P_0\). The median age is \(40\); democracy will best reflect the interests of someone of age \(60\).
- A young society, with all ages uniformly distributed between \(0\) and \(40\). Democracy best reflects the interests of someone aged \(50\)—older than any current member of society.
Without discounting, younger societies (holding life expectancy fixed) are more severely gerontocratic.
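To make the two examples above concrete, here is a minimal numerical sketch. It uses the median-voter reading of the model (policy reflects the median voter’s interests averaged over the rest of his life, so the best-represented age is the midpoint of the median voter’s remaining years); the population arrays are purely illustrative.

```python
import numpy as np

LIFESPAN = 80  # deterministic death at 80, as assumed above

def represented_age(population):
    """population[t] = number of people currently aged t (t = 0, ..., 79).

    Median-voter reading: the median voter's vote reflects his interests
    averaged uniformly over his remaining years, so the best-represented
    age is the midpoint of [median age, LIFESPAN].
    """
    cdf = np.cumsum(population) / np.sum(population)
    median_age = np.searchsorted(cdf, 0.5)
    return (median_age + LIFESPAN) / 2

# Steady-state society: flat pyramid, equal numbers at every age.
flat = np.ones(LIFESPAN)
print(represented_age(flat))   # ~60

# Young society: everyone is between 0 and 40.
young = np.concatenate([np.ones(40), np.zeros(40)])
print(represented_age(young))  # ~50
```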
Future discounting: Now we let voters display time preference. In general we write a discount function \(f(T)\), normalized to \(f(0)=1\), indicating how much the voter cares about events a time \(T\) in the future. Without needing to do any calculation, it’s clear that this is going to alleviate gerontocracy; how much depends largely on how steep the discount function is.
Naively, a rational discount function is just a decaying exponential \(f(T) = e^{-T/T_0}\). Empirical work shows that the typical discount function is in fact “hyperbolic”, which here just means that the decay rate is greater at early times and lesser at later times. That is, the ratio of utilities at two times \(\delta\) apart:
\[
R(\delta; T) = \frac{f(T+\delta)}{f(T)}
\]
is dependent on how far in the future we’re making this measurement (and gets closer to \(1\) as we imagine things further in the future).
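Here is a small sketch of that distinction. The hyperbolic form \(f(T) = 1/(1 + kT)\) and the parameter values are illustrative choices, not anything pinned down above; the point is just that the exponential ratio \(R(\delta; T)\) is the same at every \(T\), while the hyperbolic ratio climbs toward \(1\).

```python
import numpy as np

T0, k, delta = 10.0, 0.5, 1.0  # illustrative parameters

def f_exp(T):
    """Exponential discounting: constant decay rate."""
    return np.exp(-T / T0)

def f_hyp(T):
    """Hyperbolic discounting (one common form): steep early, shallow late."""
    return 1.0 / (1.0 + k * T)

def R(f, delta, T):
    """Ratio of utilities at two times delta apart, measured at horizon T."""
    return f(T + delta) / f(T)

for T in [0.0, 5.0, 50.0]:
    print(f"T={T:5.1f}  exponential R={R(f_exp, delta, T):.3f}  "
          f"hyperbolic R={R(f_hyp, delta, T):.3f}")
# Exponential: R is 0.905 at every horizon.  Hyperbolic: R rises from
# 0.667 toward 1 as T grows, i.e. the far future is discounted more gently.
```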
Note that in most democracies, there’s a good delay (several months) between the act of voting and the moment the newly-electeds take office. The delay between the act of voting and the arrival of consequences is even longer. That means that the effective discounting rate at work is the lesser, long-term one; in other words, the relevant discount function is fairly close to flat. I suppose this suggests an unconventional remedy to gerontocracy: increase the voters’ rate of time preference by shortening the time between the election and the arrival of consequences. The tradeoff, of course, is that government collectively shifts its time-horizon nearer.
Children: Now we add family relations into the mix, and attempt to account for the fact that people vote for the interests of their relatives. There are many possible versions of this model:
- Children consider their parents’ interests, and parents their children’s;
- Voters consider the interests of all ancestors (parents, grandparents) and descendants;
- Voters consider the interests of their household.
In all of the above we also have to consider the weights that people assign to
their relatives. It seems sensible to assume that these weights are reciprocal: if Sally places weight \(A\) on her own interests and weight \(B\) on those of John, then John weights his own interests \(A\) and Sally’s \(B\). As long as weights are reciprocal in this fashion, children and other family relations change voting patterns, but do not affect policy.
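A toy check of that claim, under assumptions of my own choosing: three voters, each voter’s concern expressed as a row of a weight matrix that sums to one, and policy driven by the total weight landing on each person’s interests (the column sums). If the matrix is symmetric, the column sums match the selfish-voting case exactly.

```python
import numpy as np

# Each row is one voter's budget of concern, spread over everyone's interests.
# Selfish voting: the identity matrix (everyone cares only about themselves).
selfish = np.eye(3)

# Reciprocal weights: symmetric, with rows still summing to 1.
# Illustrative numbers: Sally (0) and John (1) each keep 0.7 for themselves
# and place 0.3 on the other; voter 2 has no family in this toy example.
reciprocal = np.array([
    [0.7, 0.3, 0.0],
    [0.3, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])

# Total electoral weight on each person's interests = column sums.
print(selfish.sum(axis=0))     # [1. 1. 1.]
print(reciprocal.sum(axis=0))  # [1. 1. 1.]  -- identical, so policy is unchanged
```

Individual ballots change, but the totals over interests do not; that is the sense in which reciprocity leaves policy alone.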
I show you a coin, and I tell you “both sides of this coin are heads”. However,
I refuse to give you the coin—for whatever reason, I insist on
demonstrating this fact by repeatedly flipping it. If I’m telling the truth,
it’ll always come up heads.
Okay, so I start flipping it, and lo and behold it always comes up heads. I
flip it, say, \(1000\) times, and every time it comes up heads. The chance of
this happening with a fair coin is \(2^{-1000}\). Is your probability that the
coin is fair now of order \(2^{-1000}\)?
No, of course not. The coin could have two heads (maybe it’s more likely than
not), but it could also be that I’m somehow being sneaky in flipping it. Maybe
it’s weighted (probability at least a few percent), or I’m playing some trick
with a magnet (seems like a good idea; probability at least 1%), or I’ve
drugged you into hallucinating. All of these probabilities are comically larger
than that naive \(2^{-1000}\). And none of these probabilities is affected by
you watching me flip the coin another time!
This is part of why constructing small probabilities is
hard. It is never sufficient to keep gathering
more of one type of evidence, since eventually you run up against the
(unmoving) probability that this type of evidence happens to be worthless. Your
posterior probability estimate saturates at some value strictly less than
\(1\) (or greater than \(0\)).
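As a minimal sketch of that saturation (the hypotheses and priors below are made-up numbers, just for illustration): split the possibilities into “genuinely two-headed”, “fair and honestly flipped”, and “fair but the flipping is rigged”. The posterior that the coin is fair falls fast at first, then stalls.

```python
import numpy as np

# Hypotheses about the situation (priors are made-up, purely illustrative):
#   0: the coin really is two-headed, as claimed
#   1: the coin is fair and the flips are honest
#   2: the coin is fair but the flipping is rigged (magnet, sleight of hand, ...)
priors  = np.array([0.50, 0.45, 0.05])
p_heads = np.array([1.0, 0.5, 1.0])   # P(heads on one flip | hypothesis)

for n in [0, 10, 100, 1000]:
    posterior = priors * p_heads ** n   # likelihood of n heads in a row
    posterior /= posterior.sum()
    p_fair = posterior[1] + posterior[2]
    print(f"n={n:4d}  P(coin is fair) = {p_fair:.4f}")
# P(coin is fair) falls quickly at first, but saturates near
# priors[2] / (priors[0] + priors[2]) ~ 0.09: nowhere near 2**-1000,
# and more flips can't push it lower.
```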
Hopefully this is all obvious, but it has some consequences.
Alex Tabarrok argues that one should trust literature, not
papers.
I cannot agree. A paper that makes a concrete argument, I can read closely,
follow, and come to a sort of “meeting of minds” with the author. In other
words I can understand it, and whatever pieces of evidence the authors
believe they have found, I can take them on board as evidence myself. If it’s a
good paper, then at the end, I will have updated my estimates in the same
manner as the authors did. A paper—one paper—is a fantastic tool
for this. I know of no better.
A literature, though. I cannot read and understand a literature. The
existence (and size, and other bulk characteristics) of the literature must
themselves serve as evidence. But this is poor evidence! If a priori I
believe that “a literature on a topic this important has a 40% chance of being
substantially corrupted by social/political/economic forces”, then the maximum
probability I can assign to a typical claim is going to be 60%. (Those are, in
fact, typical of my estimates in these situations.)
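Spelled out, with the (my own, simplifying) assumption that a corrupted literature is essentially worthless as evidence, so that conditional on corruption my credence in the claim stays near some small prior \(p_0\):
\[
P(\text{claim}) = 0.6\, P(\text{claim} \mid \text{clean}) + 0.4\, P(\text{claim} \mid \text{corrupted})
\le 0.6 + 0.4\, p_0
\text.
\]
No amount of additional literature pushes this past roughly \(60\%\); only evidence about the corruption itself can.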
This wouldn’t be such a problem if I knew how to evaluate a large set of papers
to determine whether there are such corrupting forces at work. But I
don’t—certainly not well enough to reduce that 40% below, say, 20%. I have
enough trouble constructing probabilities outside of \([0.1,0.9]\) in
physics, a field without strong external social forces (and where I actually
work).
The result is that, outside of cases where a single paper (or a small set of
papers) can lay out a convincing argument, a rational person probably shouldn’t
be strongly convinced of any individual claim coming out of research. Scenarios
where the literature is systemically biased are too likely, and are not made
much less likely (at least as far as I can tell) by the literature being large,
or using diverse methods, or using the latest techniques.