Ineffective Theory

A Note on Peer Review

Every once in a while, the topic of peer review (and, usually, why it’s Horrible™) comes up on Hacker News, X, and I guess pretty much everywhere else. Here is a not-so-recent example (it was recent when I first started drafting this post). A typical quote begins:

Peer review is completely broken nowadays […]

Obviously, since I’m writing this note, I disagree. I do think that peer review is misunderstood—sometimes (very unfortunately) by reviewers and authors, often by the informed public, and essentially always by the media. Below, I’m going to try to lay out how I think of peer review in the age of arXiv and Twitter.

First, it’s important to understand that science, like everything else, is massively heterogeneous. There are enormous differences in norms, habits, and so on even between physics and astrophysics, both very quantitative, computational, “give me plots not words” fields. There are also rather large differences between different institutions and different generations. I can’t speak with any confidence to how things work outside of my corner (nuclear theory). Take everything below with that grain of salt.

The life of a paper, once drafted, looks like this:

  1. First it’s sent to “friends” of the authors, or in general people who they think might be interested. This step is optional—not everybody does it—but if you read “we are grateful to so-and-so for many useful comments on an early version of this manuscript”, this is typically the step it’s referring to.
  2. After getting feedback (or waiting a suitable amount of time), the paper is posted to arXiv. This is the big moment, and the point at which people say “congratulations”.
  3. Now the authors most likely wait, perhaps a few weeks. People will write to complain about not being cited, or perhaps say something more constructive.
  4. The authors submit the paper to some journal. In my field this is, more likely than not, Phys Rev D. Likely, at the same time, the arXiv version of the paper is updated to take into account the changes of the last few weeks.
  5. After some time—perhaps one month, perhaps three—the journal sends back a referee report. Quality of referee reports varies wildly. Sometimes it’s difficult to shake the feeling that the referees didn’t actually read the paper; other times (perhaps less often) referees manage to make comments substantial enough that they might have earned co-authorship in another context. This isn’t the place for “irritating referee” stories, but everybody has them. (I think everybody also has stories of that time they were the irritating referee.)
  6. There may be one or two rounds of back-and-forth with the referee. After some amount of improvements to the paper, the referee typically declares himself happy, and the editors accept the paper.
  7. A few weeks later, the paper is officially “published”, and will be referred to in the media as “peer-reviewed”. The few-week delay is because the journal employs some underpaid lackeys to edit the paper for typos and grammar, and also to “improve” formatting. Frequently enough, this introduces substantial errors into the paper which the authors do not manage to catch. Partly for this reason:
  8. Almost anybody who wants to read the paper, reads it on arXiv instead of from the journal.

Most of the meaningful “peer review” happens immediately before and after the arXiv posting. The one or two referees are largely incidental, and are less likely to give high-quality feedback than the hand-picked colleagues to whom the paper was originally sent.

There are many papers that just aren’t read that carefully before peer review. Maybe the authors don’t have friends in the right niche. Similarly, there are papers which are read by a few of the authors’ friends, but not by anyone outside of some tight-knit community. Formal peer review serves as a pretty strong incentive for both authors and potential readers to transfer information across subfield boundaries. That’s quite valuable.

It’s not clear if this situation is stable. Why not just neglect journal submissions altogether? There’s a weak incentive to get your papers into journals: it looks weird, when applying for jobs (and funding), if you don’t. Younger folks care less about this than older faculty, though, so it’s possible that the incentive to shepherd papers through the review process will get weaker over time, diluting one of the few formal mechanisms for accountability. Personally I rather like peer review as an institution (even if I frequently can’t stand dealing with referee reports), and hope that something much like it sticks around. If it doesn’t, I can’t imagine what would take its place.

Lastly note that, in all of the above, nobody had the job of “check that the paper is correct”. The authors do their best, of course, but can’t be meaningfully said to “check” anything. The recipients of an early draft of the paper are likely best-positioned to comment on correctness, and in my experience they often do, but they can hardly be said to be a disinterested third party. The reviewers assigned by the journal are not checking for correctness, but rather for obvious errors, and evaluating for relevance. As a result, papers are “published” without any dedicated check specifically designed to weed out incorrect results (with the exception of those measures put in place by the authors themselves).

This is as it should be! Published papers are the primary mechanism for physicists to communicate with each other. They’re published because open communication is preferable to just sending private notes between a small group of friends. To the extent that formal publication (as opposed to just posting to arXiv) accomplishes anything, it raises the profile of the paper to other researchers.

This is critical: the publication of a paper is intended as a signal to other researchers (perhaps including those somewhat outside the original field). It should not be taken as a signal to the media or to the general public, and the institution of formal peer review is completely unsuited to that task.

I think this raises an important question: what institution is responsible for signaling to the broader public that a result is correct? I don’t think we really have one. There are practices that attempt to fill this role: press releases, science journalists, review articles, and white papers. Press releases are clearly untrustworthy. Science journalism would be the most trustworthy source if the journalists were good, but that requires both deep technical knowledge and all the standard journalistic skills, and I think the combination is quite rare. Review articles are extremely valuable, but often written by interested parties, and incomprehensible to people too far from research. White papers do better, but they are uncommon, and still often not neutral.

As far as I can tell, creating a trusted and trustworthy institution for transferring information from researchers to the public remains an open problem.

Democracies are naturally gerontocracies

Under many models of voting behavior, elections disproportionately (relative to present-day demographics) represent the interests of the elderly. This happens organically, without needing to consider anything like the accumulation of wealth, social capital, or other relatively “un-democratic” structural factors. We need only assume that people’s interests are correctly represented in policy (it’s sufficient for the median voter theorem to hold), and some degree of gerontocracy often follows.

Minimal model: Everybody votes for their future interests, with no discounting. So, a \(40\)-year-old places equal weight on the interests of somebody aged \(42\) and somebody aged \(79\). Since this is meant to be illustrative, let’s assume everybody lives exactly \(80\) years (and everybody knows this fact). Then the electoral weight assigned to the interests of someone of age \(Y < 80\) is \[ W(Y) = \int_0^{Y} \!dT\, P(T) \text. \] Here \(P(T)\) is the current population at age \(T\). Note that I’ve assumed that everyone can vote from birth!
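
Every voter currently younger than \(Y\) expects to someday be \(Y\) years old, and so places weight on age-\(Y\) interests; that is why the integral runs over ages below \(Y\). For a uniform pyramid \(P(T) = p_0\), for instance, \(W(Y) = p_0 Y\): electoral weight grows linearly with age, and the interests of the oldest carry the most weight.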

Policy is set by the median voter, but does not reflect the interests of that voter at his current age. Rather it reflects his interests averaged between his current age and his death at \(80\). Perhaps more intuitively, policy is set by the age \(Y_*\) of the median bit of electoral weight: \[ \int_0^{Y_*} W(T)\, dT = \int_{Y_*}^{80} W(T) \,dT \] To give some intuition, consider two dramatically different population pyramids.
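
Here is a minimal numerical sketch of those two cases (the pyramids, a flat one and an exponentially decaying one with a \(20\)-year scale, are chosen purely for illustration):

```python
import numpy as np

AGE_MAX = 80.0
ages = np.linspace(0.0, AGE_MAX, 8001)
dT = ages[1] - ages[0]

def median_of(density):
    # Age at which the cumulative integral of `density` reaches half its total.
    cum = np.cumsum(density) * dT
    return ages[np.searchsorted(cum, cum[-1] / 2)]

def policy_age(P):
    # Y*: the median of the electoral weight W(Y) = integral_0^Y P(T) dT.
    W = np.cumsum(P) * dT
    return median_of(W)

pyramids = {
    "stationary (flat)": np.ones_like(ages),
    "young (exponential)": np.exp(-ages / 20.0),
}
for name, P in pyramids.items():
    print(f"{name}: median voter age {median_of(P):4.1f}, "
          f"policy age {policy_age(P):4.1f}")
```

Running this, the flat pyramid gives a policy age near \(57\) against a median voter age of \(40\), while the young pyramid roughly doubles that gap.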

Without discounting, younger societies (holding life expectancy fixed) are more severely gerontocratic.

Future discounting: Now we let voters display time preference. In general we write a discount function \(f(T)\), normalized to \(f(0)=1\), indicating how much the voter cares about events a time \(T\) in the future. Without needing to do any calculation, it’s clear that this is going to alleviate gerontocracy; how much depends largely on how steep the discount function is.

Naively, a rational discount function is just a decaying exponential \(f(T) = e^{-T/T_0}\). Empirical work shows that the typical discount function is in fact “hyperbolic”, which here just means that the decay rate is greater at early times and lesser at later times. That is, the ratio of utilities at two times \(\delta\) apart, \[ R(\delta; T) = \frac{f(T+\delta)}{f(T)} \text, \] depends on how far in the future we’re making this measurement (and gets closer to \(1\) as we imagine things further in the future).
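
For concreteness (these functional forms are standard illustrations, not taken from any particular empirical fit): exponential discounting gives \(R(\delta; T) = e^{-\delta/T_0}\), independent of \(T\), while the simplest hyperbolic form \(f(T) = 1/(1+kT)\) gives \[ R(\delta; T) = \frac{1+kT}{1+k(T+\delta)} \longrightarrow 1 \quad \text{as } T \to \infty \text, \] so a fixed interval \(\delta\) matters less and less the further in the future it sits.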

Note that in most democracies, there’s a good delay (several months) between the act of voting and the moment the newly-electeds take office. The delay between the act of voting and the arrival of consequences is even longer. That means that the effective discounting rate at work is the lesser, long-term one; in other words, the relevant discount function is fairly close to flat. I suppose this suggests an unconventional remedy to gerontocracy: increase the voters’ rate of time preference by shortening the time between the election and the arrival of consequences. The tradeoff, of course, is that government collectively shifts its time-horizon nearer.

Children: Now we add family relations into the mix, and attempt to account for the fact that people vote for the interests of their relatives. There are many possible versions of this model, differing in which relatives a voter weights, and by how much.

In all of these versions we also have to consider the weights that people assign to their relatives. It seems sensible to assume that these weights are reciprocal: if Sally places weight \(A\) on her own interests and weight \(B\) on those of John, then John weights his own interests \(A\) and Sally’s \(B\). As long as weights are reciprocal in this fashion, children and other family relations change voting patterns, but do not affect policy.
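
Here is a minimal sketch of that claim (the four-person electorate and the specific weights are invented for illustration):

```python
import numpy as np

# Row i says how voter i splits their single vote across everyone's
# interests; the numbers are made up for illustration.
W = np.array([
    [0.6, 0.4, 0.0, 0.0],   # Sally: weight 0.6 on herself, 0.4 on John
    [0.4, 0.6, 0.0, 0.0],   # John reciprocates
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.3, 0.7],
])

assert np.allclose(W, W.T)              # weights are reciprocal
assert np.allclose(W.sum(axis=1), 1.0)  # everyone casts exactly one vote

# Total weight landing on each person's interests:
print(W.sum(axis=0))  # [1. 1. 1. 1.] -- same as purely selfish voting
```

Reciprocity makes the weight matrix symmetric, so each column sum equals the corresponding row sum: every person’s interests receive exactly the one vote that person casts, just as under purely selfish voting.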

Bayesian saturation

I show you a coin, and I tell you “both sides of this coin are heads”. However, I refuse to give you the coin—for whatever reason, I insist on demonstrating this fact by repeatedly flipping it. If I’m telling the truth, it’ll always come up heads.

Okay, so I start flipping it, and lo and behold it always comes up heads. I flip it, say, \(1000\) times, and every time it comes up heads. The chance of this happening with a fair coin is \(2^{-1000}\). Is your probability that the coin is fair now of order \(2^{-1000}\)?

No, of course not. The coin could have two heads (maybe it’s more likely than not), but it could also be that I’m somehow being sneaky in flipping it. Maybe it’s weighted (probability at least a few percent), or I’m playing some trick with a magnet (seems like a good idea; probability at least 1%), or I’ve drugged you into hallucinating. All of these probabilities are comically larger than that naive \(2^{-1000}\). And none of these probabilities is affected by you watching me flip the coin another time!

This is part of why constructing small probabilities is hard. It is never sufficient to keep gathering more of one type of evidence, since eventually you run up against the (unmoving) probability that this type of evidence happens to be worthless. Your posterior probability estimate saturates at some value strictly less than \(1\) (or greater than \(0\)).
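
A toy version of the coin example makes the saturation visible (the priors are invented for illustration):

```python
# Toy Bayesian update: three hypotheses about the always-heads coin,
# two of which predict heads every single time.
priors = {"fair": 0.90, "two-headed": 0.08, "trick": 0.02}
p_heads = {"fair": 0.5, "two-headed": 1.0, "trick": 1.0}

for n in (10, 100, 1000):
    # Posterior after observing n heads in a row.
    unnorm = {h: priors[h] * p_heads[h] ** n for h in priors}
    Z = sum(unnorm.values())
    posterior = {h: p / Z for h, p in unnorm.items()}
    print(n, {h: round(p, 4) for h, p in posterior.items()})
```

The posterior on “fair” collapses like \(2^{-n}\), but the posterior on “two-headed” saturates at \(0.08/(0.08+0.02) = 0.8\): each additional flip is predicted equally well by the trick hypothesis, so no number of flips moves past that ceiling.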

Hopefully this is all obvious, but it has some consequences.

Alex Tabarrok argues that one should trust literature, not papers. I cannot agree. A paper that makes a concrete argument, I can read closely, follow, and come to a sort of “meeting of minds” with the author. In other words I can understand it, and whatever pieces of evidence the authors believe they have found, I can take them on board as evidence myself. If it’s a good paper, then at the end, I will have updated my estimates in the same manner as the authors did. A paper—one paper—is a fantastic tool for this. I know of no better.

A literature, though. I cannot read and understand a literature. The existence (and size, and other bulk characteristics) of the literature must themselves serve as evidence. But this is poor evidence! If a priori I believe that “a literature on a topic this important has a 40% chance of being substantially corrupted by social/political/economic forces”, then the maximum probability I can assign to a typical claim is going to be 60%. (Those are, in fact, typical of my estimates in these situations.)
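
To spell out the arithmetic (writing “sound” for an uncorrupted literature, and granting that a corrupted literature lends essentially no support to any particular claim): \[ P(\text{claim}) = P(\text{claim} \mid \text{sound})\,P(\text{sound}) + P(\text{claim} \mid \text{corrupted})\,P(\text{corrupted}) \le 1 \cdot 0.6 + \epsilon \cdot 0.4 \approx 0.6 \text. \]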

This wouldn’t be such a problem if I knew how to evaluate a large set of papers to determine whether there are such corrupting forces at work. But I don’t—certainly not well enough to reduce that 40% below, say, 20%. I have enough trouble constructing probabilities outside of \([0.1,0.9]\) in physics, a field without strong external social forces (and where I actually work).

The result is that, outside of cases where a single paper (or a small set of papers) can lay out a convincing argument, a rational person probably shouldn’t be strongly convinced of any individual claim coming out of research. Scenarios where the literature is systemically biased are too likely, and are not made much less likely (at least as far as I can tell) by the literature being large, or using diverse methods, or using the latest techniques.