Peer review is at the heart of the scientific process. As I have written about before, scientific results are deemed publishable by top journals and conferences only once they are given a stamp of approval by a panel of expert reviewers (“peers”). These reviewers act as a critical quality control, rejecting bogus or uninteresting results.
But peer review involves human judgment and as such it is subject to bias. One source of bias is a scientific paper’s authorship: reviewers may judge the work of unknown or minority authors more negatively, or judge the work of famous authors more positively, independent of the merits of the work itself.
The double-blind review process aims to mitigate authorship bias by withholding the identity of authors from reviewers. Unfortunately, simply removing author names from the paper (along with other straightforward prescriptions) may not be enough to prevent the reviewers from guessing who the authors are. If reviewers often guess and are correct, the benefits of blinding may not be worth the costs.
While I am a believer in double-blind reviewing, I have often wondered about its efficacy. So as part of the review process of CSF’16, I carried out an experiment:[ref]The structure of this experiment was inspired by the process Emery Berger put in place for PLDI’16, following a suggestion by Kathryn McKinley.[/ref] I asked reviewers to indicate, after reviewing a paper, whether they had a good guess about the authors of the paper, and if so to name the author(s). This post presents the results. In sum, reviewers often (2/3 of the time) had no good guess about authorship, but when they did, they were often correct (4/5 of the time). I think these results support using a double-blind process, as I discuss at the end.
The peer review process, reviewed
Here is a quick refresher on the peer review process.
Some scientists carry out some research on a particular topic and author a paper explaining their results. This paper is submitted to a venue that publishes scientific results, which in turn solicits the opinions of several experts (i.e., “peers” of the submitting scientists). After reading the paper, these experts render a judgment about whether the paper should be accepted for publication. This judgment is based on whether, in their view, the result is sufficiently important or thought-provoking, is correct, and whether it is well presented, i.e., so that it can be understood by the community that the publication venue caters to. Whether or not the paper is accepted, reviewer comments are sent back to the authors anonymously so that the authors can improve the paper and the work. Reviewer identities are kept hidden so that they can provide honest judgments without fear of retribution.
Single blind review
While reviewers are usually anonymous to authors, authors are often not anonymous to reviewers. In a single-blind review (SBR) process, authors’ identities may appear in the paper text itself and in metadata associated with the paper. The disadvantage with SBR is that despite reviewers’ best intentions, implicit (unconscious) bias can creep into their judgment. As such, the result of review may favor various groups, e.g., men over women, famous authors over unknown ones, authors at famous institutions over those at “lesser” ones, etc. Such biases have been observed in many contexts outside of peer review, and psychological studies have shown that human beings exhibit implicit bias systematically; check out the implicit association test to see the effect in yourself.
Double blind review
In a double-blind review (DBR) process, author identities are withheld from reviewers. Typically, authors’ names and affiliations are redacted from the paper’s text, and are hidden by the review management software. The paper’s text should also not reveal the authors’ identities indirectly; e.g., authors may be required to cite their own prior work in the third person (as though it could have been done by someone else), and/or they may be restricted from broadly advertising their work while it is under review. The intention is that reviewers should consider the merits of a paper based purely on its content, and not on preconceptions about the paper’s authors.
Author blinding, revealed
An important assumption of DBR is that the steps taken to blind the paper, like removing the authors’ names, actually succeed in masking authorship from the reviewers. If a reviewer can infer the paper’s authors in spite of these steps then one might wonder whether bias will creep back in. To test the effectiveness of blinding, I conducted an experiment to measure how often reviewers could guess author identities.
The experiment was carried out as part of the review process of the 2016 Computer Security Foundations Symposium (CSF), of which I was the program co-Chair, along with Boris Köpf. CSF’16 employed a light form of double-blind review. The authors were asked to redact their names from the paper, and to cite their own prior work in the third person. Authors were not required to change the names of well-known research systems they might have been writing about, since doing so might create doubt about authorship but could create confusion about related work. Authors were also permitted to post their work to their web page, give talks about it, etc. as part of the normal scientific process.
For each paper reviewed, a reviewer fills out a form describing their judgment of the paper. For the experiment, I extended the form to ask, first of all, if the reviewer had a guess about one or more authors of the paper they had just reviewed. If so, the form asked them to list the apparent authors. They could also optionally describe the basis of their guess.
For the 87 papers submitted, the program committee (and a handful of outsiders) performed 270 reviews. In 90 out of 270 cases, a reviewer had a guess about the paper’s authors. 74 times out of 90, the reviewer guessed at least one author correctly. In the remaining 16 cases, all guesses of authorship were incorrect.
In sum, most (67%) of the time, reviewers were not sufficiently confident about authorship to have a reasonable guess. In these cases, double blind helped avoid bias based on casual knowledge of the authors, their institution, their gender or nationality, etc. When reviewers did have a guess, they were right 82% of the time. But every once in a while (roughly 1 time in 5) they were wrong.
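The percentages above follow directly from the raw counts. Here is a minimal sketch in Python that recomputes them; the variable names are illustrative and not part of the original analysis.

```python
# Review counts reported for CSF'16 in the text above.
total_reviews = 270
guesses = 90           # reviews in which the reviewer offered a guess
correct_guesses = 74   # guesses naming at least one author correctly

incorrect_guesses = guesses - correct_guesses

# Share of reviews with no guess, and accuracy among the guesses.
no_guess_rate = (total_reviews - guesses) / total_reviews
correct_rate = correct_guesses / guesses
incorrect_rate = incorrect_guesses / guesses

print(f"no guess:          {no_guess_rate:.0%}")
print(f"correct guesses:   {correct_rate:.0%}")
print(f"incorrect guesses: {incorrect_rate:.0%}")
```

The incorrect-guess rate comes out just under 18%, which is the "roughly 1 time in 5" figure.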
Some other studies also consider blinding efficacy; Snodgrass summarizes some of these. I previously conducted a survey of POPL’12 reviewers to ask them to recall whether they had guessed (correctly or incorrectly) a submitted paper’s authorship. In that survey, 77% of those who guessed did so correctly. A flaw with this result was that it was based on recollection well after the fact. The experiment I report here ought to be more reliable since reviewers made a guess when writing their review.
Source of unmasking
Returning to the CSF’16 experiment, I asked reviewers to optionally indicate the reasons for their guess. Very often, reviewers stated that citations in the paper were a strong indicator. In particular, they assumed that the most closely related prior work was by the same authors. Many times they were right, but sometimes they were wrong. In one case, two different reviewers incorrectly guessed the authors to be those of the closest prior work. Another common basis for a guess was that a reviewer had seen an unblinded, prior version of the same paper.
Guesses and expertise
I also looked into how expertise correlates with guessing and guessing correctly. In particular, reviewers are asked to state, on the review form, their level of expertise in the subject area of the paper; the options are ‘X’ for expert, ‘Y’ for knowledgeable, and ‘Z’ for interested outsider. I found the expertise breakdown for the 90 guesses to be X=43, Y=35, and Z=12, and for the 180 non-guesses to be X=74, Y=75, and Z=31. Just eyeballing these numbers, it does seem that those with higher expertise are a bit more likely to guess authorship.
For the 16 guesses that were incorrect, the breakdown was X=6, Y=8, and Z=2. For those who guessed right, about half were X and the other half were Y or Z; for those who guessed wrong, fewer were expert (6 vs. 10). So perhaps higher expertise correlates with likelihood of guessing, and guessing correctly, but not by very much.
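Rather than eyeballing, the expertise breakdown can be turned into per-level rates. The sketch below is only an illustration built from the counts listed above; the dictionary names and the `rates` helper are hypothetical, and no significance test is attempted.

```python
# Counts per self-reported expertise level, as listed in the text.
guessed = {"X": 43, "Y": 35, "Z": 12}      # reviews that offered a guess
not_guessed = {"X": 74, "Y": 75, "Z": 31}  # reviews with no guess
wrong = {"X": 6, "Y": 8, "Z": 2}           # guesses that were incorrect


def rates(level):
    """Return (share of reviews with a guess, share of guesses that were right)."""
    reviews = guessed[level] + not_guessed[level]
    guess_rate = guessed[level] / reviews
    accuracy = (guessed[level] - wrong[level]) / guessed[level]
    return guess_rate, accuracy


for level in ("X", "Y", "Z"):
    guess_rate, accuracy = rates(level)
    print(f"{level}: guessed in {guess_rate:.0%} of reviews, "
          f"right in {accuracy:.0%} of guesses")
```

By these counts the guess rate drops from about 37% for experts to about 28% for interested outsiders, while accuracy does not vary in any clear order across the three levels. With samples this small, that fits the "not by very much" reading.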
Guesses and acceptance
One interesting question, originally raised in the comments below, is how guessing relates to decisions about acceptance. In particular, we might wonder whether accepted papers are more likely to be written by authors whose identities are readily guessed.
Of the 31 accepted papers, 25 of them had a reviewer that guessed the authors correctly, while in 5 cases no guesses were offered. In 6 cases, accepted papers had at least one incorrect guess, while in all but one of these there was also a correct guess. Considering individual reviews, of the 90 reviews done for the 31 accepted papers, 39 reviewers guessed right, 7 guessed wrong, and 44 had no guess.
Of the 56 rejected papers, 22 of them had a reviewer that guessed the authors correctly, while in 28 cases no guesses were offered. In 7 cases, rejected papers had at least one incorrect guess, and in 6 of these no correct guesses were offered. Considering individual reviews, of the 180 reviews done for the 56 rejected papers, 35 reviewers guessed right, 9 guessed wrong, and 136 had no guess.
Looking at these numbers, we can see that reviewers of accepted papers were more likely to offer a guess (46/90 reviews vs. 44/180 reviews), and most accepted papers had at least one of their authors guessed correctly (25/31 papers as compared to 22/56 for rejected papers). Reviewers who guessed wrong did so a bit more often for rejected papers (7/46 guesses for accepted ones vs. 9/44 for rejected ones). Also, accepted papers were more likely to have multiple reviewers correctly guess authorship.
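The comparison can be reproduced from the per-review and per-paper counts. This is a minimal sketch with illustrative names (`accepted`, `rejected`, `summarize`), taking the number of guesses in each group to be the right guesses plus the wrong ones.

```python
# Per-review counts from the text: right guesses, wrong guesses, total reviews.
accepted = {"right": 39, "wrong": 7, "reviews": 90}
rejected = {"right": 35, "wrong": 9, "reviews": 180}

# Per-paper counts: papers with at least one correct guess, total papers.
papers = {"accepted": (25, 31), "rejected": (22, 56)}


def summarize(group):
    """Return guess count, guess rate per review, and share of wrong guesses."""
    guesses = group["right"] + group["wrong"]
    return {
        "guesses": guesses,
        "guess_rate": guesses / group["reviews"],
        "wrong_share": group["wrong"] / guesses,
    }


for name, group in (("accepted", accepted), ("rejected", rejected)):
    s = summarize(group)
    hit, total = papers[name]
    print(f"{name}: {s['guesses']} guesses ({s['guess_rate']:.0%} of reviews), "
          f"{s['wrong_share']:.0%} of guesses wrong, "
          f"{hit / total:.0%} of papers guessed correctly at least once")
```

By these counts, roughly half of the reviews of accepted papers included a guess, versus about a quarter of the reviews of rejected papers.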
What should we take from these results?
Those against DBR might suggest that the cost of DBR to reviewers and authors is not worth the benefit. They might say that reviewers, when they guess, are very often right. The rest of the time, reviewers didn’t know the reviewed work well enough to guess authors, meaning that had they known the authors it may not have influenced their judgment. But I think it’s hard to justify the latter statement without more evidence.
Indeed, those in favor of DBR might point out that very often (67% of the time) reviewers could not (or did not) guess authorship, meaning that authorship could not be a source of bias. Even those reviewers who did guess authorship did so incorrectly 1 time in 5, on average. For reviewers who regularly participate in a double blind process, knowing that they are not always right may sow sufficient doubt that any guesses they have do not rise to the level of biasing judgment.
I find these arguments in favor of DBR to be convincing. Though DBR is far from perfect, it creates an expectation that authorship is not a factor in review, and it enforces this expectation sufficiently often. Moreover, the light form of DBR I used at CSF and POPL is not particularly costly, and both reviewers and authors seemed to feel it worked, as detailed in my Chair report for POPL’12.
That said, the analysis also showed that the paper of a guessed author was more likely to be accepted than a paper whose authorship was not guessed. Perhaps guessing (correctly) materially affected the final judgment? Or, perhaps being known within the community correlates with paper quality; after all, a history of publishing within a community should say something about the quality of the submitted work (we just don’t want that history to be a proxy for direct assessment).
Ideally we could go beyond studying the process, and instead measure outcomes, i.e., that DBR yields more papers from minorities, women, etc. who might otherwise be discriminated against. Unfortunately, it is very hard for me to see how to measure this effect directly, in a controlled manner. Until we can, we will have to do our best to strive for quality, fairness, and low cost.
Update, 11am EDT, June 28: In response to a comment below, I updated the post to discuss the correlation between paper acceptance/rejection and guessing authorship.