My colleague Emery Berger recently pointed me to the paper Single versus Double Blind Reviewing at WSDM 2017. This paper describes the results of a controlled experiment to test the impact of hiding authors’ identities during parts of the peer review process. The authors of the experiment—PC Chairs of the 2017 Web Search and Data Mining (WSDM’17) conference—examined the reviewing behavior of two sets of reviewers for the same papers submitted to the conference. They found that author identities were highly significant factors in a reviewer’s decision to recommend the paper be accepted to the conference. Both the fame of an author and the author’s affiliation were influential. Interestingly, whether the paper had a female author or not was not significant in recommendation decisions. [Update: a different look at the data found a penalty for female authors; see addendum to this post.]
I find this study very interesting, and incredibly useful. Many people I have talked to have suggested that we scientifically compare single- with double-blind reviewing (SBR vs. DBR, for short). A common idea is to run one version of a conference as DBR and compare its outcomes to a past version of the conference that used SBR. The problem with this approach is that both the papers under review and the people reviewing them would change between conference iterations. These are potentially huge confounding factors. While the WSDM’17 study is not perfect, it gets past some of these big issues.
In the rest of the post I will summarize the details of the WSDM’17 study and offer some thoughts about its strengths and weaknesses. I think we should attempt more studies like this for other conferences.
The WSDM’17 conference review process is fairly standard. Once all papers are submitted, reviewers look at them to decide which ones they could capably review; this is called bidding. Reviewers mark papers they could review as yes or maybe; not marking a paper signals no, they cannot review it. The PC Chairs then apply a semi-automated algorithm to assign reviewers to papers. Reviewers read each paper assigned to them and render a judgment, which is either strong accept, accept, borderline, reject, or strong reject. Reviewers also judge a paper’s quality relative to other papers they were assigned to review; papers can be binned into the bottom 25%, the lower-middle 25%, the upper-middle 25%, or the top 25% of reviewed papers.
The reviewers are broken into two pools: one pool may learn authors’ identities during bidding and reviewing, but the other should not. To ensure this, all submitted papers were required to have the authors’ names and affiliations removed from their front page. 1 Those reviewers in the non-blinded pool were shown a paper’s authors by the on-line conference review system.
The paper assignment algorithm assigns two reviewers from each pool to each paper; as such, each paper has two blinded reviewers and two non-blinded ones. Once all reviews are in, author identities are revealed to all reviewers. Doing so puts all reviewers on a level playing field during discussions, which should hopefully avoid harming outcomes of submitted papers due to the experiment.
In total, 500 papers were submitted to the conference, creating an ample source of data to analyze.
The WSDM’17 chairs wanted to figure out how much author identity and various other factors might contribute to a reviewer’s assessment of a paper. 2 In particular, they developed a logistic regression model that aims to predict the likelihood that a single-blind reviewer will give a paper a positive recommendation. 3 The model considers as inputs several features of a paper, including (a) whether the paper has a female author, (b) whether one of the authors is famous, and (c) whether one of the authors is from a top institution. 4 It also considers a “quality score” for the paper as an input feature, where this score is determined by the assessment of the blinded reviewers. This score is a combination of the blinded reviewers’ recommendation for the paper and their ranking of the paper relative to the other papers they reviewed. 5 This is the blinded paper quality score, or bpqs.
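As a rough illustration of how bpqs is put together, here is a minimal Python sketch based on the scoring details given in the footnotes of this post (the numeric score per recommendation, plus 0.175 times the quartile rank, averaged over the two blinded reviewers). The function and constant names are my own and are not taken from the study.

```python
from statistics import mean

# Numeric score for each recommendation, as listed in this post's footnotes.
RECOMMENDATION_SCORE = {
    "strong accept": 6,
    "accept": 3,
    "borderline": -2,
    "reject": -4,
    "strong reject": -6,
}

# Weight applied to the quartile rank r (1 = bottom bin, 4 = top bin).
RANK_WEIGHT = 0.175


def reviewer_quality(recommendation: str, rank_bin: int) -> float:
    """Score from one blinded reviewer: recommendation score + 0.175 * r."""
    if rank_bin not in (1, 2, 3, 4):
        raise ValueError("rank_bin must be 1 (bottom bin) through 4 (top bin)")
    return RECOMMENDATION_SCORE[recommendation] + RANK_WEIGHT * rank_bin


def bpqs(blinded_reviews) -> float:
    """Blinded paper quality score: average over the blinded reviewers."""
    return mean(reviewer_quality(rec, r) for rec, r in blinded_reviews)


def is_positive(recommendation: str) -> bool:
    """A recommendation counts as positive when its numeric score is >= 0."""
    return RECOMMENDATION_SCORE[recommendation] >= 0


# Hypothetical paper: one blinded reviewer says "accept" (upper-middle bin),
# the other says "borderline" (lower-middle bin).
print(f"{bpqs([('accept', 3), ('borderline', 2)]):.4f}")  # → 0.9375
```

Because the rank term contributes at most 0.7, it mostly serves to break ties between papers whose blinded reviewers gave the same recommendations.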
The idea with this setup is that if there is no bias in a non-blinded reviewer’s score, then the logistic regression will find that bpqs is the only significant predictor of recommendation score. That is, the blinded and non-blinded reviewers will largely agree. However, if there is bias due to knowing author identity, then one of the other input features will end up playing a significant role in predicting non-blinded reviewers’ recommendations (too).
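The logic of this setup can be illustrated with a small simulation. The sketch below uses entirely synthetic data, not the study's: I plant a known bias toward famous authors in simulated non-blinded recommendations, then fit a logistic regression by plain gradient descent to see whether the bias shows up as a coefficient alongside the quality score. The sample size, the 30% share of famous authors, the intercept, and the quality coefficient are arbitrary assumptions; the planted fame effect of log(1.82) simply reuses the odds multiplier reported in this post.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n: int, fame_effect: float, seed: int = 0):
    """Generate synthetic (features, label) pairs with a planted fame bias."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        quality = rng.gauss(0.0, 1.0)               # stand-in for a standardized bpqs
        famous = 1.0 if rng.random() < 0.3 else 0.0  # assumed share of famous authors
        logit = -0.5 + 1.0 * quality + fame_effect * famous
        labels.append(1.0 if rng.random() < sigmoid(logit) else 0.0)
        rows.append((1.0, quality, famous))          # leading 1.0 is the intercept term
    return rows, labels


def fit_logistic(rows, labels, steps: int = 300, lr: float = 1.0):
    """Fit logistic regression weights by batch gradient descent on the log-loss."""
    weights = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad[j] += err * xi
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


rows, labels = simulate(n=2000, fame_effect=math.log(1.82))
intercept, w_quality, w_famous = fit_logistic(rows, labels)
print(f"estimated odds multiplier, quality: {math.exp(w_quality):.2f}")
print(f"estimated odds multiplier, famous author: {math.exp(w_famous):.2f}")
```

Setting `fame_effect=0.0` simulates the unbiased case, where the fitted fame coefficient should hover near zero. This sketch does not compute p values, which the study uses to decide which features are significant.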
The logistic regression model ended up identifying three statistically significant input features (i.e., those with p value < 0.05). The one with the largest effect was bpqs, which is a helpful sanity check. The other two were (b) and (c): whether one of a paper’s authors is famous (p < 0.006), and whether one is from a top institution (p < 0.004). The effect of these variables was sizable: the corresponding odds multipliers were 1.82x and 1.68x for (b) and (c), respectively. Read another way: the odds that a paper was deemed acceptable by an unblinded reviewer went up by 1.82 times if the paper had a famous author, and by a further 1.68 times, on top of that, if an author was from a top institution.
Note that to the extent that features naturally correlate with paper quality, some of their effect should be present in bpqs. So if we believe that researchers at top institutions, or who are famous, are naturally more likely to produce great papers, then the bpqs of their papers should tend to be higher. If so, the independent effect of (b) and (c) would be reduced. But the fact that the effects of (b) and (c) are still sizable suggests they may be playing a role in recommendation decisions.
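To make the odds multipliers concrete: a multiplier applies to the odds p/(1-p), so the resulting change in probability depends on where a paper starts. The sketch below converts a baseline probability to odds, applies a multiplier, and converts back. The 25% baseline is a hypothetical number chosen for illustration, not a figure from the study, and multiplying the two multipliers together assumes the model has no interaction term between fame and institution.

```python
def apply_odds_multiplier(baseline_prob: float, multiplier: float) -> float:
    """Return the probability obtained by scaling the baseline odds by `multiplier`."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * multiplier
    return new_odds / (1.0 + new_odds)


FAMOUS_AUTHOR = 1.82    # odds multiplier for feature (b), as reported in this post
TOP_INSTITUTION = 1.68  # odds multiplier for feature (c), as reported in this post

baseline = 0.25  # hypothetical chance of a positive recommendation with neither feature

p_famous = apply_odds_multiplier(baseline, FAMOUS_AUTHOR)
p_both = apply_odds_multiplier(baseline, FAMOUS_AUTHOR * TOP_INSTITUTION)

print(f"baseline:              {baseline:.3f}")
print(f"famous author:         {p_famous:.3f}")
print(f"famous + top institution: {p_both:.3f}")
```

Under this hypothetical baseline, a famous author moves the probability from 0.25 to roughly 0.38, and adding a top institution moves it to roughly 0.50. The probability itself grows by less than the odds multiplier, and the gap widens as the baseline gets higher.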
I find it very interesting that whether an author is/was female did not have a significant effect. This result is apparently consistent with an earlier study of journal reviewing from the late 80s by Rebecca Blank. 6 I’d be curious to explore why gender seems to not play a role in this setting, but apparently does in music auditions, grant reviews, or (scientific) job applications.
One thing I wonder about in the present study is a lack of consideration of expertise in reviews. What if the blinded reviewers were more expert than the unblinded ones, or the other way around? The fact that (un)blinded reviewers are (un)blinded during bidding, too, might compound the issue, if there was some (anti-)correlation between reviewer expertise and bidding choice. I would be very surprised to see that each paper had four expert reviewers; in my experience on large conference committees one should feel lucky to average 1-2 true expert reviews per paper.
Another thing I wonder about is the impact of being able to guess authorship, despite blinding. A prior study I carried out found that (for a much smaller conference) 1/3 of the time, reviewers felt they could guess an author of a paper they were reviewing, and they were right 4/5 of the time they guessed. How are recommendations of blinded reviewers affected when they have a good guess about authorship? My intuition is that this doesn’t matter. Rather, I suspect that many reviewers are implicitly, rather than explicitly, biased. As such, not knowing the authors of a paper as they start their review will improve fairness significantly.
This study is interesting and thought provoking. We should do more like it!
Update: Penalty for Female Authors
The authors of the WSDM study have updated their paper since this post was written. They find a penalty for female authors, but because the numbers in their sample were so small, the penalty is only statistically significant when combined with other data in a meta-analysis. The new abstract of the paper is as follows:
In this paper we study the implications for conference program committees of adopting single-blind reviewing, in which committee members are aware of the names and affiliations of paper authors, versus double-blind reviewing, in which this information is not visible to committee members. WSDM 2017, the 10th ACM International ACM Conference on Web Search and Data Mining, performed a controlled experiment in which each paper was reviewed by four committee members. Two of these four reviewers were chosen from a pool of committee members who had access to author information; the other two were chosen from a disjoint pool who did not have access to this information. This information asymmetry persisted through the process of bidding for papers, reviewing papers, and entering scores. Reviewers in the single-blind condition typically bid for 22% more papers, and preferentially bid for papers from top institutions. Once papers were allocated to reviewers, single-blind reviewers were significantly more likely than their double-blind counterparts to recommend for acceptance papers from famous authors and top institutions. The estimated odds multipliers are 1.76 and 1.67 respectively, so the result is tangible. For female authors, the associated odds multiplier of 0.82 is not statistically significant in our study. However, a meta-analysis places this value in line with that of other experiments, and in the context of this larger aggregate the gender effect is statistically significant.
- I presume that citations to the authors’ own work must be made in the third person, too, but the writeup does not say this explicitly. ↩
- Note that the study also considers the impact of the same input features on bidding behavior, asking how likely it is that they predict whether a reviewer will bid for a paper. Please see the paper for more discussion about this. ↩
- The five recommendations are mapped to numeric scores 6, 3, -2, -4, and -6 for strong accept, accept, borderline, reject, and strong reject, respectively. The logistic regression aims to produce a model from the collected data that predicts the likelihood that the assigned score is nonnegative given the values of various input features. ↩
- The other three features considered in the model are (d) whether a majority of authors are from the USA, (e) whether a reviewer is from the same country as one of the authors, and (f) whether a majority of authors is from an academic institution. These ended up not being significant contributors to score. ↩
- More precisely, for each blinded reviewer we add the recommendation’s numeric score to 0.175r, where r=4 means the reviewer ranked the paper in the “top bin” and r=1 means “bottom bin” (and 2 and 3 are in between). The average of the scores of the two blinded reviewers is the paper’s quality score. ↩
- That study found that those from mid-tier institutions benefited from double-blind review, but those from top or bottom institutions did not. ↩