Measuring Single vs. Double-blind Reviewing

My colleague Emery Berger recently pointed me to the paper Single versus Double Blind Reviewing at WSDM 2017. This paper describes the results of a controlled experiment to test the impact of hiding authors’ identities during parts of the peer review process. The authors of the experiment—PC Chairs of the 2017 Web Search and Data Mining (WSDM’17) conference—examined the reviewing behavior of two sets of reviewers for the same papers submitted to the conference. They found that author identities were highly significant factors in a reviewer’s decision to recommend the paper be accepted to the conference. Both the fame of an author and the author’s affiliation were influential. Interestingly, whether the paper had a female author or not was not significant in recommendation decisions. [Update: a different look at the data found a penalty for female authors; see addendum to this post.]

Blinded scales of justice.

Fairness is blind

I find this study very interesting, and incredibly useful. Many people I have talked to have suggested that we scientifically compare single- with double-blind reviewing (SBR vs. DBR, for short). A common idea is to run one version of a conference as DBR and compare its outcomes to a past version of the conference that used SBR. The problem with this approach is that both the papers under review and the people reviewing them would change between conference iterations. These are potentially huge confounding factors. While the WSDM’17 study is not perfect, it gets past some of these big issues.

In the rest of the post I will summarize the details of the WSDM’17 study and offer some thoughts about its strengths and weaknesses. I think we should attempt more studies like this for other conferences.

Experimental Setup

The WSDM’17 conference review process is fairly standard. Once all papers are submitted, reviewers look at them to decide which ones they could capably review; this is called bidding. Reviewers mark papers they could review as yes or maybe; not marking a paper signals no, they cannot review it. The PC Chairs then apply a semi-automated algorithm to assign reviewers to papers. Reviewers read each paper assigned to them and render a judgment, which is either strong accept, accept, borderline, reject, or strong reject. Reviewers also judge a paper’s quality relative to other papers they were assigned to review; papers can be binned into the bottom 25%, the lower-middle 25%, the upper-middle 25%, or the top 25% of reviewed papers.

The reviewers are broken into two pools: one pool may learn authors’ identities during bidding and reviewing, but the other should not. To ensure this, all submitted papers were required to have the authors’ names and affiliations removed from their front page.[ref]I presume that citations to the authors’ own work must be made in the third person, too, but the writeup does not say this explicitly.[/ref] Those reviewers in the non-blinded pool were shown a paper’s authors by the on-line conference review system.

The paper assignment algorithm assigns two reviewers from each pool to each paper; as such, each paper has two blinded reviewers and two non-blinded ones. Once all reviews are in, author identities are revealed to all reviewers. Doing so puts all reviewers on a level playing field during discussions, which should hopefully avoid harming the outcomes of submitted papers due to the experiment.
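
The writeup describes the assignment algorithm only as semi-automated, so the sketch below is my own illustration rather than the conference’s actual procedure: a toy greedy assignment that enforces the two-from-each-pool constraint and prefers reviewers who bid. All names here are hypothetical.

```python
from collections import defaultdict

def assign_reviewers(papers, blinded_pool, nonblinded_pool, bids, per_pool=2):
    """Toy illustration of the WSDM'17 constraint: each paper gets `per_pool`
    reviewers from the blinded pool and `per_pool` from the non-blinded pool,
    preferring reviewers who bid 'yes', then 'maybe', then everyone else.
    The real assignment was semi-automated and presumably balanced reviewer
    load and expertise; this sketch ignores both."""
    bid_rank = {"yes": 0, "maybe": 1}  # papers not bid on rank last (2)
    assignment = defaultdict(list)
    for paper in papers:
        for pool in (blinded_pool, nonblinded_pool):
            ranked = sorted(pool,
                            key=lambda rev: bid_rank.get(bids.get((rev, paper)), 2))
            assignment[paper].extend(ranked[:per_pool])
    return assignment
```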

In total, 500 papers were submitted to the conference, creating an ample source of data to analyze.

Data analysis

The WSDM’17 chairs wanted to figure out how much author identity and various other factors might contribute to a reviewer’s assessment of a paper.[ref]Note that the study also considers the impact of the same input features on bidding behavior, asking how likely it is that they predict whether a reviewer will bid for a paper. Please see the paper for more discussion about this.[/ref] In particular, they developed a logistic regression model that aims to predict the likelihood that a single-blind reviewer will give a paper a positive recommendation.[ref]The five recommendations are mapped to numeric scores 6, 3, -2, -4, and -6 for strong accept, accept, borderline, reject, and strong reject, respectively. The logistic regression aims to produce a model from the collected data that predicts the likelihood that the assigned score is non-negative given the values of various input features.[/ref] The model considers as inputs several features of a paper, including (a) whether the paper has a female author, (b) whether one of the authors is famous, and (c) whether one of the authors is from a top institution.[ref]The other three features considered in the model are (d) whether a majority of authors are from the USA, (e) whether a reviewer is from the same country as one of the authors, and (f) whether a majority of authors is from an academic institution. These ended up not being significant contributors to score.[/ref] It also considers a “quality score” for the paper as an input feature, where this score is determined by the assessment of the blinded reviewers. This score is a combination of the blinded reviewers’ recommendation for the paper and their ranking of the paper relative to the other papers they reviewed.[ref]More precisely, for each blinded reviewer we add the recommendation’s numeric score to 0.175r, where r=4 means the reviewer ranked the paper in the “top bin” and r=1 means “bottom bin” (and 2 and 3 are in between). The average of the scores of the two blinded reviewers is the paper’s quality score.[/ref] This is the blinded paper quality score, or bpqs.
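
Putting the footnotes together, a minimal sketch of the bpqs computation might look as follows; the function and dictionary names are mine, and I am assuming the score mapping and the 0.175·r bonus exactly as described above.

```python
# Numeric scores for the five recommendations, per the footnote above.
REC_SCORE = {"strong accept": 6, "accept": 3, "borderline": -2,
             "reject": -4, "strong reject": -6}

def bpqs(blinded_reviews):
    """Blinded paper quality score: for each blinded reviewer, add 0.175 * r
    to the recommendation's numeric score, where r is the relative-ranking
    bin (1 = bottom 25%, ..., 4 = top 25%); then average over the reviewers."""
    scores = [REC_SCORE[rec] + 0.175 * r for rec, r in blinded_reviews]
    return sum(scores) / len(scores)

# Example: an 'accept' ranked in the top bin plus a 'borderline' in bin 2:
# (3 + 0.7) + (-2 + 0.35) = 2.05, so bpqs = 2.05 / 2 = 1.025
print(bpqs([("accept", 4), ("borderline", 2)]))  # 1.025
```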

The idea with this setup is that if there is no bias in a non-blinded reviewer’s score, then the logistic regression will find that bpqs is the only significant predictor of recommendation score. That is, the blinded and non-blinded reviewers will largely agree. However, if there is bias due to knowing author identity, then one of the other input features will end up playing a significant role in predicting non-blinded reviewers’ recommendations (too).
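
The study’s code and data are not part of the writeup, but the kind of model it describes could be fit roughly as sketched below; the CSV file, the column names, and the choice of statsmodels are all illustrative assumptions on my part.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical per-review table for the non-blinded reviewers.  `positive`
# is 1 when the reviewer's numeric score is non-negative, i.e. the reviewer
# recommended accept or strong accept.
df = pd.read_csv("single_blind_reviews.csv")  # hypothetical file name

features = [
    "bpqs",                   # blinded paper quality score (control)
    "female_author",          # (a)
    "famous_author",          # (b)
    "top_institution",        # (c)
    "majority_usa",           # (d)
    "reviewer_same_country",  # (e)
    "majority_academic",      # (f)
]

X = sm.add_constant(df[features])
result = sm.Logit(df["positive"], X).fit()
print(result.summary())

# exp(coefficient) is the odds multiplier reported in the paper: a value
# near 1.8 for `famous_author` would mean famous authorship multiplies the
# odds of a positive recommendation by about 1.8.
print(np.exp(result.params))
```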

Results

The logistic regression model ended up identifying three statistically significant input features (i.e., those with p value < 0.05). The one with the largest effect was bpqs, which is a helpful sanity check. The other two were (b) and (c): whether one of a paper’s authors is famous (p < 0.006), and whether one is from a top institution (p < 0.04). The effect of these variables was sizable: the corresponding odds multipliers were 1.82x and 1.68x for (b) and (c), respectively. Read another way: the odds that a non-blinded reviewer recommended a paper for acceptance went up by 1.82 times if the paper had a famous author, and by a further 1.68 times, on top of that, if an author was from a top institution.
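
To keep odds and probabilities distinct, here is a small worked example (the 30% baseline is made up purely for illustration): an odds multiplier scales the odds p/(1-p), not the probability p itself.

```python
def apply_odds_multiplier(p, multiplier):
    """Convert a probability to odds, apply the multiplier, convert back."""
    odds = p / (1 - p)
    new_odds = odds * multiplier
    return new_odds / (1 + new_odds)

# A paper with a 30% baseline chance of a positive recommendation and a
# 1.82x odds multiplier ends up at about 44%, not at 30% * 1.82 = 55%.
print(apply_odds_multiplier(0.30, 1.82))  # ~0.438
```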

Note that to the extent that these features naturally correlate with paper quality, some of their effect should already be present in bpqs. So if we believe that researchers at top institutions, or who are famous, are naturally more likely to produce great papers, then their papers’ bpqs scores should tend to be higher. If so, the independent effect of (b) and (c) would be reduced. But the fact that the effects of (b) and (c) remain sizable suggests they may be playing a role in recommendation decisions beyond any correlation with quality.

Discussion

I find it very interesting that whether an author is/was female did not have a significant effect. This result is apparently consistent with an earlier study of journal reviewing from the late 80s by Rebecca Blank.[ref]That study found that those from mid-tier institutions benefited from double-blind review, but those from top or bottom institutions did not.[/ref] I’d be curious to explore why gender seems to not play a role in this setting, but apparently does in music auditions, grant reviews, or (scientific) job applications.

One thing I wonder about in the present study is its lack of consideration of reviewer expertise. What if the blinded reviewers were more expert than the unblinded ones, or the other way around? The fact that (un)blinded reviewers are (un)blinded during bidding, too, might compound the issue, if there was some (anti-)correlation between reviewer expertise and bidding choice. I would be very surprised to see that each paper had four expert reviewers; in my experience on large conference committees one should feel lucky to average 1-2 truly expert reviews per paper.

Another thing I wonder about is the impact of being able to guess authorship, despite blinding. A prior study I carried out found that (for a much smaller conference) reviewers felt they could guess an author of a paper they were reviewing 1/3 of the time, and they were right 4/5 of the time they guessed (so roughly a quarter of reviews came with a correct guess). How are the recommendations of blinded reviewers affected when they have a good guess about authorship? My intuition is that this doesn’t matter much. Rather, I suspect that many reviewers are implicitly, rather than explicitly, biased. As such, not knowing the authors of a paper as they start their review should improve fairness significantly.

This study is interesting and thought provoking. We should do more like it!

Update: Penalty for Female Authors

The authors of the WSDM study have updated their paper since this post was written. They find a penalty for female authors, but because the relevant numbers in their sample were small, the penalty is only statistically significant when their data is combined with data from other experiments in a meta-analysis. The new abstract of the paper is as follows:

In this paper we study the implications for conference program committees of adopting single-blind reviewing, in which committee members are aware of the names and affiliations of paper authors, versus double-blind reviewing, in which this information is not visible to committee members. WSDM 2017, the 10th ACM International Conference on Web Search and Data Mining, performed a controlled experiment in which each paper was reviewed by four committee members. Two of these four reviewers were chosen from a pool of committee members who had access to author information; the other two were chosen from a disjoint pool who did not have access to this information. This information asymmetry persisted through the process of bidding for papers, reviewing papers, and entering scores. Reviewers in the single-blind condition typically bid for 22% more papers, and preferentially bid for papers from top institutions. Once papers were allocated to reviewers, single-blind reviewers were significantly more likely than their double-blind counterparts to recommend for acceptance papers from famous authors and top institutions. The estimated odds multipliers are 1.76 and 1.67 respectively, so the result is tangible. For female authors, the associated odds multiplier of 0.82 is not statistically significant in our study. However, a meta-analysis places this value in line with that of other experiments, and in the context of this larger aggregate the gender effect is statistically significant.


5 Responses to Measuring Single vs. Double-blind Reviewing

  1. Nice post, Mike. But you say nothing about ‘light’ double-blind vs true double-blind. There is evidence, as you note, that the influence of the author’s name and institution is implicit or subliminal, and such effects can easily show up under light double blind. I think we desperately need a study to assess the issue, but it’s not clear how to design one.

    • For those who don’t know, “light” double-blind prescribes just what this WSDM’17 study did: making all authors’ names visible after reviews are submitted, prior to discussing papers and making final decisions.

      There are (at least) two reasons for doing such unblinding: (1) It puts all reviewers on equal footing during discussions. In particular, even though some reviewers will not know the authors of a paper after they’ve reviewed it, other reviewers will likely have a very good guess (per the various studies and surveys I’ve done, one of which I cited in the post). Such reviewers can argue from that position of knowledge while at the same time plausibly denying they have it. There are also some cases where authorship might be material, and those cases can be considered. For example, a review complaining that a paper did not sufficiently consider prior work can be amended if it turns out that such work is due to the very same authors! (2) If a review comes in prior to the final deadline, the reviewer can safely suggest additional reviewers without worry of a conflict of interest (like asking someone to review their own paper), since they know the authors. This leads to a more decentralized and efficient search for expert reviewers, rather than having everything go through the PC Chair.

      In prior surveys I’ve taken, those participating in light DBR (both authors and reviewers) seem to like it compared to DBR-to-the-end and SBR.

      But not everyone feels that way and there are arguments that blinding should persist up to final decisions. The main argument I’ve heard is that bias might still exist (implicit or explicit) even after one has submitted one’s review, so we should try to avoid it. The claim is that this risk is higher than the purported benefits I’ve listed above. The question is how to balance these concerns.

      Phil, what evidence are you referring to that institution or author identity has a subliminal effect that can “easily show up under light double blind”? I’m not aware of any such evidence. I agree it would be great to study this, but I haven’t thought about how you might do so.

  2. Thanks for the response, Mike.

    In answer to your questions: There is a lot of evidence, including the study you cite, that reviewers are affected by knowing who the author of a paper is without realising the impact of that effect. I expect the reviewers in the study you cite all believe they were being impartial and not unduly affected by knowledge of who wrote the paper. There is no reason to believe that these effects do not also show up at the final unblinding under light double blind. I concede that’s not the same as evidence that they do show up—that’s the problem, we have no evidence either way.

    So far as I am aware, computing is unique in using light double blind. I just did a search on the term, and all the results on the page are computing related—the top result is a FAQ you wrote! Why is it appropriate for us to adopt procedures so different than those used by others?

    • You say “There is no reason to believe that these effects do not also show up at the final unblinding under light double blind.”

      While I know of no study that tests this question directly, I think it’s too strong to say there’s no reason to believe that bias might be lessened if identities are revealed only post-review.

      In Daniel Kahneman’s terminology, there are two kinds of thinking: System 1 and System 2. System 1 thinking is confident and efficient, but approximate. System 2 is systematic and rational, but inefficient and slow. Many of our decisions are handled by System 1, and its quickness and confidence can lead to implicit bias. One relevant phenomenon is “anchoring”: When asked to carry out a task, your initial impressions at the outset strongly influence the outcome. System 1 sets a bar that System 2 then reasons from. E.g., see this video: https://www.youtube.com/watch?v=HefjkqKCVpo

      I believe this is the situation we are trying to avoid with double-blind review: We do not want reviewers’ systematic consideration of a paper to be anchored by an initial judgment based on author identity. We want their initial judgment to be impartial. And we are getting that with light double blind. Once a reviewer has gotten to the end of the process they are asked to render a judgment about a paper, and justify that judgment in writing. It stands to reason that at this point revealing authorship will have far less effect, since a lengthy systematic evaluation (with rational System 2) has already taken place. The other benefits of light DBR, to fairness and obtaining expert reviews, hopefully outweigh any remaining effect from bias.

      While the above discussion is not a proof, or empirical evidence, it is “reason to believe” that bias is less pronounced if authorship is revealed post-review. It would be great to test whether this argument is true, but I’m not so sure how to do that.

  3. Chad Wellington

    One of the more annoying aspects of gendering is “stereotype vulnerability”: the fact that awareness of (let alone belief in!) a stereotype tends to change the outcomes for both participants and observers.

    In music, this effect was pronounced because of the stereotype of the “superior male lungs”, or whatever else the observers believed in, filtering their view of auditions. Proper blind auditions were the only way to eliminate this effect in those convinced of (or having difficulty overcoming) the stereotype.

    I would hazard a guess that a belief in the equality of the sexes is more prevalent in the academic halls, so stereotype biases at the peer-review level are much lower. This, even if the cultural practices (that form the profession long before peer-review submission) don’t yet do a good job of enticing and retaining an academic demographic representing the population’s proportions.
