Peer review, and why it matters

Scientific results intersect with all aspects of the modern world, and they underpin many important decisions made by governments, corporations, and individuals. When you read about a scientific result—say, that a particular gene’s mutation correlates with a certain disease, or that it’s better for children to engage in unstructured play rather than structured learning at early ages—why should you believe it?

If you were an expert, you might be able to (with time) capably judge the work itself. But what if you are a non-expert, which is increasingly likely as branches of study become hyper-specialized? In this case, you are left to trust the scientific process, i.e., the way in which scientific work is judged to be true and important. At the heart of this process is peer review.

In this post I describe elements of the peer review process used in scientific research on programming languages (PL).

Of particular note is PL’s heavy use of peer-reviewed conferences. As in other areas of computer science, conferences tend to be the main vehicle for disseminating the best results, and as such they use rigorous peer review processes to select them.

While processes are important, we cannot lose sight of the critical role played by peer reviewers. Just as a democracy relies on motivated, informed, and conscientious citizens to vote for the best leaders, science relies on similarly motivated, informed, and conscientious peer reviewers to bring the best results to light. In a future post, I will present advice for reviewers on writing what I believe are high-quality peer reviews.

What is peer review?

Peer review is the process by which a scientific result is (initially) accepted by the scientific community. This process is straightforward.

A group of scientists carries out research on a particular topic and writes a paper explaining their results. This paper is submitted to a venue that publishes scientific results. After an initial check of topic appropriateness by an editor or program chair, the paper is subjected to peer review: a group of peers with knowledge of the topic of study is given the paper to review. After reading the paper, they render a judgment about whether it should be accepted for publication. This judgment is based on whether, in their view, the result

  • is sufficiently important or thought-provoking,
  • is correct, and
  • is well presented, i.e., can be understood by the community that the publication venue represents.

Whether the paper is accepted or not, reviewer comments are sent back to the authors (anonymously) so that the authors can improve the paper and the work.

Conferences and journals

Peer review is nearly always employed by academic journals. Typically, a journal has an editorial staff, consisting of respected researchers in the field, who solicit peer reviews and, based on them, decide whether to accept a paper. Acceptance might involve several rounds of review, each following paper revisions, so that the reviewers can confirm that any flaws they identified previously have been fixed. Reviews can take weeks or months to perform.

In computer science, peer review is also employed by conferences. This can be a source of confusion for those outside of computer science because in other fields, conferences rarely employ peer review. Not only are CS conference publications reviewed, the review processes are often extremely rigorous, and on par with processes used by top journals. There has been a debate for many years about whether conference review processes should be made more like journal processes, with strongly held beliefs on either side.

Peer review is the gateway, not the final judge

Whatever the peer review process, it will never be perfect. As such, publication of a scientific paper is the first important step of establishing the validity of the work, not the last. What comes next is vigorous discussion within the scientific community and the general public, as well as follow-up research that continues to explore the validity of the published results.

Peer review in programming languages research

Top venues in PL employ rigorous peer review processes.

In fact, the steering committees of two of PL’s flagship conferences, POPL and PLDI, have recently issued documents, called Principles of POPL and Practices of PLDI, that describe a “contract” with authors (and the public) about the review process that will be used from year to year. These review processes have several notable features:

Blinding. Despite reviewers’ best efforts, implicit (unconscious) bias can creep in (check out the implicit association test to see this effect in yourself). Sometimes the review process aims to correct for this. In particular, reviewers are typically anonymous to authors—such single-blind reviewing (SBR) empowers the reviewers to make accurate judgments without fear of retribution. Sometimes, authors are also anonymous to reviewers—such double-blind reviewing (DBR) aims to avoid bias in reviewer judgments in favor of known groups, or against unknown ones. Both POPL and PLDI employ a “light” form of DBR that hopefully helps reduce bias without imposing high costs on authors.

Author response. A paper might be nearly publishable, but not quite ready. Whereas journal review processes allow reviewers to iterate with authors, conferences are usually a one-shot process: papers with flaws are rejected. POPL and PLDI support a lightweight form of iteration: authors may submit a response (or “rebuttal”) to the reviews they have received, and this response is taken into consideration before a final judgment is made. If the paper is not accepted, it can go to the next conference in a few months’ time.

Three or more reviews. Despite our best efforts, reviewers will make mistakes, so it would be unwise to rely on a single reviewer when rendering a judgment. Moreover, some aspects of reviewer judgment, like the determination of the importance of a result, are very subjective. For both reasons, peer review employs more than one reviewer in making a final judgment. POPL and PLDI aim for at least three reviews, oftentimes four, which I note is more than the two reviews often employed in top journals in other areas of science.

Program committees. Instead of soliciting reviewers for each paper on an ad hoc basis, as is done with journals, a conference has a program committee (PC) that reviews all papers submitted to the conference. Program committees confer several advantages:

  • Committee members review many papers in a concentrated period, and so they can develop a sense of quality when judging papers (and get a wider sense of the happenings in the field).
  • Committee members are chosen carefully, in advance, from a diverse population, and provide valuable input (e.g., by “bidding”) to the process of assigning papers to reviewers. Ad hoc reviewers selected by journal editors might inadvertently be drawn from a narrower population (e.g., famous people immediately familiar to the editor). On the other hand, a PC may turn out not to have sufficient expertise in a particular area; in this case external reviews are sought to fill in the gaps.
  • Papers are discussed, either in person, or on-line, by the reviewers, the PC Chair, and possibly other members of the committee, in the context of the whole program. Such discussion leads to better outcomes and keeps reviewers honest (a shoddy review will be seen by one’s peers). By contrast, the decision maker in journals is typically an associate editor (AE), whose judgment is based on the original reviews but without discussion. As an AE for Transactions on Programming Languages and Systems (TOPLAS), I often ask disagreeing reviewers to discuss a paper, inspired by conference-based review.

Which process is best?

The details of peer review processes can generate a healthy debate. We should be glad for this, because as consumers of scientific results we rely on peer review judgments to be good ones. We can feel more confident in a result if we feel confident about the process that produced it.

Single- vs. double-blind review. Not everyone agrees that double-blind review is worth it. As program chair of POPL a few years back, I followed Kathryn McKinley’s lead for PLDI and pushed to use double-blind review. My feeling is that this approach does indeed reduce bias and increase the quality of judgments. On the other hand, double-blind reviewing can be a burden on authors. For example, making the paper anonymous could force some contortions (e.g., if the system being studied is well known), and authors may be restricted from talking about their results (e.g., at an interview) until the review process completes. To mitigate these costs, we employed a light form of DBR that we hope reduces bias and burden; reviewers and authors seemed to feel it worked, as detailed in my Chair report.

Journal vs. conference reviewing. Conference-based review and publication has disadvantages. Because they read many papers in a short period, reviewers may spend less time per paper than they would for a journal submission, which means they may miss important details. Conference paper page lengths are restricted, which can encourage better writing quality, but can also hurt it when authors cut helpful examples to save space. (Does it really make sense to limit page lengths in an age of on-line dissemination?) And there is often no reviewer follow-up to make sure issues are actually fixed (though SPLASH/OOPSLA is currently experimenting with a two-phase review process). Many people are unhappy with this situation. Alessio Guglielmi characterizes the problems well (and links to other thoughtful points of view, e.g., from Matthias Felleisen and Moshe Vardi).

Science of peer review? One way to settle these debates would be to use science. (It would be ironic if a paper published to assess a review process had itself been reviewed under a process that the paper deems sub-par.) Unfortunately, this is very hard to do. For example, we could imagine comparing the outcomes of conferences that do and do not use double-blind review, and seeing whether one tends to bias toward (say) more famous, male authors at well-known institutions. But such a conclusion would be hard to distinguish from random chance because of the many other variables involved, e.g., differences in the papers considered, differing reviewers, and differences in other details of the review process. A more controlled study would be to have two committees review the same papers, one set blinded and one set not. But such a study would be incredibly costly, and the difference in reviewers might still have more effect than SBR vs. DBR. A believable, lightweight approach to assessing different processes would be incredibly valuable.
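
To make the confounding problem concrete, here is a minimal simulation sketch (all numbers, and the run_conference helper, are hypothetical and not drawn from any real conference data). It models two conferences whose simulated review processes are both completely unbiased; a difference in their submission pools alone produces an apparent famous-author acceptance gap at one conference and not the other, exactly the kind of confound that would muddy an observational SBR-vs-DBR comparison.

    # Hypothetical simulation: two conferences, neither with any reviewer bias.
    # A confound (stronger famous-author submissions at one venue) is enough to
    # create an apparent "bias gap" in an observational comparison.
    import random

    def run_conference(n_papers, frac_famous, famous_quality_boost, accept_threshold):
        """Simulate one conference whose reviewing is completely blind to fame.

        Each paper has a latent quality; famous-author papers may differ in
        average quality (the confound). A paper is accepted iff its noisy
        review score clears the acceptance threshold.
        """
        accepted = {True: 0, False: 0}
        submitted = {True: 0, False: 0}
        for _ in range(n_papers):
            famous = random.random() < frac_famous
            quality = random.gauss(0.0, 1.0) + (famous_quality_boost if famous else 0.0)
            review_score = quality + random.gauss(0.0, 1.0)  # noisy, fame-blind reviewing
            submitted[famous] += 1
            accepted[famous] += review_score > accept_threshold
        return accepted[True] / submitted[True], accepted[False] / submitted[False]

    random.seed(0)
    # "Conference A" (say it uses SBR) happens to attract stronger famous-author papers;
    # "Conference B" (say it uses DBR) does not. Both simulated processes are unbiased.
    a_famous, a_other = run_conference(2000, 0.3, famous_quality_boost=0.5, accept_threshold=1.0)
    b_famous, b_other = run_conference(2000, 0.3, famous_quality_boost=0.0, accept_threshold=1.2)
    print(f"Conference A: famous {a_famous:.2f} vs. others {a_other:.2f}")
    print(f"Conference B: famous {b_famous:.2f} vs. others {b_other:.2f}")
    # Typical outcome: A shows a sizable famous-vs-others acceptance gap while B
    # shows essentially none, even though the gap in A comes entirely from the
    # paper pool, not from the review process.

The point is not the particular numbers (they are made up), but that disentangling process effects from differences in submission pools and reviewers requires far more care than a raw cross-conference comparison allows.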

Calling on good reviewers

As my colleague Peter Sewell pointed out in his POPL’14 report, whatever the process used, in the end it only works when it involves reviewers who are trying, in good conscience, to render thoughtful and informed judgments.

Good reviewing is not easy. It requires taking the time to read papers carefully, being informed about the area, knowing when problems are fundamental as opposed to when they shouldn’t stand in the way of publication, and taking the time to write constructive feedback. The golden rule applies: How would you like your paper to be reviewed? Apply your answer to the papers you review yourself. Take sufficient time to do your reviews, and don’t say ‘yes’ to so many reviewing duties that you can’t do a good job.

In a future post, I’ll present some advice on writing good peer reviews, based on my experience as a reviewer, conference chair, and editor.

The PL community is blessed to have a culture of good reviewing. This is a good thing: Peer review is the heart of the scientific process—it is a gateway for new ideas and the foundation of our trust in published results, some of which go on to have a big impact in our lives.


Responses to “Peer review, and why it matters”

  1. Here’s a recent study from CRA on the trends in how computer scientists are evaluated, with (an excess of) conference papers potentially playing a negative role.

  2. Interesting result, Aws, thanks for sharing! If you haven’t seen it, Snodgrass has a long review of DBR, with some consideration of gender bias. But the review is now 8 years old. http://www.cs.utexas.edu/users/mckinley/notes/snodgrass-sigmod-2006.pdf

  3. Something I wouldn’t have guessed before being asked to do reviews myself is how hard it is to write a review. (I knew it would also be extremely interesting; it’s very nice to be a reviewer if you don’t have too many papers to review.) You mention the effort of reading the paper in depth (and the appendices, and sometimes the related work you need to look at more closely to compare), and the knowledge requirements, but I found the hardest part was to actually *judge* the paper. Sometimes it’s easy (you find the thing horrendous, or on the contrary it’s the greatest paper you’ve read this year), but in my experience a large part of the review work is forcing yourself to come to a conclusion: what do I actually think of this paper?

    I’ve read some “how to do good reviews” advice but it often feels rather hollow. “Write reviews as you would like to get back on your own submissions” is good advice, but it’s more about the form (it’s not easy to write a negative review that is still nice, and you know the author will be disappointed or angry inside however you write it, if only for a moment). Another piece of advice is to avoid neutral scores (if scores range from -2 to +2, avoid 0 at all costs). But I haven’t read much advice (if it exists) on how to actually formulate a judgment on a paper — besides validity checking.

    I think three things could help writing better reviews:

    – a checklist of errors not to make when you write a review
    – a checklist of questions to ask oneself to make a judgment; the questions in the review interfaces of conferences sometimes help (e.g., one conference asked “how would you rate the importance of the problem attacked by this paper?” and that’s actually something you may forget to ask yourself during your review if you’re focused on something more vague like “do I like this paper?”), but I think we could have centralized, more complete checklists. (This would probably help with writing the papers as well!)
    – reading reviews written by other people; unfortunately the practice of publishing one’s reviews is not accepted right now (and reviewers may not consent to this), so for beginners who are not yet on a program committee, the only source of reviews is their own submissions

    • Great comments! Later this week I expect I’ll post my advice on writing reviews. Some of your points are addressed in my draft, so I’ll be curious as to your comments once it’s out.

      One quick comment: I’m not opposed to neutral reviews. It may be that the paper is really well written, solid and correct, but not that interesting (to you). A weakly positive review is perfectly appropriate. If all the reviews are (only) weakly positive, then the paper is probably not a good one to accept. Inoffensive, but not something that will move the community. The key is to be systematic and honest, and then let the process play out, IMO.

    • Is there something taboo about publishing your reviews? This is never stated in the call or instructions for authors, nor is it, as far as I can tell, one of those “silent” rules everyone follows. Publish your reviews; consent from the reviewer is not necessary, especially if you have no idea who the reviewer is! It is also great from a transparency standpoint: in the best case, it helps us understand what the PC is looking for; in the worst case, conference communities can be held accountable for the collective quality of their reviews.

      • It’s an interesting idea. Some conferences do publish reviews. Andrew McCallum has been hosting OpenReview which makes clear that certain forms of openness, like publishing reviews, can happen, and this system was used for ICLR’13. SIGCOMM and other systems conferences have published “public reviews” (e.g., here for HotNets III, and here for SIGCOMM’13) but I’m not sure how these relate to the actual peer reviews.

        I imagine one danger of publishing reviews is deanonymization. NLP techniques are pretty good these days.

        • Systems seems to be more open than PL in this regard (but we haven’t even gone open access yet, so…); EuroSys provided mini-reviews for each accepted paper last year (not for 2014 though). Chairs should be looking at ways of providing more transparency in why papers are accepted and rejected. As a bonus, these reviews are also useful in promoting the accepted papers.

          What might be a cool experiment is something like a published comment section for an accepted paper. Consider the interesting debate recorded at the end of Dijkstra’s “Go To Considered Harmful.”

      • This is a controversial point. If publishing your reviews without any previous agreement with the reviewers were an accepted practice, I would be glad to do it. Although some do that (and I’m grateful to them because it’s always interesting to read reviews), it is not an accepted practice; when I discussed it with my colleagues, some choked at the idea of publishing other people’s reviews.

        While that may be harmless for an established researcher, a PhD student is not in a position to be controversial about his or her practice of research. I decided not to publish my reviews unless I got explicit consent, which means that the PC would need to notify reviewers (with reviewing for the PC implying consent), or at least give them the choice and ask them to confirm consent in their reviews directly, which some people do in any case.

          • Be bold. Cultural norms only change when someone takes the first step. PhD students are given latitude to experiment and break conventions; it is us established researchers who would get pegged for it (and if anything, no one has ever cared in my case).

          • I object to your exclusionist assertions that “Peer review is the gateway” and
            that “publication of a scientific paper is THE first important step of establishing the validity of the work”

            If you restrict peer review to judging validity and novelty of claims in a paper, I will not object to your claim.
            But as soon as you introduce highly subjective factors like “importance” and “well presented, i.e., can be understood by the community that the publication venue represents” into the acceptance criteria (often the case for PL conferences), you should not describe peer reviewers as gatekeepers of science. In my opinion, that misleads the public:
            https://pbs.twimg.com/media/C3AohZuUoAQXZJy.jpg

            Nowhere in the definition of science is the requirement of being interesting to an ill-defined set of humans.

            There is some evidence to suggest that (in other areas) those subjective factors have led to systematic suppression of entire research directions. For example, the Chronicle [1] documents Geoff Hinton’s, Yann LeCun’s, and Yoshua Bengio’s struggles to publish their papers on ideas that are now often called deep/convolutional nets. (In a colloquium talk at Cornell last year, Yann LeCun half-jokingly said that now the same conferences won’t publish anything that is not related to deep nets.)

            A similar phenomenon happened in nutrition science [2]. This one seems to be far more catastrophic: it led to the loss of at least thousands of lives to diseases like diabetes.

            [1]: http://www.chronicle.com/article/The-Believers/190147
            A non-paywalled version may be found here http://weibo.com/p/1001603814055260359965?from=page_100505_prof

            [2]:
            https://www.theguardian.com/society/2016/apr/07/the-sugar-conspiracy-robert-lustig-john-yudkin (long but worth reading till the end)

          • Thanks for your comment. You write it at a time that I’ve been thinking more about “interest” and “validity” in peer review.

            My concern is that we are falling down on assessing the second, by allowing the publication of papers without proper empirical evaluations, proofs, etc. Based on preliminary investigations, I find that many papers use invalid methods, and thus may not actually achieve what they claim to. That such papers are nevertheless accepted might be due to the expectation, or even elevation, of subjectivity in review: subjective views of the importance of a paper’s topic (for example) may inadvertently overshadow consideration of the correctness of its results.

            The justification for considering “importance” is to select papers that are more likely to be of interest to readers, who have limited time. I concede that this is a potentially dangerous practice. What do you suggest?

            I disagree with you that writing/presentation should not be considered during review. Research is not research unless it communicates knowledge. A paper that is opaque due to poor writing is not communicating knowledge. Writing quality is something that trained reviewers should consider, just as they consider the empirical methods and other technical elements of the result. Poor writing is, in my experience, the reason many bad papers are accepted, paradoxically. My guess is that badly written papers have worse overall impact. (Some journals force writers into a particular format in the hopes that doing so will make the presentation clearer. I’m skeptical that this works.)


  9. (@Michael Hicks: I don’t see any reply button next to your comment. So, this post may appear at an incorrect position in the thread. Feel free to ask me to delete it and repost it at the right place.)

    I like Jeremy’s idea [1] of decoupling publication and presentation.
    Peer review, which is widely believed to have the role of gatekeeping science, should be concerned only with judging validity and novelty. All papers whose validity and novelty have been ascertained (up to “reasonable” confidence) by reviewers MUST be accepted for publication. However, not all such papers need to be invited for presentation. The PC can vote/debate and choose what is important enough for presentation. For readers who don’t know what is important, conferences can publish some kind of “importance score” for every paper accepted for publication (not just for those invited to present). This importance score, which need not be a scalar, can account for other subjective factors too, like “accessibility to a wide audience”.
    However, it should be made abundantly clear that the invitation to present and the importance score are independent of validity.

    In accordance with the above, the “identify the champion” policy should be scrapped in favour of “identify the (in)validator”. The PC chair should strive to find at least 2 experts who understand the paper in sufficient depth and can confidently agree about the paper’s validity or invalidity.

    To judge validity, it is not necessary that all of the details of a paper are understandable to a “broad community”, especially to non-experts. It is fine if even expert reviewers cannot understand a proof that has been mechanized, as long as the reviewers can at least ascertain that the lemma statement in the mechanization is what the paper claims.
    (Just because 2 experts cannot (or don’t have the time to) understand something doesn’t mean that no one can understand it. Depending on their background, different people often have different intuition about the same formal proof. Validity is not a concern anyway for machine-checked proofs.)

    I agree that it is important for research to be easily understood by others in “the community”. I firmly believe that a wide audience should be able to understand the main ideas of a paper (at least abstract and intro). However, I don’t think that a gatekeeper of science should (or even can, reliably,) judge this aspect. It is fine to have these aspects in the criteria for presentation invitations and importance score.

    [1]: http://siek.blogspot.co.at/2014/01/the-publication-process-in-programming.html

