Consider this claim:
Quality is more important than quantity
I expect few people would disagree with it, and yet we do not always act as if it were true. In academia, when considering candidates to hire or promote, we count their papers, their citations, their funding, their software download rates, their graduated students, the number of their committee memberships or journal editorships, and more.
Researchers are getting the message: quantity matters. Ugo Bardi proposes the economic underpinnings of this apparent trend, cleverly arguing that scientific papers are currency, subject to phenomena like inflation (more papers!), assaying (peer review validates papers, which support funding proposals, which finance more papers), and counterfeiting (papers published without review by unscrupulous publishers). Moshe Vardi, in a recent blog post, concurs that “we have slid down the slippery path of using quantity as a proxy for quality” and that “the inflationary pressure to publish more and more encourages speed and brevity, rather than careful scholarship.”[ref]Update 8/21/2016: As more evidence of the problem, here’s a great retrospective from the editor of a top journal in sociology that points to quantity greatly devaluing quality.[/ref]
In this post I consider the problem of incentivizing, and assessing, research quality, starting with a recent set of guidelines put out by the CRA. I conclude with a set of questions—I hope you will share your opinion.
Encouraging quality by reducing quantity
To help address this problem, Batya Friedman and Fred B. Schneider recently published a CRA report, Incentivizing Quality and Impact: Evaluating Scholarship in Hiring, Tenure, and Promotion. It was developed over an 18-month period by a diverse Committee on Best Practices for Hiring, Promotion, and Scholarship, which Friedman and Schneider co-chaired.
They make a simple suggestion: Take quantity out of the evaluation process by asking candidates to put forward a handful of papers for careful consideration, rather than the entirety of their CVs. In particular,
- For hiring, identify 1-2 pubs “read by hiring committees in determining whom to invite to campus for an interview and, ultimately, whom to hire.”
- For promotion, identify 3-5 pubs; “Tenure and promotion committees should invite external reviewers to comment on impact, depth, and scholarship of these publications or artifacts.”
The goal is to “focus on producing high quality research, with quantity being a secondary consideration.” Viewing papers as currency, this recommendation aims to combat inflation by fixing prices. I like the idea. Importantly, it implies that we can determine “quality” by a careful, direct examination of a handful of papers (assessing “impact, depth, and scholarship”), rather than an at-a-distance examination of many papers. I think that doing so can be challenging.
What features does a high-quality research paper have? I can think of several:
- Problem being addressed is important
- Approach/description is elegant and/or insightful
- Clever/novel techniques are employed
- Results are impressive
- Methods are convincing; e.g., robust experimental evaluation, proof of correctness (perhaps mechanically verified), etc.
Program committees are often instructed to look for these features when deciding whether to accept a paper. Of course, not all must be present. For example, one of the tenets of “basic” research is that some problems may have no clear application as yet, so we must judge them using other features.
The above elaborate on the CRA’s recommendations of “depth” and “scholarship”; the third thing they mention is “impact.” In a sense, my list of features comprises intrinsic judgments of a paper, whereas impact is an extrinsic measure of how things played out. A commonly used judgment of impact is citations: If a paper is cited a lot, it has evidently impacted the research world. We might assume that it did so because it exhibited some or all of the intrinsic features above. Another measure is adoption; e.g., if the results of a paper are incorporated into a major industrial effort or product, then that would show that the problem was, indeed, important, and the results were convincing. There are many others.
Assessing impact (or its potential) is desirable because intrinsic features are telling, but not necessarily discerning. We can imagine that many people will write intrinsically good papers, but few of these will be significantly “impactful.” We would like to hire/promote those researchers with a penchant for impact.
But assessing impact is also difficult. For one, it takes time—perhaps a long time. A paper I co-authored on the safe manual memory management scheme in the Cyclone programming language appeared in 2005, but the emergence of Mozilla’s Rust programming language, which incorporates many of the ideas in that paper, didn’t occur until recently, well after my tenure case. We can also imagine that impact changes with time; e.g., ideas once viewed as groundbreaking can lose their luster (one example that comes to mind is software transactions for concurrency control).
Less qualitatively, citation counts are not always a good proxy or predictor of impact. My Cyclone paper was not cited much when I went up for tenure, and still hasn’t been. (Fortunately, my tenure case did not rest on this one paper!) As a more high-profile example, Sumit Gulwani‘s FlashFill work was incorporated into Excel — this is a major impact. While the paper is cited a respectable 145 times in 5 years (as of 4 Nov 2015), it is certainly not among the most cited of that timeframe (e.g., consider the seL4 paper).
Nevertheless, it is hard to get away entirely from measures like citation counts. While citation counts can mislead, we know from the literature on behavioral economics that qualitative human judgment is not completely trustworthy either. The book Thinking, Fast and Slow, by Nobel Laureate Daniel Kahneman, is particularly eye-opening. Among many other interesting results, the book shows that people can be powerfully persuaded at the start of an evaluation process by superfluous details (which is a reason I support light double-blind review); that they will carry over a positive impression about one aspect of a person, based on good evidence, to another aspect of that person for which they have no evidence; that being heavily invested in a project or movement clouds judgment about that project’s prospects; and more. Therefore, quantitative measures offer an important data point, even if that point cannot be completely relied upon.
Improving these measures while not neglecting direct, thoughtful assessment (as per CRA’s recommendations), seems useful. For example, rather than treating all citations as equal, we could look at the citing text to understand whether references are positive or negative and what their purpose is, e.g., as a comparison to related work or as a reference to a prior technique being employed. We can thus construct a more nuanced picture. We could even use NLP techniques to automate the process.[ref]More radically, we might wonder whether NLP techniques could examine the text of a whole paper and then try to predict impact, down the line, assuming you choose an impact measure like citation counts. My colleague, Hal Daumé III, pointed me to a paper that does this, but adds “I guess I’m slightly dubious of automating fine-grained predictions because I think we as humans are pretty bad at making them (eg when deciding what papers to accept/reject). I totally think we can do coarse-grained automatically (at a sub-sub-sub-field level) but individual paper is hard.”[/ref]
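To make the citation-classification idea concrete, here is a minimal sketch in Python. It uses hand-picked keyword heuristics over a citing sentence; the category names and cue phrases are my own illustrative assumptions, not an established taxonomy, and a real system would instead train an NLP classifier on labeled citation contexts.

```python
# Toy sketch: classify a citing sentence by its apparent purpose using
# keyword heuristics. The cue lists and categories below are invented
# for illustration; a production system would use a trained NLP model.

USES_CUES = ("we use", "we adopt", "we extend", "building on", "following")
NEGATIVE_CUES = ("unlike", "fails", "however", "in contrast", "limited")
COMPARISON_CUES = ("compared to", "similar to", "related work", "as in")

def classify_citation(context: str) -> str:
    """Crudely label a citing sentence: builds-on, critical, comparison, or neutral."""
    text = context.lower()
    if any(cue in text for cue in USES_CUES):
        return "builds-on"   # a prior technique is being employed
    if any(cue in text for cue in NEGATIVE_CUES):
        return "critical"    # a negative or contrastive reference
    if any(cue in text for cue in COMPARISON_CUES):
        return "comparison"  # a related-work comparison
    return "neutral"         # a plain mention

# Example citing sentences (invented):
print(classify_citation("We extend the region-based analysis of [12]."))  # builds-on
print(classify_citation("Unlike [5], our approach handles aliasing."))    # critical
```

Even heuristics this crude illustrate the point: a count of “builds-on” citations would paint a different (and arguably more meaningful) picture of impact than a raw citation count.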
How to promote high-quality PL research?
I will close with three questions.
First, which departments are actually adopting the CRA recommendations, at least in some form? My understanding is that Cornell is adopting them, which would make sense given that a Cornell professor co-authored them. Others?
Second, what PL papers serve as models of quality and impact? There are several lists of “classic” PL papers, such as Pierce’s 2004 survey, Might’s top-10 static analysis papers (scroll to the bottom), and Aldrich’s classic papers list. One of my favorite papers is Wright and Felleisen’s A Syntactic Approach to Type Soundness. As far as papers exemplifying depth, clarity, convincing methods, and impact go, it’s hard to beat this one. What are your choices for great papers, and why?
Third, aside from limiting the number of papers considered at hiring/promotion time, are there other ways to incentivize great research? One boring, but potentially effective thing we could do is to have higher, and clearer, standards for review. For example, a complaint I have heard a lot is that the experiments done in PL papers are sketchy: They use poor statistics and/or cherry-picked benchmark suites. Thus we could imagine asking reviewers to more carefully confirm that good methods are used.[ref]Raising review standards may be problematic in PL’s conference-oriented culture, since we might like to accept flawed/incomplete papers to conferences that report potentially path-breaking ideas. Even if we left official review standards alone, the CRA guidelines might serve as a counter-incentive, pushing authors to write less flawed, more complete papers, since they know fewer papers will be considered when they are evaluated.[/ref] Another idea would be to nudge people towards certain types of quality with explicit rewards, e.g., best paper awards with clear criteria: “Best idea paper” or “Best empirical paper” or “Best tool paper” or “Best new problem paper”. What else could we do to incentivize quality (and make it clear when we’ve seen it)?
Promoting research quality is extremely important for the success of science and all who rely upon it. How can we do better?