Anyone familiar with American academia will tell you that the US News rankings of academic programs play an outsized role in this world. Among other things, US News ranks graduate programs in computer science, both by their strength in the field at large and by their strength in certain specialties. One of these specialties is Programming Languages, the focus of this blog.
The US News rankings are based solely on surveys. Department heads and directors of graduate studies at a couple of hundred universities are asked to assign numerical scores to graduate programs. Departments are ranked by the average score that they receive.
It’s easy to see that much can go wrong with such a methodology. A reputation-based ranking system is an election, and elections are meaningful only when their voters are well-informed. The worry here is that respondents are not necessarily qualified to rank programs in research areas that are not their own. Also, it is plausible that respondents would give higher scores to departments that have high overall prestige or that they are personally familiar with.
In this post, I propose using publication metrics as an input to a well-informed ranking process. Rather than propose a one-size-fits-all ranking, I provide a web application to allow users to compute their own rankings. This approach has limitations, which I discuss in detail, but I believe it’s a reasonable start to a better system.
Do we need rankings at all?
Many would argue that the idea of department rankings is inherently problematic. Any system of comparison must evaluate a department’s research quality using pithy metrics. To do this to a creative activity like research is reductionist and possibly harmful.
A glib rebuttal to this is that the genie is already out of the bottle. The marketplace has spoken, and it clearly likes university rankings. By not starting a conversation about a better system of ranking than US News’s, we reward the status quo.
A better justification is that department rankings are a valuable service to prospective students. Matching prospective graduate students to programs is a problem of resource allocation in a market. However, this market has information asymmetry, because students don’t have a clear idea of what makes for a good Ph.D. experience. When courting prospective students, universities put on shows that have limited connection to the reality of the graduate research experience. The problem is worse for international students, who frequently join universities sight unseen. As a result, it is easy for students to make suboptimal choices when selecting a program to join. At their best, university rankings help students make more informed decisions.
Ranking by objective metrics
What would a fairer system for ranking computer science departments look like? It seems to me that any such system should depend, in part, on real data on research productivity. The problem, of course, is that “research productivity” is a fuzzy concept. Efforts to approximate it using “bean counting” measures like paper or citation counts, or grant dollars, have basic shortcomings.
However, I think that these approximations have some value, especially when seen from the point of view of the end users of department rankings. Presumably, a prospective student would want to have a strong CV at the point when she finishes her Ph.D. She has a higher chance of doing so if the group she joins publishes regularly at top-tier publication venues in the areas in which she is interested. She is more likely to stay funded if her advisor has a track record of bringing in grant money. She is more likely to do highly cited research if her advisor has more highly cited papers than others in the same research area and level of seniority.
All in all, objective metrics have limitations, but also produce some useful signals. One could, at the least, make them a factor in computing department rankings, even using a reputation-based system. For example, survey respondents in a reputation-based ranking could choose (or be asked) to use productivity-based rankings as an input in their decision-making process. Presumably, this would lead respondents to make more informed judgments.
Interactive ranking: putting the user in charge
Another issue with existing ranking systems is that they are static, one-size-fits-all solutions. Think of a prospective student who is interested in the interface of Programming Languages (PL) and Machine Learning (ML). He should probably pick a department that has been active in PL and ML in recent times, and an advisor who has a strong track record in at least one of these areas. Depending on his interests, he might want an advisor who is primarily a PL researcher but also collaborates with ML folks, or the other way around. Finally, he may want to work with professors who are at a certain level of seniority, or whose students and postdocs have gotten high-profile research jobs. Unfortunately, current ranking systems do not support such nuanced decision-making.
Maybe what is needed, then, is a rankings application that can be customized to different user needs. For example, such an app could let a user assign weights to the different subareas of computer science, and rank departments and advisors by their weighted strength in these areas. The system would be fundamentally interactive: users would be able to change the weights on various variables and observe changes to the ranking results.
An interactive ranking system based on publication productivity
Over the last few weeks, I have coded up the first draft of such a ranking app. The rankings this app computes are based on a single objective metric: publication counts at top-quality outlets. The reason why I only used this metric is simple: data on which researcher publishes where is available from the DBLP bibliography database, and this data can be used to compute paper counts. On the other hand, I had no easy access to data on citations or funding or the history of a department’s former students. I believe the ranking system is defensible, as acceptance at top venues correlates, at least to some extent, with quality of research. However, by definition, it considers one dimension of a complex, multidimensional space. One might extend the app to allow ranking using a richer set of features.
DBLP doesn’t track institutions of researchers, but Alexandra Papoutsaki and her coauthors have recently developed a listing of faculty members at 50 top US universities. By cross-linking this dataset with DBLP data, one can determine where professors in a given department publish.[ref]Seddighin and Hajiaghayi have recently developed a ranking of departments by publication productivity in Theoretical Computer Science. Also, Remzi Arpaci-Dusseau developed a ranking of systems researchers by productivity a couple of years ago.[/ref]
I used feedback from colleagues and friends to select a few top-tier publication venues in several subareas of computer science (see here for more details). For example, the top venues in Programming Languages (PL) are the Symposium on Principles of Programming Languages (POPL), the Conference on Programming Language Design and Implementation (PLDI), the Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), and the International Conference on Functional Programming (ICFP). The top venues for Algorithms and Complexity are the Symposium on Theory of Computing (STOC), the Symposium on Foundations of Computer Science (FOCS), and the Symposium on Discrete Algorithms (SODA).
The application’s interface allows the user to select a time window within the last 15 years and to put weights on various areas. The app assigns a score to each professor by giving him or her w points (w is a number between 0 and 1) for each paper in a top-tier venue in an area with weight w. We also identify a set of relevant professors — intuitively, the set of professors who are prospective advisors in the areas of interest. To qualify as a relevant professor, a faculty member must have published 3 or more papers across top venues in areas where the user has put a nonzero weight, within the selected period. Departments are now ranked according to three different metrics.
1) Aggregate productivity. In this measure, the score of a department is the sum of the scores for all its professors. A department that scores high on this measure is likely to be a high-energy research environment, with a culture of publication in strong conferences in the areas of interest. However, this metric is likely to favor larger departments over smaller ones.
2) Maximal productivity. Here, the score of a department is the greatest score received by one of its professors. Unlike aggregate productivity, this metric is not directly affected by a department’s size. A justification for this measure is that at the end, a prospective student needs only one advisor. Consequently, joining a small department with one prolific researcher can possibly beat joining a larger department with multiple less productive researchers.
3) Group size. This metric estimates the size of a department’s group in the student’s areas of interest, by counting the number of relevant professors that it employs. This statistic puts the productivity rankings in perspective. Arguably, it is also of intrinsic interest to prospective students. Larger groups allow more courses, seminars, and interactions with fellow students. On the other hand, some students prefer the cosiness of small departments, which are likely to score poorly on this metric.
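To make the three metrics concrete, here is a minimal Python sketch of the scoring rules described above. It is an illustration, not the app's actual code. The `Paper` record, the `rank_departments` function, the area labels, and the sample data are all hypothetical, and the sketch assumes that every author of a paper receives full credit for it, which the description above does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass

# Venue-to-area map. Only the venues named in the post are listed;
# the area labels "PL" and "Algorithms" are illustrative.
VENUE_AREA = {
    "POPL": "PL", "PLDI": "PL", "OOPSLA": "PL", "ICFP": "PL",
    "STOC": "Algorithms", "FOCS": "Algorithms", "SODA": "Algorithms",
}

@dataclass(frozen=True)
class Paper:
    """One (author, paper) record; a coauthored paper appears once per author."""
    author: str
    department: str
    venue: str
    year: int

def rank_departments(papers, weights, first_year, last_year, min_papers=3):
    """Return aggregate productivity, maximal productivity, and group size per department."""
    score = defaultdict(float)   # professor -> weighted paper count
    counted = defaultdict(int)   # professor -> papers in areas with nonzero weight
    dept_of = {}
    for p in papers:
        area = VENUE_AREA.get(p.venue)
        w = weights.get(area, 0.0)
        # Skip non-top-tier venues, zero-weight areas, and papers outside the window.
        if w <= 0 or not (first_year <= p.year <= last_year):
            continue
        dept_of[p.author] = p.department
        score[p.author] += w     # w points per paper in an area with weight w
        counted[p.author] += 1

    metrics = defaultdict(lambda: {"aggregate": 0.0, "maximal": 0.0, "group_size": 0})
    for author, s in score.items():
        m = metrics[dept_of[author]]
        m["aggregate"] += s                    # 1) sum over all professors
        m["maximal"] = max(m["maximal"], s)    # 2) best single professor
        if counted[author] >= min_papers:      # 3) count of relevant professors
            m["group_size"] += 1
    return dict(metrics)

# Made-up sample data for two hypothetical departments.
sample = [
    Paper("alice", "Dept A", "POPL", 2010),
    Paper("alice", "Dept A", "PLDI", 2011),
    Paper("alice", "Dept A", "STOC", 2012),
    Paper("bob", "Dept A", "FOCS", 2012),
    Paper("carol", "Dept B", "ICFP", 2009),
    Paper("carol", "Dept B", "OOPSLA", 2010),
    Paper("carol", "Dept B", "POPL", 2011),
    Paper("carol", "Dept B", "PLDI", 2012),
    Paper("carol", "Dept B", "POPL", 1995),    # outside the time window
    Paper("dave", "Dept B", "SIGMOD", 2012),   # venue not in the map
]
result = rank_departments(sample, {"PL": 1.0, "Algorithms": 0.5}, 2005, 2014)
for dept, m in sorted(result.items()):
    print(dept, m)
```

With these weights, the hypothetical Dept A leads on neither productivity metric even though it has more publishing professors, which shows how the three metrics can diverge. Changing the weights or the time window and rerunning is the offline analogue of moving the sliders in the app.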
The purpose of this application is to allow interactive exploration of data, and different users will draw different conclusions from this exploration. So, rather than present any results, I invite you to use the application yourself!
Limitations and conclusion
As mentioned earlier, this ranking application is meant to be a first draft rather than the final word. Given that publication in selective venues is key to success as a researcher, I believe that the app produces some useful information. However, any ranking based on objective metrics has limitations, and this system is even more limited in considering just one measure.
The app has some implementation issues as well. The roster of faculty members used here was generated through crowdsourcing and is not guaranteed to be free of bugs. Also, linking faculty members in the roster to DBLP records isn’t always easy: a professor’s DBLP entry may use a name that is slightly different from their name in the roster, and some professors appear under multiple names in DBLP. I used a heuristic to join the two data sets while correcting for common variations of names, and also manually examined the data, but this is hardly a failsafe process.
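For readers curious what such a name-matching heuristic might look like, here is a small Python sketch. It is not the heuristic the app actually uses. The functions `name_key` and `link_roster` and all the names in the example are hypothetical. The sketch assumes that matching on an accent-free, lowercased (first name, last name) pair, while ignoring middle initials, punctuation, and numeric suffixes, is good enough for a first pass.

```python
import re
import unicodedata

def name_key(name):
    """Reduce a name to a (first, last) pair of plain lowercase ASCII tokens."""
    # Decompose accented characters, then drop the combining marks.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Treat digits, periods, hyphens, and other punctuation as separators.
    tokens = re.sub(r"[^a-z]+", " ", ascii_name.lower()).split()
    if not tokens:
        return ("", "")
    # Keep only the first and last tokens, so middle names and initials are ignored.
    return (tokens[0], tokens[-1])

def link_roster(roster, dblp_names):
    """Map each roster name to the list of DBLP names that share its key."""
    index = {}
    for n in dblp_names:
        index.setdefault(name_key(n), []).append(n)
    return {r: index.get(name_key(r), []) for r in roster}

# Made-up example: one professor listed under two DBLP spellings, one with no match.
links = link_roster(
    ["Ada Example-Smith", "Jane Doe"],
    ["Ada Q. Example-Smith", "Ada Example-Smith 0001", "José Roe"],
)
for roster_name, matches in links.items():
    print(roster_name, "->", matches)
```

A key this coarse will also merge distinct people who share a first and last name, and it will miss nicknames and changed surnames, so roster entries with zero or several matches would still need the kind of manual examination described above.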
However, I am hoping that the wisdom of the crowd can be used to overcome some of these limitations. The code and the data for the app are freely available (see here for more details). I have created a public Google document to collect information about errors in the data; please leave a comment there if you find bugs. You are also welcome to extend the app with additional features and productivity metrics.
Going beyond this particular app, I think we need to start a conversation about how to help prospective students make better decisions about where to go for graduate school. For a long time, we have allowed entities who do not have a stake in our discipline to rank our graduate programs. These rankings have dubious methodology, and they also have real implications for our departments. At the same time, we must recognize that they fill a real need. Creating nuanced, data-driven approaches to school ranking is a constructive way of challenging their hegemony.
[This post has been updated since it was first published.]
10 Responses to Ranking CS Departments by Publication Productivity, Interactively
The problem with rankings (or any decisions based on metrics) is that they increase the incentive to optimize one’s actions toward the specific metrics chosen, instead of actually doing good work. I don’t think computer science research needs more pressure to optimize on the *number* of papers accepted at elitist conferences; in this respect, it seems that the previous ranking method was “less harmful”.
You could argue that being a fairer measure of actual truth also does good, and that the negative effect of pushing people to farm out papers at greater speed (which is hard to measure) is compensated by the positive effect of having students make better choices (also hard to measure). That seems rather difficult to argue given that none of the measures proposed so far (and, I expect, none of the measures proposed to tweak exploration of this particular dataset in the future) actually answer the question of which graduate programs have professors that are available for their students, are enjoyable to work with, and will effectively teach how to be a good researcher (or other skills they are interested in when coming to graduate school).
I agree. These tools are often fun to play with. However, if our goal is to help students make better decisions about where to go, we need to do more. Even setting aside debates about whether the number of papers at top conferences is a good measure of research productivity, I think gasche is right that research productivity is not even close to being the factor that is most relevant to graduate students.
Why not focus on metrics that more directly relate to the experiences of students? I can think of several that are more relevant: median time to graduation, attrition rates, placement rates in industry/tenure track positions at research universities/teaching colleges, current student and alumni satisfaction, funding reliability, teaching load, percentage of students who are forced to change institutions or advisors because a faculty member leaves. The students I know from a number of institutions tend to be unhappy with their experience because of things related to the above, not because they think their advisors do not produce enough papers.
Of course, these metrics are just not available. Even when universities collect them, they rarely make them public, even to prospective students. As gasche says, all metrics are susceptible to gaming and manipulation, and these are no exception. For example, law schools have been known to game placement rates by giving graduates low paid “fake” jobs. However, I still think these are much more robust and useful than the standard bibliometrics. If, as a field, we really care about improving student outcomes, we should strive to collect these statistics and report them honestly.
Thank you for the comments.
I think we should make a distinction between two questions: (1) whether objective metrics are desirable at all; and (2) whether paper counts form a reasonable objective metric.
Regarding the first, I think that proper evaluation demands a combination of human judgment and data, and this is really the point of the post. Yes, metrics have their limitations, and these need to be understood by those who use them. But humans have insidious biases as well.
We also have to remember that not everyone has access to enlightened human judgment. I personally had a lot of experience with undergraduate research, and my choice of grad school and grad advisor was shaped by this experience. However, not every student has this sort of privilege.
Anyway, your concerns seem to be with (2): the particular metric of paper counts. I have two responses to that. First, I think that you are overestimating the impact and scope of this app. All I have done is present some data, with appropriate caveats. As much as I love the PL Enthusiast, I don’t think researchers will start producing more junk papers just so their departments score higher on my app. Also, top-tier outlets are top-tier for a reason; just because you submit more papers doesn’t mean those papers will get in!
Yes, there are researchers who do extraordinary work but don’t publish a lot, and this metric does discriminate against them. This is why its use needs to be complemented with human judgment, when available. (I would add that human recommenders often discriminate against those who do extraordinary work, publish at strong places, but who aren’t socially/academically connected to the recommender.)
Second, yes, this app doesn’t capture, as you say, “which graduate programs have professors that are available for their students, are enjoyable to work with, and will effectively teach how to be a good researcher (or other skills they are interested in when coming to graduate school).” As a scientist, my question then would be: how can we objectively model/approximate these questions? Or must we shy away from objective analysis just because we can’t be perfect at it?
As I said in the post, I would like to see a fuller picture of research productivity, one that accounts for citations and impact. But I would also love to have metrics that complement research productivity. For example, in his comment below, Joseph lists some criteria: placement rates, teaching load, funding reliability, median time to graduation, and so on. These are tangible measures, and I love them! Yes, data on these is hard to find, but if we know what kind of data to look for, we have already made progress. Those of us who care about this issue can then pressure our departments to collect and publicly report this data. This effort may not be successful any time soon, but we have to start somewhere!
It’s disappointing that this article couches itself in the language of
“objective metrics” but then slips in a highly subjective judgment
that harms many in the programming language community: it classifies
PLDI and POPL as “flagship” and OOPSLA and ICFP as “others”. PL is
the only area that’s broken down into subgroups with qualitative
modifiers, despite other areas such as Architecture having as many
venues (4) as PL.
In response to my complaining on twitter, the author pointed out that
“To reduce the difference to mere syntax, all you have to do is to put
equal weight on the two categories.”
But this is not mere syntax! Sure, we get one slider for PLDI/POPL
and another for OOPSLA/ICFP, but the labels “flagship” and “others”
together with the default settings that emphasize “flagship” over
“others”, carry a strong value judgment: PLDI and POPL count for
more than OOPSLA and ICFP. That’s semantics, not syntax. (Let’s
leave aside that there are many folks—I’m not one of them!—within
the OOPSLA community who would be unhappy sharing a slider with ICFP, and
Moreover, the author spells this out in several places. In the body
of the article it says “the top venues in Programming Languages (PL)
are the Symposium on Principles of Programming Languages (POPL) and
the Symposium (sic) on Programming Language Design and Implementation
(PLDI).” Full stop. No mention of OOPSLA or ICFP. On the
methodology page it states:
For one area, Programming Languages, we considered two categories
of venues: Programming Languages (flagship) and Programming
Languages (others). The latter category consists of two conferences
that are not flagship venues, but publish strong results.
So ICFP and OOPSLA publish strong results but are not “flagship
venues”. But what does that even mean? Is that a distinction that’s
made by SIGPLAN? As far as I can tell, it’s not. The term
“flagship event” is used in Jan Vitek’s 2013 SIGPLAN Chair report, but
the implication is clear: it refers to POPL, PLDI, *and* ICFP, and
OOPSLA (https://dl.acm.org/citation.cfm?id=2502508.2502510). If we
look at nominations to CACM Research Highlights, it seems pretty
evenly spread among the four (OOPSLA, ICFP, PLDI, and POPL), plus many
coming from more specialized SIGPLAN venues like PPoPP and the Haskell
This does real harm to a substantial portion of the PL community.
Those who publish more in OOPSLA and ICFP will have a harder
time finding jobs and seeking promotion. While we insiders all have
our own preferences and prejudices and can calibrate accordingly
(reducing the distinction to mere syntax), department chairs, deans,
and colleagues from other areas will take this (completely
unsubstantiated and subjective) judgment as gospel. Who would want
to hire or promote a researcher that doesn’t publish in the “flagship”
venues, but instead meddles in the lowly “other” venues? After all,
this is the “PL Enthusiast” and these are highly respected PL
researchers making this distinction. Surely it is a correct and
universally held opinion that POPL and PLDI are “the top venues”,
I’ve reviewed papers for POPL, OOPSLA, and ICFP. I’ve published at
ICFP, PLDI, and OOPSLA. I’ve attended all of them. In my experience,
I see no discernible difference in quality. Sure, there are some
cultural differences, but even then, it’s only a matter of degree. In
my opinion, the best kind of research is that which could be published at
any of them. Others surely disagree and think ICFP is the bee’s knees,
or PLDI is the only “real” vanguard venue. That’s fine, but it’s a
subjective matter of opinion. We shouldn’t pass off these kinds of
judgments under the banner of “objective metrics.”
Our field is large and diverse (at least on technical matters). It’s
a core and (relatively) mature area of computer science. Let’s let
at least four flowers blossom.
Thanks for your careful and thoughtful reply, David.
You are right that the judgment that PLDI and POPL are “flagship” is a subjective one. And of course you are right that great papers appear in ICFP and OOPSLA, just as they do in POPL and PLDI. I’ve published in all four venues (and served on their PCs) and enjoyed reading great papers in all of them. Going further, the same statement of quality is true of other PL venues, like ESOP, ECOOP, and TOPLAS (a journal!). And why stop at PL? I’m sure that equally valid criticisms could be leveled against the choices of “flagship” conferences in other areas (e.g., why PODS and not ICDE, for databases?).
We could rectify these problems by including all venues, making a slider for each, and letting each user pick the combination they like best. But I also think that the approach taken, which is basically to view PLDI and POPL as a sample of the whole area, is not unreasonable, to keep things simple. Certainly no harm was meant by it. And even more certainly there was no intent to imply that the choice of POPL and PLDI as representative was somehow determined objectively, or that papers in these venues strictly dominate OOPSLA and ICFP papers. Perhaps the use of the word “flagship” was poor, if it gave these impressions.
We could also rectify the problem by performing no analytics at all. No matter which set of conferences you pick, and what weights you assign to them, it is easy to argue that the metric you compute will be wrong in many ways, on its own. But I believe that metrics like the one offered here are a useful sanity check, to avoid completely subjective bias (which tends to cement existing ranks). Are we better off with no data at all, and just impressions? I don’t think so. Moneyball showed decisively that analytics can reveal the error of subjective judgment. But we also know they do not tell the whole story, and therefore cannot replace judgment entirely, especially when the final goals and outcomes are hard to even state, as is the case in research. I would suggest a useful approach to metrics like the one proposed here is to use them as a starting point for more in-depth investigation. If you are considering a graduate school, or a candidate to hire or promote, then read their papers and learn about the people. Stopping at highly lossy metrics is frankly irresponsible.
If the above has not made my opinion clear, let me emphasize: The analysis made possible by Swarat’s tool was never intended to be a final, authoritative statement. It was intended to be a piece of a larger picture, and in its current form it’s really a starting point in need of thoughtful comments and refinement. Thanks for providing your opinion! There will most certainly be followup work.
“Stopping at highly lossy metrics is frankly irresponsible.”
The problem is that people *will* do irresponsible things with any metrics you throw at them, and that the persons or institutions being evaluated by the metric often have *no control* over how it is used. The designers of the metric share a responsibility; they should, whenever possible, try to consider the potential negative effects of their choices. (I was worried about incentives to artificially increase publication counts; David about the incentive to publish in, for example, POPL rather than ICFP.)
(Insert here your preferred story of administrative staff or politics taking a metric out of its context to make decisions wasting considerable time and money. One I like is the fact that because the Shanghai Rankings are computed by total counts rather than per-capita proportions, politicians and higher-administration staff all over the world are ordering smaller universities and schools to cluster and *merge* in order to improve their Shanghai Ranking.)
I would expect proposed metrics to come with:
– a well-defined goal of which problem they are trying to solve (is it about prospective students? about making hiring decisions?) and what they are trying to measure/approximate
– a careful analysis of the positive and negative side-effects of having this metric become a de-facto standard
Thank you, David and Mike, for your comments. My position is basically what Mike said — the analysis that my app offers was never meant to be a final, authoritative statement. It is a first draft that ought to be refined through comments and discussion.
As David says, there are differences in opinion within the PL community about the relative quality of the four venues. But it’s also true that those debates are irrelevant to the goals of the tool, which is to give students a partial picture of where high-quality PL research is taking place. Also, as David says, SIGPLAN refers to all four venues as “flagship”. Given all this, I will coalesce the two PL categories some time in the next few days.
Thanks for manually adding NEU. The reason why the Brown dataset does not include it is not because that dataset is based on the first 50 USN&WR universities, but because it is based on outdated information. NEU is USN&WR rank 42 (see here).
Could you please output the whole list of universities, not just the first 13 or so? I am curious how NEU does in algorithms and complexity, and in fact found out about your work while trying to include NEU in the Brown dataset (which was a first step towards an answer).
I think it would be most useful to let the user specify how much weight to give to each venue. For example, I disagree with the weight given by your app (STOC/FOCS/SODA=1, rest 0), as well as with that of the ranking by Seddighin and Hajiaghayi.
Great article! Check out also TFE Times’ computer science rankings here: https://tfetimes.com/best-computer-science-program-rankings/
I should add that http://csrankings.org/ is the moral successor to the system presented in this post.