This piece was originally posted to the ARMA blog on 14 October 2020.
Lizzie Gadd and Richard Holmes share the initial findings of the INORMS Research Evaluation Working Group’s efforts to rate the World University Rankings.
When the INORMS Research Evaluation Working Group (REWG) was formed in 2016, Lizzie asked the representatives of twelve international research management societies where they felt we should focus our attention if we wanted to achieve our aim of making research evaluation more meaningful, responsible and effective. They were unanimous: the world university rankings. Although research managers are not always the ones in their institutions who deal with the world university rankers, they are one of the groups that feel their effect most keenly: exclusion from certain funding sources based on ranking position; requests to reverse-engineer various indicators to understand their scores; and calls to introduce policies that may lead to better ranking outcomes. And all whilst fully appreciating how problematic rankings are in terms of their methodology, their validity and their significance.
So what could be done? Well, it was clear that one of the key issues with the world ranking bodies is that they are unappointed and they answer to nobody. In an earlier blog post where Lizzie described the research evaluation environment as a food chain, she put them at the top: predators on which no-one predates. (Although some Scandinavian colleagues see them more as parasites that feed off the healthy organisms: taking but not giving back.) And of course the way to topple an apex predator is to introduce a new one: to make them answerable to the communities they rank. So this is what the INORMS REWG set about doing, by seeking to introduce an evaluation mechanism of their own to rate the rankers.
In some parallel work, the REWG were developing SCOPE, a five-step process for evaluating effectively, so we were keen to follow our own guidance when designing our ranker ratings. And this is how we did so:
Start with what you value
Our first step was to identify what it was we wanted from any mechanism seeking to draw comparisons between universities. What did we value? To this end we sought out wisdom from all those who’ve gone ahead of us in this space: the Berlin Principles on Ranking HEIs, the work of Ellen Hazelkorn, the CWTS principles for responsible use of rankings, the Leiden Manifesto, DORA, Yves Gingras, and many others. From their thoughts we synthesised a draft list of Criteria for Fair and Responsible University Rankings and put them out to the community for comment. We got feedback from a wide range of organisations: universities, academics, publishers and ranking organisations themselves. The feedback was then synthesised into our value document – what we valued about the entity (rankers) under evaluation. These fell into four categories: good governance, transparency, measure what matters, and rigour.
There are lots of reasons we evaluate things. What we’re trying to achieve here is a comparison of the various ranking organisations, with the ultimate purpose of incentivising them to do better. We want to expose where they differ from each other but also to highlight areas that the community cares about where they currently fall short. What we didn’t want to do is create another ranking. It would have been very tempting to do so: “ranking the rankings” has a certain ring to it. But not only would this mean that a ranking organisation got to shout about its league-table-topping status – something we didn’t want to endorse – but we wouldn’t be practising what we preached: a firm belief that it is not possible to place multi-faceted entities on a single scale labelled ‘Top’ and ‘Bottom’.
Options for evaluating
Once we had our list of values, we then set about translating them into measurable criteria – into indicators that were a good proxy for the quality being measured. As anyone who’s ever developed an evaluation approach will know, this is hard. But again, we sought to adhere to our own best practice by providing a matrix by which evaluators could provide both quantitative and qualitative feedback. Quantitative feedback took the form of a simple three-point scale according to whether the ranker fully met (2 marks), partially met (1 mark) or failed to meet (0 marks) the set criteria. Qualitative feedback took the form of free-text comments. To ensure transparency and mitigate bias as best we could, we asked a variety of international experts to each assess one of six ranking organisations against the criteria. INORMS REWG members also undertook evaluations, and, in line with the SCOPE principle of ‘evaluating with the evaluated,’ each ranker was also invited to self-assess. Only one ranking organisation, CWTS Leiden, accepted our offer to self-assess, and they provided free-text comments rather than scores. All this feedback was then forwarded to our senior expert reviewer, Dr Richard Holmes, author of the University Ranking Watch blog, and certainly one of the most knowledgeable University Rankings experts in the world. He was able to combine the feedback from our international experts with his own, often inside, knowledge of the rankings, to enable a really robust, expert assessment.
Of course all good evaluations should probe their approach, which is something we sought to do during the design stage, but something we also came back to post-evaluation. We observed some criteria where rankings might be disadvantaged for good practice – for example, where a ranking did not use surveys and so could not score on the criteria about survey quality. This led us to introduce ‘Not Applicable’ categories to ensure they would not be penalised. One or two questions were also multi-part, which made it difficult to assess fairly across the rankers. In any future iteration of the approach we would seek to correct this. We noted that the ‘partially meets’ category is also very broad, ranging from a touch short of perfect to a smidge better than fail. In future, a more granular five- or even ten-point grading system might provide a clearer picture as to where a ranking succeeds and where it needs to improve. In short, there were some learning points. But that’s normal. And we think the results provide a really important proof-of-concept for evaluating the world rankings.
So what did we find? Well, we applied our approach to six of the largest and most influential world university rankings: ARWU, THE WR, QS, U-Multirank, CWTS Leiden and US News & World Report. A full report will be forthcoming, and the data showing the expert assessments and senior expert calibrations are available. A spidergram of the quantitative element is given in Figure 1 and some headline findings are provided below.
Figure 1. Spidergram illustrating the actual scores/total possible score for each world ranker. The full data along with the important qualitative data is available.
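For readers who want to see how a ratio like the one plotted in Figure 1 could be derived, here is a minimal sketch. It assumes the three-point scale described above (2, 1 or 0 marks per criterion) and assumes that a ‘Not Applicable’ criterion is simply left out of both the actual and the possible total, which is one plausible way of making sure a ranker is not penalised. The function name `category_score`, the use of `None` for ‘Not Applicable’ and the example marks are all hypothetical; they are not taken from the REWG’s data or tooling.

```python
from typing import Optional, Sequence

# Three-point scale from the text: 2 = fully meets, 1 = partially meets,
# 0 = fails to meet. None stands in for 'Not Applicable' (our assumption).
MAX_MARK = 2
VALID_MARKS = (0, 1, 2)


def category_score(marks: Sequence[Optional[int]]) -> Optional[float]:
    """Return actual marks / total possible marks for one category.

    'Not Applicable' criteria (None) are excluded from both the numerator
    and the denominator, so they neither help nor penalise the ranker.
    Returns None if no criterion in the category is applicable.
    """
    applicable = [m for m in marks if m is not None]
    if not applicable:
        return None
    for mark in applicable:
        if mark not in VALID_MARKS:
            raise ValueError(f"mark must be 0, 1 or 2, got {mark!r}")
    return sum(applicable) / (MAX_MARK * len(applicable))


# Made-up marks for an imaginary ranker, purely for illustration.
example_marks = {
    "good governance": [2, 1, 0, 1, 1],
    "rigour": [1, None, 0, 1, 0],  # one criterion marked 'Not Applicable'
}

example_scores = {
    category: category_score(marks)
    for category, marks in example_marks.items()
}
print(example_scores)
```

Dropping ‘Not Applicable’ criteria from the denominator means the score is always a share of what a ranker could actually have achieved, which keeps the categories comparable on a spidergram even when the number of applicable criteria differs.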
Good governance
The five key expectations of rankers here were that they engaged with the ranked, were self-improving, declared conflicts of interest, were open to correction and dealt with gaming. In the main, all ranking organisations made some efforts towards good governance, but there were clear weaknesses in declaring conflicts of interest: no ranker really did so, even though selling access to their data and consultancy services was commonplace.
Transparency
The five expectations of rankers here were that they had transparent aims, transparent methods, transparent data sources, open data and financial transparency. Once again there were some strengths when it came to the transparency of the rankers’ aims and methods – even if arguably the methods didn’t always meet the aims. The weaknesses here were around the ability of a third party to replicate the results (only ARWU achieved full marks here), data availability, and financial transparency (where only U-Multirank achieved full marks).
Measure what matters
The five expectations of rankers here were that they drove good behaviour, measured against mission, measured one thing at a time (no composite indicators), tailored results to different audiences and gave no unfair advantage to universities with particular characteristics. Not surprisingly, this is where most rankings fell down. CWTS Leiden and U-Multirank scored top marks in terms of efforts to drive appropriate use of rankings and measuring only one thing at a time; the others barely scored. Similarly, CWTS Leiden and U-Multirank fared quite well on measuring against mission, unlike the others. But no ranking truly tailored their offer to different audiences, assuming that all users – students, funders, universities – would value the different characteristics of universities in the same way. And neither could any whole-heartedly say that they offered no unfair advantage to certain groups.
Rigour
The one thing university rankings are most criticised for is their methodological invalidity, and so it may come as no surprise that this was another weak section for most world rankers. Here we were looking for rigorous methods, no ‘sloppy’ surveys, validity, sensitivity and honesty about uncertainty. The ranker that did the best here by a country mile was CWTS Leiden, with perfect scores for avoiding the use of opinion surveys (joined by ARWU), good indicator validity (joined by U-Multirank), indicator sensitivity, and the use of error bars to indicate uncertainty. All other rankers scored their lowest in this section.
So there is clearly work to be done here, and we hope that our rating clearly highlights what needs to be done and by whom. And in case any ranking organisation seeks to celebrate their relative ‘success’ here, it’s worth pointing out that a score of 100% on each indicator is what the community would deem to be acceptable. Anything less leaves something to be desired.
One of the criticisms we anticipate is that our expectations are too high. How can we expect rankings to offer no unfair advantage? And how can we expect commercial organisations to draw attention to their conflicts of interest? Our answer would be that just because something is difficult to achieve doesn’t mean we shouldn’t aspire to it. Some of the sustainable development goals (no poverty, zero hunger) are highly ambitious, but also highly desirable. The beauty of taking a value-led approach, such as that promoted by SCOPE, is that we are driven by what we truly care about, rather than by the art of the possible, or the size of our dataset. If it’s not possible to rank fairly, in accordance with principles developed by the communities being ranked, we would argue that it is the rankings that need to change, not the principles.
We hope this work initiates some reflection on the part of world university ranking organisations. But we also hope it leads to some reflection by those organisations that set so much store by the world rankings: the universities that seek uncritically to climb them; the students and academics that blindly rely on them to decide where to study or work; and the funding organisations that use them as short-cuts to identify quality applicants. This work provides qualitative and quantitative evidence that the world rankings cannot, currently, be relied on for these things. There is no fair, responsible and meaningful university ranking. Not really. Not yet. There are just pockets of good practice that we can perhaps build on if there is the will. Let’s hope there is.