This post was originally published on the MetisTalk blog on 7 April 2022.
Across the UK, a growing number of universities are starting to appoint dedicated research culture staff. At Glasgow, I’ve been lucky enough to take on their research culture portfolio when the fabulous Tanita Casci and Elizabeth Adams, co-founders of Glasgow’s Research Culture work with Miles Padgett, both left within a few weeks of each other. Glasgow has a clear research culture action plan, and much has been achieved. However, a recent research culture survey has, once again, highlighted challenges we’re keen to address. So like many others beginning research culture roles, I find myself asking, where do you start? How do you prioritise what feel like equally pressing needs?
It’s tempting to go straight for the low-hanging fruit: the issues you can fix quickly with little resource; those things well within your jurisdiction. And why not? If they are genuine needs and you can address them swiftly, this might be a good use of your time. And you can show the community you’re dedicated to making progress.
The challenge is that quick-fixes rarely solve the most deep-seated research culture problems, and when you run your annual research culture survey next time around, you might find that the lived experience for most researchers hasn’t changed that much. A lot of initiatives in the research culture space (think EDI and well-being initiatives) come under attack because they start with the low-hanging fruit (think celebrating International Women’s Day or running yoga classes), leading the community to believe their actions are just window-dressing.
Biggest problem first?
So, perhaps we should tackle the biggest issues first? How to give researchers more time to actually do research? How to tackle job precarity amongst early career researchers? How to eradicate toxic power imbalances from our labs? There’s a strong and urgent need to address these issues in our organisations. And if we’re squeamish about this we certainly shouldn’t be taking jobs in research culture. However, many of the bigger issues can’t be solved unilaterally by one institution, or they are a function of wider systemic problems such as the university funding model. So if you just start here, the chances are you might get no further in your whole research culture career.
One approach, of course, is to see what crosses your path and go with the flow. An academic might propose a solution to a local problem that you can support. An external organisation might be offering a leadership training package you can buy into. You might get an opportunity to piggy-back on other internal or external developments that enhance your organisational culture.
However, whilst we don’t want to stick so rigidly to our plans that we miss out on the serendipitous, it feels a little reactive to just be tossed and blown by the winds of opportunity. Call me a control freak, but this isn’t the path for me.
Start with what you value?
As the Chair of the INORMS Research Evaluation Group I have been heavily involved in the development of their SCOPE framework for responsible research evaluation. As such, when thinking about ‘where to start’ with anything, I cannot help but return to its first tenet, to ‘Start with what you value’. Identifying where to start with our research culture must surely begin with a proper understanding of what we value about a positive research culture: what does good look like? And a sense of the gap between what we have and what we want.
Maybe it’s the researcher in me, but I’m convinced that surfacing the values of our research communities through workshops and surveys is a really good way to get to the heart of the matter. However, you won’t get unanimous agreement on a way forward. You’ll get views ranging from the jaded to the enthusiastic; from the big-pictures to the personal bugbears; from car-parking to career progression, and everything in between.
Where do you start?
I do think that the values surfaced through these exercises have to be our starting point though. We need a strong sense of the lived experience of our research communities: the good, the bad, and the ugly. And an understanding of the issues that mean the most to the most. But we need these values to translate into a portfolio of actions across the short, medium and long term. So, to return to my own question, ‘where do you start?’, I would answer, ‘all of the above’. We need some quick wins, some long-haul too-important-not-to-try ambitions, and an openness to opportunities that aren’t in the plan. What matters is that they are all in line with your institution’s ‘heart’ and that you can evidence your efforts, your progress, and even your failures, as you go.
Ultimately, whilst those of us who care deeply enough about research culture to make it our daily occupation are likely to care deeply about starting in the right place, perhaps it doesn’t really matter where you start, as long as you start?
Historically ‘getting on’ in academia demanded either a list-full of publications or a pocket-full of grant winnings – or both. Thankfully, times are changing. This blog post highlights some of the ways the sector is moving towards fairer research and researcher assessment, and how these developments might help doctoral (and post-doctoral) researchers take a healthier approach to career building. Many might be concepts that are still part of the Hidden Curriculum to you, and to many researchers, yet they will become an increasingly important part of research life.
You do you
At the end of 2021 the UKRI announced they would be introducing a new ‘Resume for Research’ for grant applicants. Developed in response to the over-use of unhelpful quantitative indicators in assessing researcher careers, such as the h-index, and Journal Impact Factors, this is the latest in a series of ‘narrative CV’ formats being introduced by grant funding organisations across Europe. Narrative CVs seek to go beyond reductive metrics and to ascertain, in a qualitative way, researchers’ actual contribution to scholarship: to knowledge, to the development of others, to the research community and to society itself.
In a similar way, the Contributor Role Taxonomy (CRediT) seeks to recognise a wider range of ‘inputs’ to ‘outputs’ including software development and data curation, and the Hidden REF campaign sought to celebrate all those people and contributions not rewarded by the current REF process.
Given this move to recognise a much broader range of contributions to research, it would seem sensible to plan your career on these terms. Think about the discoveries you want to make, the real-world impacts you want to have, and the person you want to be. Because if you stay in academia this is likely the way that you will be judged. However, when you break down your ambitions in this way you might decide that academia is not the best place to fulfil them. And that’s a good outcome too.
You can’t be good at everything
Whilst this move to value a broader range of contributions is definitely a good thing, it can inadvertently put even more pressure on researchers who now feel like they have to excel at everything. But it’s important to remember that this is not the point of a greater range of criteria. No-one can be good at everything, and no-one should be expected to be.
For this reason, my workplace, the University of Glasgow, takes a ‘preponderance’ approach to promotion criteria, where colleagues have to meet expectations in only four of seven different areas of scholarly activity. Early proponents of narrative CV approaches have also made it clear that early career researchers do not have to complete all the sections. Playing to your strengths is the way to reach fulfilment in so many areas of our lives, and these new assessment mechanisms should enable those with a wider range of strengths to thrive.
Do not compare yourself
Given that the sector is seeking to value a broader range of research contributions – and that we can’t be good at everything – it follows that there is no real benefit to be gained by constantly comparing ourselves to others. I know it’s tempting. People keep telling you that you’re in a crowded marketplace and competition for jobs is fierce. You feel like you need to keep a check on your progress and how you measure up to others who might be competing with you for your next post. But constantly comparing yourself to others is a sure-fire way to misery. In fact, in Matt Haig’s book ‘Reasons to stay alive’ he has a chapter entitled ‘How to be happy’ which simply repeats the phrase ‘Do not compare yourself’ over and over again.
You are unique. Your doctorate is (by definition) unique. What you bring to it is unique. The circumstances you face are unique. And although we’re trying hard to eradicate structural inequalities in academia, they are still in evidence and may well affect you. If your CV looks different to someone else’s, there will be a good reason for that. Comparing yourself to others will not lead to anything good. Don’t do it.
Serendipity plays its part
One of the reasons it’s futile to over-compare ourselves is that serendipity plays such a significant role in our careers. Ask any established researcher to talk through their CV and they will share a number of stories about being in the right place at the right time, a sudden political interest in their area of expertise, or a chance encounter. In fact, as Rich Pancost, Head of the School of Earth Sciences at Bristol University, writes about academic careers, “If you get the job you dreamt of, you are brilliant and lucky; and if you do not, it is because you are brilliant and unlucky.”
We are all constrained by opportunities and affected by privilege and luck – or lack of it. So, whilst we can plan all we like – and do plan! – we are not entirely in control of our own destiny. We can use this to our advantage by taking up networking opportunities which may connect us to new people and places. But remembering this when we learn of someone else’s success can be helpful too.
Take (& give) all the help you can get
My final thought on healthy career-building is to take – and give – all the help you can get. No one is an island and research is not a solo activity. Make the most of every conversation with your supervisor, your networks, your research office and doctoral college/graduate school. Get yourself a mentor and take every training opportunity. Find your people on social media and subscribe to some helpful blogs. But don’t just be a taker. To get the most out of these opportunities be a giver too. Share your own hints and tips on Twitter. Become a mentor yourself. Contribute to doctoral researcher networks. It is in giving that we receive, and a significant component of mental well-being is looking out for others where we can.
Being a doctoral or post-doctoral researcher is not an easy task, and uncertainties around your next step can be unsettling. This has been made more difficult by the excessive use of metrics in the assessment of researchers. It is early days, but an increased focus on research culture and drives towards fairer assessment do look set to recognise a broader range of skills and contributions to the research endeavour. Planning your career along these lines should lead to healthier and more fulfilling outcomes either within or outwith academia.
Lizzie Gadd makes the case for open research being required not rewarded.
I recently attended two events: the first was a workshop run by the ON-MERRIT team, a Horizon 2020 project seeking to understand how open research practices might actually worsen existing inequalities. And the second was the UKRI Enhancing Research Culture event at which I was invited to sit on a panel discussing how to foster an open research culture. At both events the inevitable question arose: ‘how do we incentivise open research?’.
And given the existing incentives system is largely based around evaluating and rewarding a researcher’s publications, citations, and journal choices, our instinct is to look to alternative evaluation mechanisms to entice them into the brave new world of open. It seems logical, right? In order to incentivise open research we simply need to measure and reward open research. If we just displace the Impact Factor with the TOP Factor, the h-index with the r-index and citation-based rankings with openness rankings, all will be well.
But to my mind this logic is flawed. Firstly, openness is not a direct replacement for citedness. Although both arguably have a link with ‘quality’ (openness may lead to it and citedness may indicate it) they are not quite the same thing. And it would be dangerous to assume that all open things are high quality things.
So we can add open research requirements to our promotion criteria, but we are still left with the conundrum as to how to assess research quality. And until an alternative for citations is found, folks are liable to keep relying on them as an easy (false) proxy. So we can think we’ve fixed the incentivisation problem by focusing on open research indicators, but we haven’t dealt with the much bigger and much more powerful disincentivisation problem of citation indicators.
Secondly, as I’ve argued before, open research practices are still unheard of by some and the processes by which to achieve them are not always clear. Open research practices need to be enabled before we can incentivise them. Of course related to this is the fact that some open research practices are completely irrelevant to some disciplinary communities (you’ll have a hard job pre-registering your sculpture). And undoubtedly those from wealthy institutions are likely to get much more support with open research practices than those from poorer ones. In this way, we’re in danger of embedding existing inequalities in our pursuit of open practices – as the ON-MERRIT team are exploring.
But in addition to these pragmatic reasons as to why we can’t easily incentivise open research by measuring it, there is a darned good reason why we shouldn’t turn to measurement to do this job for us. And that is that HE is already significantly over-evaluated. Researchers are assessed from dawn til dusk: for recruitment, probation, appraisal, promotion, grant applications, and journal peer review. There is no dimension of their work that goes unscrutinised: where they work, who they collaborate with, how much they have written, the grants they have won, the citations they’ve accrued, the impact of their work, the PGRs they’ve supervised – it’s endless. And this in combination with a highly competitive working environment makes academia a hotbed for toxic behaviours, mental health difficulties, and all the poor practices we blame on “the incentives”. (Although Tal Yarkoni recently did an excellent job of calling out those who rely on blaming the incentives to excuse poor behaviours).
If we’re looking to openness to improve our research culture, incentivising openness by measuring it feels pretty counterproductive to me. We don’t want to switch from narrow definitions of exceptionalism, to broader ‘open’ definitions of exceptionalism, but away from exceptionalism altogether. Adding open to a broken thing just leaves us with an open broken thing.
So how do we
Well, this is where I think we can learn from other aspects of our research environment. Because at the end of the day, open research practices are simply a set of processes, protocols and standards that we want all researchers to adhere to as relevant to their discipline. And we put plenty of these expectations on our researchers already, such as gaining ethical approvals, adhering to reporting guidelines, and following health & safety
There’s no glory associated with running due diligence on your research partners and following GDPR legislation won’t give you an advantage in a promotion case. These are basic professional expectations placed on every self-respecting researcher. And whilst there are no prizes for those who adhere to them, there are serious consequences for those that don’t. Surely this is what we want for open research? Not that it should be treated as an above-and-beyond option for the savvy few, but that it should be a bread-and-butter expectation on everyone.
Now I appreciate there is probably an interim period where institutions want to raise awareness of open research practices (as I said before, they need to be enabled before they can be incentivised). And during this period, running some ‘Open Research Culture Awards’ or offering ‘Open research hero badges’ to web pages might have their place. But we can’t dwell here for long. We need to move quite rapidly to this being a basic expectation on researchers. We have to define what open research expectations are relevant to each discipline. Add these expectations to our Codes of Good Research Practice. Train researchers in their obligations. Monitor (at discipline/HEI level) engagement with these expectations. And hold research leads accountable for the practices of their research groups.
To my mind, the same applies to measuring open research at institutional level, for example in REF exercises. We should require HEIs to expect and enable disciplinary appropriate open research practices from their researchers and to evidence that they a) communicate those expectations, b) support researchers to meet those expectations, and c) are improving on meeting those expectations. That’s all. No tricky counting mechanisms. No arbitrary thresholds. No extra points for services that are just the product of wealth.
Of course, if we are going to monitor take-up of open research at discipline and university level, we do need services that indicate institutional engagement with open research practices. But again I see this as being an interim measure, and more to highlight where work needs to be done than to give anyone boasting rights. When open research becomes the modus operandi for everybody, monitoring just becomes a quality assurance process. There’s no point ranking institutions on the percentage of their outputs that are open access when everybody hits 100%.
I know this doesn’t tackle the disincentivisation problem of journal impact factors, but open never have moved from a serials crisis (where the costs were high, the speeds were slow, and only a few could read them) to an open serials crisis (where the costs are high, the speeds are slow, and only a few can publish in them). To me this is a separate problem that could be fixed quite easily if funders placed far bolder expectations on their researchers to only publish on their own platforms – but that’s another
We all want open research and we all want to fix the incentives problem as we see this as slowing our progress towards open research. But I think offering up one as the solution to the other is not going to get us where we want to go. Indeed, I think it’s potentially in danger of exacerbating unhelpful tendencies towards exceptionalism when what we really want is boring old consistent, standards-compliant, rigorous research.
Campbell’s law rightly tells us that we get what we measure, but the inverse – that we need to measure something in order to get it – is not always true. In our rightful pursuit of all things open, I think it’s important that we remember this.
Elizabeth Gadd is Head of Research Operations at the University of Glasgow. She is the chair of the Lis-Bibliometrics Forum and co-Champions the ARMA Research Evaluation Special Interest Group. She also chairs the INORMS International Research Evaluation Working Group.
Lizzie Gadd argues that any commitment to responsible research assessment as outlined in DORA (Declaration on Research Assessment) and other such manifestos needs to include action on global university rankings. Highlighting four fundamental critiques of the way in which journal metrics and university rankings have been deployed in higher education, she proposes universities could unite around the principle of being ‘much more than their rank’.
More and more institutions are signing up to responsible metrics manifestos such as DORA – which is great. This is no doubt influenced by funder demands that they do so – which is also great. And these manifestos are having a positive impact on researcher-level evaluation – which is triply great. But, as we all know, researcher-level evaluation issues, such as avoiding Journal Impact Factors, are only one element of the sector’s research evaluation problems.
UKRI Chief Executive Ottoline Leyser recently pointed out that any evaluation further up the food-chain in the form of university- or country-level evaluations ultimately has an impact on individual researchers. And of course the most influential of these, at the top of the research evaluation food-chain, are the global university rankings.
So why, I often ask myself, do we laud universities for taking a responsible approach to journal metrics and turn a blind eye to their participation in, and celebration of, the global rankings?
Indeed, when you look at the characteristics of Journal Impact Factors (JIFs) and the characteristics of global university rankings, they both fall foul of exactly the same four critiques.
1. The construction problem
As DORA states, there are significant issues with the calculation of the JIF: the average cites per paper for a journal over two years. Firstly, providing the mean cites-per-paper of a skewed dataset is not statistically sensible. Secondly, whilst the numerator includes all citations to the journal, the denominator excludes ‘non-citable items’ such as editorials and letters – even if they have been cited. Thirdly, the time window of two years is arguably not long enough to capture citation activity in less citation-dense fields; as a result, you can’t compare a JIF in one field with that from another.
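To make the first two points concrete, here is a minimal sketch in Python. The citation counts are invented for illustration and `jif_style` is a hypothetical helper, not the official calculation; it only mirrors the shape described above (all citations in the numerator, only ‘citable’ items in the denominator).

```python
from statistics import mean, median

# Hypothetical two-year citation counts for one journal (illustrative only).
citable = [0, 0, 1, 1, 2, 2, 3, 4, 5, 120]  # articles and reviews
non_citable = [3, 7]                         # editorials and letters that were cited anyway


def jif_style(citable_cites, non_citable_cites):
    """All citations to the journal, divided by the number of 'citable' items only."""
    total_citations = sum(citable_cites) + sum(non_citable_cites)
    return total_citations / len(citable_cites)


jif = jif_style(citable, non_citable)  # citations to non-citable items still count
average = mean(citable)                # pulled upwards by the single 120-cite paper
typical = median(citable)              # the middle of the skewed distribution

print(f"JIF-style ratio: {jif:.1f}")
print(f"Mean cites per citable item: {average:.1f}")
print(f"Median cites per citable item: {typical}")
```

In this toy dataset one highly cited paper drags the mean far above what a typical paper receives, and the citations to editorials and letters push the ratio higher still.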
However, global university rankings are subject to even harsher criticisms about their construction. The indicators they use are a poor proxy for the concept they seek to evaluate (the use of staff:student ratios as a proxy for teaching quality for example). The concepts they seek to evaluate are not representative of the work of all universities (societal impacts are not captured at all). The data sources they use are heavily biased towards the global north. They often use sloppy reputation-based opinion polls. And worst of all, they combine indicators together using arbitrary weightings, a slight change in which can have a significant impact on a university’s rank.
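The weighting point can be sketched with a toy example. Everything here is assumed for illustration: the three universities, their indicator scores and both sets of weights are invented, and no real ranking uses exactly this formula.

```python
# Hypothetical indicator scores (0-100) for three invented universities.
scores = {
    "A": {"research": 90, "teaching": 60, "reputation": 70},
    "B": {"research": 70, "teaching": 85, "reputation": 70},
    "C": {"research": 80, "teaching": 70, "reputation": 75},
}


def rank(scores, weights):
    """Order universities by a weighted sum of their indicator scores."""
    composite = {
        name: sum(weights[k] * indicators[k] for k in weights)
        for name, indicators in scores.items()
    }
    return sorted(composite, key=composite.get, reverse=True)


# Two arbitrary weightings (percentages) that differ by five points.
weights_v1 = {"research": 40, "teaching": 30, "reputation": 30}
weights_v2 = {"research": 35, "teaching": 35, "reputation": 30}

order_v1 = rank(scores, weights_v1)
order_v2 = rank(scores, weights_v2)
print(order_v1)
print(order_v2)
```

Moving five percentage points from research to teaching is enough to reshuffle all three positions in this sketch, even though nothing about the institutions themselves has changed.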
2. The validity problem
Construction issues aside, problems with the JIF really began when it was repurposed from an indicator to decide which journals should appear in Garfield’s citation index, to one used by libraries to inform collection development, and then by researchers to choose where to publish and finally by readers (and others) to decide which research was the best for being published there. It had become an invalid proxy for quality, rather than a means of ensuring the most citations were captured by a citation index.
Whilst the JIF may have inadvertently found itself in this position, some of the global rankings quite deliberately over-state their meaning. Indeed, each of the ‘big three’ global rankings (ARWU, QS and THE WUR) claim to reveal which are the ‘top’ universities (despite using different methods for reaching their different conclusions). However, given the many and varied forms of higher education institutions on the planet, none of these high-profile rankings articulates exactly what their ‘top’ universities are supposed to be top at. The truth is that the ‘top’ universities are mainly top at being old, large, wealthy, English-speaking, research-focussed and based in the global north.
3. The application problem
Of course, once we have indicators that are an invalid proxy for the thing they claim to measure (JIFs signifying ‘quality’ and rankings signifying ‘excellence’) third parties will make poor use of them for decision-making. Thus, funders and institutions started to judge researchers based on the number of outputs they had in high-JIF journals, as though that somehow reflected on the quality of their research and of them as a researcher.
In a similar way, we know that some of the biggest users of the global university rankings are students seeking to choose where to study (even though no global ranking provides any reliable indication of teaching quality) because who doesn’t want to study at a ‘top’ university? But it’s not just students; institutions and employers are also known to judge applicants based on the rank of their alma mater. Government-funded studentship schemes will also often only support attendance at top 200 institutions.
4. The impact problem
Ultimately, these issues have huge impacts on both individual careers and the scholarly enterprise. The problems associated with the pursuit of publication in high-JIF journals have been well-documented and include higher APC costs, publication delays, publication of only positive findings on hot topics, high retraction rates, and negative impacts on the transition to open research practices.
The problems associated with the pursuit of a high university ranking are less well-documented but are equally, if not more, concerning. At individual level, students can be denied the opportunity to study at their institution of choice and career prospects can be hampered through conscious or unconscious ranking-based bias. At institution level, ranking obsession can lead to draconian hiring, firing and reward practices based on publication indicators. At system level we see increasing numbers of countries investing in ‘world-class university’ initiatives that concentrate resource in a few institutions whilst starving the rest. There is a growing inequity both within and between countries’ higher education offerings that should seriously concern us all.
What to do?
If we agree that global university rankings are as problematic a form of irresponsible research evaluation as the Journal Impact Factor, we have to ask ourselves why action on their usage and promotion does not form an explicit requirement of responsible metrics manifestos. An easy answer is that universities are the ‘victim’ not the perpetrator of the rankings. However, universities are equally complicit in providing data to, and promoting the outcomes of, global rankings. The real answer is that the rankings are so heavily used by those outside of universities that not to participate would amount to financial and reputational suicide.
Despite this, universities do have both the power and the responsibility to take action on global university rankings that would be entirely in keeping with any claim to practise responsible metrics. This could involve:
Avoiding setting KPIs based on the current composite global university rankings.
Avoiding promoting a university’s ranking outcome.
Avoiding legitimising global rankings by hosting, attending, or speaking at, ranking-promoting summits and conferences.
Working together with other global universities to redefine university quality (or more accurately, qualities) and to develop better ways of evaluating these.
I recently argued that university associations might develop a ‘Much more than our rank’ campaign. This would serve all universities equally – from those yet to get a foothold on the current rankings, to those at the top. Every university has more to offer than is currently measured by the global university rankings – something that I’m sure even the ranking agencies would admit. Such declarations would move universities from judged to judge, from competitor to collaborator. It would give them the opportunity to redefine and celebrate the diverse characteristics of a thriving university beyond the rankings’ narrow and substandard notions of ‘excellence’.
The time has come for us to extend our definition of responsible metrics to include action with regards to the global university rankings. I’m not oblivious to the challenges, and I am certainly open to dialogue about what this might look like. But we shouldn’t continue to turn a blind eye to the poor construction, validity, application and impact of global rankings, whilst claiming to support and practise responsible metrics. We have to start somewhere, and we have to do it together, but we need to be brave enough to engage in this conversation.
The author is very grateful to Stephen Curry for feedback on the first draft of this blogpost.
Lizzie Gadd & Gareth Cole discuss the practical challenges of monitoring progress towards institutional open research data ambitions.
Loughborough University has recently introduced a new Open Research Position Statement which sets out some clear ambitions for open access, open data and open methods. As part of this work we’re looking at how we can monitor our progress against those ambitions. Of course, for open access, we’re all over it. We have to count how many of our outputs are available openly in accordance with various funder policies anyway. But there are no equivalent demands for data. OK, all data resulting from some funded projects need to be made available openly, but no-one’s really counting – yet. And anyway, our ambitions go beyond that – we’d like to encourage all data to be made available openly and ‘FAIRly’ where possible.
So how do we measure that? Well, with difficulty it would seem. And here’s why:
In the world of journal articles, although there are disciplinary differences as to the number of articles that are produced, every article is roughly equivalent in size and significance to another. Data are not like that. What qualifies as a single unit of data, thus receiving its own metadata record, might be a photograph or a five-terabyte dataset. So it would be a bit unfair to compare the volume of these. And there is currently no agreement as to ‘how much data’ (in size, effort, or complexity) there needs to be to qualify for a unique identifier.
But it’s not just what counts as a unit of data but what counts as a citable unit of data that differs. A deposit of twenty files could have one DOI/Identifier or twenty DOIs depending on how it is split up. This means that potentially there could be citation advantages or disadvantages for those that deposit their data in aggregate or individually – but this would entirely depend on how the citer chooses to cite it.
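A small sketch shows how the same amount of citing can look quite different depending on granularity. The identifiers below are placeholders rather than real DOIs, and `deposit_of` is a hypothetical helper that assumes the deposit name can be read from the identifier, which real DOIs do not guarantee.

```python
from collections import Counter

# Hypothetical citation events: each entry is the identifier the citer chose.
# Deposit A has one identifier for all twenty files; deposit B has one per file.
citations = [
    "deposit-A", "deposit-A", "deposit-A",
    "deposit-B/file-01", "deposit-B/file-02", "deposit-B/file-07",
]


def deposit_of(identifier):
    """Map a file-level identifier back to its parent deposit (assumed naming scheme)."""
    return identifier.split("/")[0]


per_identifier = Counter(citations)
per_deposit = Counter(deposit_of(c) for c in citations)

print(per_identifier.most_common())
print(per_deposit.most_common())
```

Counted per identifier, deposit A appears three times as cited as any single item in deposit B; counted per deposit, the two are level.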
For journal articles, full-text versions are duplicated all over the place. The same article might be available on multiple repositories, pre-print servers and the publisher’s site. In fact, whilst there are concerns about version control, there are many benefits to such duplicates in terms of discovery and archiving (Lots of Copies Keeps Stuff Safe [LOCKSS] and all that). But for data, it’s not good practice to duplicate in different repositories. This is both for versioning reasons (if you update in one place, you then have to update in others) and for DOI reasons (two instances usually means two DOIs which means that any data citations you get will be split across two sources).
So if we wanted to identify all the data produced by Loughborough academics, we’d have a pretty difficult job doing it. Some will be on our repository, but other data will be spread across multiple different archives. Data citation and linking services such as DataCite and Scholix may ultimately offer a solution here of course, but as others have noted, these have a long way to go before they are truly useful for monitoring purposes. DataCite only indexes data with a DataCite DOI. And Scholix only surfaces links between articles and datasets, not the independent existence of data. Some services, such as Dimensions, only index records that have an item type of “Dataset”. This means that data taking other forms such as “media” or “figure” won’t appear in Dimensions, thus disenfranchising those researchers who use “data” but not “datasets”.
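The effect of filtering on item type can be sketched in a few lines. This is a toy model with invented records and a hypothetical `indexed` function; it does not reflect how any particular service is actually implemented.

```python
# Hypothetical repository records; titles and item types are invented.
records = [
    {"title": "Survey responses", "item_type": "Dataset"},
    {"title": "Interview recordings", "item_type": "Media"},
    {"title": "Microscopy images", "item_type": "Figure"},
]


def indexed(records, accepted_types=("Dataset",)):
    """Keep only records whose item type is on the accepted list."""
    return [r for r in records if r["item_type"] in accepted_types]


visible = indexed(records)
missed = [r["title"] for r in records if r not in visible]

print(f"Indexed: {[r['title'] for r in visible]}")
print(f"Not indexed: {missed}")
```

Two of the three deposits are research data in any reasonable sense, yet a type-based filter never sees them.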
But the biggest problem these services face is that they rely on consistent high-quality metadata collection, curation and sharing by all the different data repositories. And we’re just not there yet. (Although all repositories which mint DataCite DOIs will need to comply with the minimum/mandatory requirements to mint the DOI in the first place). And a particular problem for institutions seeking to measure their own dispersed data output is that few repositories expose author affiliation data even where they do comprehensively collect it. And this leads us on to our third point.
The authorship of journal articles is increasingly controlled and subject to much guidance. Learned societies provide guidance, journals provide guidance, institutions sometimes have their own guidance. The CRediT taxonomy (whilst not without problems) was introduced to make it absolutely explicit as to who did what on a journal article. The same is not usually true of data.
Of course, data is created rather than authored as the DataCite schema makes clear. But there is no way of ensuring that all the data creators have been added to the metadata record, even if the depositors do always know who they are. And whilst there is no real glory associated with data ownership, this problem isn’t going to be quickly resolved. As with journal articles, often the list of contributors is likely to be very long so there needs to be some incentive to do this carefully and well.
Any scientist can be forced to write bad metadata.
There seems to be scope for some CRediT-type contributor guidance for data to ensure all the right people get the right credit. (Maybe a Research Data Alliance Working Group?) And then there needs to be some motivation for depositors to stick to it.
Although the standard of journal peer review is variable and hotly contested as a mechanism to signify quality, at least all journal articles are subject to some form of it prior to publication. Data are not currently peer reviewed (unless submitted as a data paper or if the dataset is provided as supplementary information to a journal submission). And although data can be cited, this appears to be still comparatively quite rare. This may partly be due to the challenge of ‘counting’ data citations due to huge variations in citation quality, whether data is cited at all (or added as part of a data availability statement), and disciplinary differences in the way this is done. And there is a big difference between a data citation which just states ‘this data is available and relevant to my study’ and a data citation which signifies that ‘the data in question has actually been re-analysed or repurposed in my study’. But a data citation doesn’t currently differentiate between the two.
There is also the challenge that FAIR data may not be open data, and vice versa. Some data can never be open data for confidentiality reasons. So in our attempt to pursue open research practices, and given a choice, what do we count? Open data that may not be FAIR? FAIR data that may not be open? Or only data that are both? And if so, how fair is that?
So where does this leave us at Loughborough? Well, in a less than satisfactory situation to be honest. We could look at the number of data deposits (or depositors) in our Research Repository per School over time to give us an idea of growth. But this will only give us a very partial picture. We could do a similar count of research projects on ResearchFish with associated data, or e-theses with related data records, but again, this will only give us a small window onto our research data activity. Going forward we might look at engagement with the data management planning tool, DMP Online, over time, but again this is likely to shine more light on disciplines that have to provide DMPs as part of funding applications and PhD studies.
So, whilst we can encourage individuals to deposit data, and require narrative descriptions of their engagement with this important practice for annual appraisals, promotions, and recruitment, we have no meaningful way of monitoring this engagement at department or University-level. And as for benchmarking our engagement with that happening elsewhere, this currently feels like it’s a very long way off.
The big fear of course, is that this is where commercial players rock up and offer to do this all for us – for a price. (Already happening). And given that data is a research output not currently controlled by such outfits, it would be a very great shame to have to pay someone else to tell us about the activities of our own staff. In our view.
Really hoping that some clever, community-minded data management folk are able to help with this.
Lizzie Gadd speculates as to why Elsevier endorsed the Leiden Manifesto rather than signing DORA, and what the implications might be.
If an organisation wants to make a public commitment to responsible research evaluation they have three main options: i) sign DORA, ii) endorse the Leiden Manifesto (LM), or iii) go bespoke – usually with a statement based on DORA, the LM, or the Metric Tide principles.
The LIS-Bibliometrics annual responsible metrics survey shows that research-performing organisations adopt a wide range of responses to this including sometimes signing DORA and adopting the LM. But when it comes to publishers and metric vendors, they tend to go for DORA. Signing DORA is a proactive, public statement and there is an open, independent record of your commitment. DORA also has an active Chair in Professor Stephen Curry, and a small staff in the form of a program director and community manager, all of whom will publicly endorse your signing which leads to good PR for the organisation.
A public endorsement of the LM leads to no such fanfare. Indeed, the LM feels rather abandoned by comparison. Despite a website and blog, there has been little active promotion of the Manifesto, nor any public recognition for anyone seeking to endorse it. Indeed one can’t help wondering how differently the LM would operate if it had been born in a UK institution subject to the impact-driven strictures of the REF?
But despite this, Elsevier recently announced that they had chosen the Leiden Manifesto over DORA. Which leads us to ask i) why? And ii) what will this mean for their publishing and analytics business?
Why not DORA?
Obviously I wasn’t party to the conversations that led to this decision and can only speculate. But for what it’s worth, my speculation goes a bit like this:
So, unlike the LM which provides ten principles to which all adopters should adhere, DORA makes different demands of different stakeholders. So research institutions get off pretty lightly with just two requirements: i) don’t use journals as proxies for the quality of papers, and ii) be transparent about your reward criteria. Publishers and metrics suppliers, however, are subject to longer lists (see box) and of course, Elsevier are both. And it is within these lists of requirements that I think we find our answers.
Positioning CiteScore as the JIF’s responsible twin.
Firstly, DORA demands that publishers ‘greatly reduce emphasis on JIF as a promotional tool’. However, Elsevier have invested heavily in CiteScore (their alternative to the JIF) and are not likely to want to reduce emphasis on it. Indeed the press release announcing their endorsement of the LM provided as an example, the way they’d recently tweaked the calculation of CiteScore to ensure it met some of the LM principles, positioning it as a ‘responsible metric’ if you will. This is something they’d struggle to get away with under DORA.
Open citations? Over my dead body
One of the less well-discussed requirements of DORA for publishers is to “remove all reuse limitations on reference lists in research articles and make them available under the Creative Commons Public Domain Dedication.” In other words, DORA expects publishers to engage with open citations. This is something Elsevier have infamously failed to do.
Open data? You’ll have to catch me first
And finally, DORA expects metric suppliers to not only “be open and transparent by providing data and methods used to calculate all metrics” (which they partly do for subscribers) but to “Provide the data under a licence that allows unrestricted reuse, and provide computational access to data, where possible” (which they don’t).
So whereas DORA is a relatively easy sign for HEIs (only two requirements), for publishers it’s more tricky than might first appear (five requirements), and an organisation like Elsevier, which also supplies metrics, has to contend with a further four requirements, which would essentially eat away at their profits. And we all know that they’re only just scraping by, bless them.
The impact of endorsing the Leiden Manifesto
But isn’t it good enough that they’ve endorsed the Leiden Manifesto? After all, it’s a comprehensive set of ten principles for the responsible use of bibliometrics? Well, being a seasoned grumbler about some of the less savoury aspects of Elsevier’s SciVal, I decided to take to the discussion lists to see whether they saw this move as the beginning or the end of their responsible metrics journey. Was this the start of a huge housekeeping exercise which would sweep away the h-index from researcher profiles? Disinfect the unstable Field-Weighted Citation Index from author rankings? And provide health-warnings around some of the other over-promising and under-delivering indicators?
“There is nothing inherently wrong with the h-index” said Holly Falk-Krzesinski, Elsevier’s Vice-President for Research Intelligence, pointing to three of the Leiden Manifesto’s principles where she felt it passed muster. (Despite on the same day, Elsevier’s Andrew Plume questioning its validity). And as part of a basket of metrics, she considers the FWCI is a perfectly usable indicator for researchers. (Something Elsevier’s own SciVal Advisors disagree with). And she believes the h-index is “not displayed in any special or prominent way” on Pure Researcher Profiles. Erm…
And after several rounds of this, frankly, I gave up. And spent a weekend comfort-eating Kettle chips. Because I care deeply about this. And, honestly, it felt like to Elsevier it was just another game to be played.
Responsible is as responsible does
Back in 2018 I made the point that if we weren’t careful, responsible metrics statements could, in an ironic turn, easily become ‘bad metrics’, falsely signifying a responsible approach to metrics that wasn’t there in practice. And the reason these statements are so vulnerable to this is that neither DORA nor the LM are formally policed. Anyone can claim to be a follower and the worst that can happen is that someone calls out your hypocrisy on Twitter. Which does happen. And is sometimes even effective.
It is for this reason that the Wellcome Trust have stated that adopting a set of responsible metrics principles is not enough. If you want to receive their research funding from 2021, you need to demonstrate that you are acting on your principles. Which is fair. After all, if you want Athena Swan accreditation, or Race Equality Chartership or a Stonewall Charter, you have to provide evidence and apply for it. It’s not self-service. You can’t just pronounce yourself a winner. And I can’t help wondering: yes, Elsevier has endorsed the Leiden Manifesto, but would the Leiden Manifesto (given the chance) endorse Elsevier?
Now I know that CWTS and DORA would run a mile from such a proposition, but that doesn’t mean it’s not needed. Responsible-metrics-washing is rife. And whilst I’d rather folks washed with responsible metrics than anything else – and I’m sure a few good things will come out of it – it does rather feel like yet another instance of a commercial organisation paying lip-service to a community agenda for their own ends (see also: open access and copyright retention).
Right on cue, Helen Lewis in The Atlantic recently described the “self-preservation instinct [that] operates when private companies struggle to acclimatize to life in a world where many consumers vocally support social-justice causes”. “Progressive values are now a powerful branding tool” she writes, and “Brands will gravitate toward low-cost, high-noise signals as a substitute for genuine reform, to ensure their survival.” Correct me if I’m wrong but that sounds pretty apposite?
Of course, it’s early days for Elsevier’s Leiden Manifesto journey and Andrew Plume did seek to reassure me in a video call that they were still working through all the implications. So let’s hope I’m worrying about nothing and we’ll be waving goodbye to the h-index in Elsevier products any day soon. But if nothing does transpire, I know as the developer of a responsible metrics model myself, that I’d feel pretty sick about it being used as empty virtue-signalling. And it does occur to me that funders seeking to hold institutions to account for their responsible research evaluation practices might do well to direct their attention to the publishers they fund.
Otherwise I fear it really will be a case of, well, Elsevier have endorsed the Leiden Manifesto: so what?
This piece was originally posted to the ARMA blog on 14 October 2020.
Lizzie Gadd and Richard Holmes share the initial findings of the INORMS Research Evaluation Working Group’s efforts to rate the World University Rankings.
When the INORMS Research Evaluation Working Group (REWG) was formed in 2016, Lizzie asked the representatives of twelve international research management societies where they felt we should focus our attention if we wanted to achieve our aim of making research evaluation more meaningful, responsible and effective. They were unanimous: the world university rankings. Although research managers are not always the ones in their institutions that deal with the world university rankers, they are one of the groups that feel their effect most keenly: exclusion from certain funding sources based on ranking position; requests to reverse engineer various indicators to understand their scores, and calls to introduce policies that may lead to better ranking outcomes. And all whilst fully appreciating how problematic rankings are in terms of their methodology, their validity and their significance.
So what could be done? Well, it was clear that one of the key issues with the world ranking bodies is that they are unappointed and they answer to nobody. In an earlier blog post where Lizzie describes the research evaluation environment as a food chain, she put them at the top: predators on which no-one predates. (Although some Scandinavian colleagues see them more as parasites that feed off the healthy organisms: taking but not giving back). And of course the way to topple an apex predator, is to introduce a new one: to make them answerable to the communities they rank. So this is what the INORMS REWG set about doing, by seeking to introduce an evaluation mechanism of their own to rate the rankers.
In some parallel work, the REWG were developing SCOPE, a five-step process for evaluating effectively, so we were keen to follow our own guidance when designing our ranker ratings. And this is how we did so:
Start with what you value
Our first step was to identify what it was we wanted from any mechanism seeking to draw comparisons between universities. What did we value? To this end we sought out wisdom from all those who’ve gone ahead of us in this space: the Berlin Principles on Ranking HEIs, the work of Ellen Hazelkorn, the CWTS principles for responsible use of rankings, the Leiden Manifesto, DORA, Yves Gingras, and many others. From their thoughts we synthesised a draft list of Criteria for Fair and Responsible University Rankings and put them out to the community for comment. We got feedback from a wide range of organisations: universities, academics, publishers and ranking organisations themselves. The feedback was then synthesised into our value document – what we valued about the entity (rankers) under evaluation. These fell into four categories: good governance, transparency, measure what matters, and rigour.
There are lots of reasons we evaluate things. What we’re trying to achieve here is a comparison of the various ranking organisations, with the ultimate purpose of incentivising them to do better. We want to expose where they differ from each other but also to highlight areas that the community cares about where they currently fall short. What we didn’t want to do is create another ranking. It would have been very tempting to do so: “ranking the rankings” has a certain ring to it. But not only would this mean that a ranking organisation got to shout about its league-table-topping status – something we didn’t want to endorse – but we wouldn’t be practising what we preached: a firm belief that it is not possible to place multi-faceted entities on a single scale labelled ‘Top’ and ‘Bottom’.
Options for evaluating
Once we had our list of values, we then set about translating them into measurable criteria – into indicators that were a good proxy for the quality being measured. As anyone who’s ever developed an evaluation approach will know, this is hard. But again, we sought to adhere to our own best practice by providing a matrix by which evaluators could provide both quantitative and qualitative feedback. Quantitative feedback took the form of a simple three-point scale according to whether the ranker fully met (2 marks), partially met (1 mark) or failed to meet (0 marks) the set criteria. Qualitative feedback took the form of free-text comments. To ensure transparency and mitigate against bias as best we could, we asked a variety of international experts to each assess one of six ranking organisations against the criteria. INORMS REWG members also undertook evaluations, and, in line with the SCOPE principle of ‘evaluating with the evaluated,’ each ranker was also invited to self-assess. Only one ranking organisation, CWTS Leiden, accepted our offer to self-assess and they provided free-text comments rather than scores. All this feedback was then forwarded to our senior expert reviewer, Dr Richard Holmes, author of the University Ranking Watch blog, and certainly one of the most knowledgeable University Rankings experts in the world. He was able to combine the feedback from our international experts with his own, often inside, knowledge of the rankings, to enable a really robust, expert assessment.
Of course all good evaluations should probe their approach, which is something we sought to do during the design stage, but something we also came back to post-evaluation. We observed some criteria where rankings might be disadvantaged for good practice – for example, where a ranking did not use surveys and so could not score. This led us to introducing ‘Not Applicable’ categories to ensure they would not be penalised. One or two questions were also multi-part which made it difficult to assess fairly across the rankers. In any future iteration of the approach we would seek to correct this. We noted that the ‘partially meets’ category is also very broad, ranging from a touch short of perfect to a smidge better than fail. In future, a more granular five- or even ten-point grading system might provide a clearer picture as to where a ranking succeeds and where it needs to improve. In short, there were some learning points. But that’s normal. And we think the results provide a really important proof-of-concept for evaluating the world rankings.
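The scoring arithmetic described above can be sketched in a few lines. This is an illustration only: the post does not say exactly how marks were aggregated, so the `category_score` helper, the use of `None` for ‘Not Applicable’, and the example marks are all assumptions.

```python
from typing import Optional

# Marks on the three-point scale; None is an assumed stand-in for 'Not Applicable'.
FULLY, PARTIALLY, FAILED = 2, 1, 0

def category_score(marks: list[Optional[int]]) -> Optional[float]:
    """Return actual score / total possible score, leaving out N/A criteria."""
    applicable = [m for m in marks if m is not None]
    if not applicable:
        return None  # nothing to score in this category
    return sum(applicable) / (FULLY * len(applicable))

# Hypothetical marks for one ranker against five criteria in one category.
example_marks = [FULLY, None, PARTIALLY, FAILED, PARTIALLY]
print(f"{category_score(example_marks):.0%}")  # prints "50%"
```

Excluding ‘Not Applicable’ criteria from the denominator is one way to avoid penalising a ranker for, say, not using surveys; counting them as zero would lower the example score to 40%.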
Figure 1. Spidergram illustrating the actual scores/total possible score for each world ranker. The full data along with the important qualitative data is available.
The five key expectations of rankers here were that they engaged with the ranked, were self-improving, declared conflicts of interest, were open to correction and dealt with gaming. In the main all ranking organisations made some efforts towards good governance, with clear weaknesses in terms of declaring conflicts of interest: no ranker really did so, even though selling access to their data and consultancy services was commonplace.
The five expectations of rankers here were that they had transparent aims, methods, data sources, open data and financial transparency. Once again there were some strengths when it came to the transparency of the rankers’ aims and methods – even if arguably the methods didn’t always meet the aims. The weaknesses here were around the ability of a third-party to replicate the results (only ARWU achieved full marks here), data availability, and financial transparency (where only U-Multirank achieved full marks).
Measure what matters
The five expectations of rankers here were that they drove good behaviour, measured against mission, measured one thing at a time (no composite indicators), tailored results to different audiences and gave no unfair advantage to universities with particular characteristics. Not surprisingly, this is where most rankings fell down. CWTS Leiden and U-Multirank scored top marks in terms of efforts to drive appropriate use of rankings and measuring only one thing at a time; the others barely scored. Similarly, Leiden and U-Multirank fared quite well on measuring against mission, unlike the others. But no ranking truly tailored their offer to different audiences, assuming that all users – students, funders, universities – would value the different characteristics of universities in the same way. And neither could any whole-heartedly say that they offered no unfair advantage to certain groups.
The one thing university rankings are most criticised for is their methodological invalidity, and so it may come as no surprise that this was another weak section for most world rankers. Here we were looking for rigorous methods, no ‘sloppy’ surveys, validity, sensitivity and honesty about uncertainty. The ranker that did the best here by a country mile was CWTS Leiden, with perfect scores for avoiding the use of opinion surveys (joined by ARWU), good indicator validity (joined by U-Multirank), indicator sensitivity, and the use of error bars to indicate uncertainty. All other rankers scored their lowest in this section.
So there is clearly work to be done here, and we hope that our rating clearly highlights what needs to be done and by whom. And in case any ranking organisation seeks to celebrate their relative ‘success’ here, it’s worth pointing out that a score of 100% on each indicator is what the community would deem to be acceptable. Anything less leaves something to be desired.
One of the criticisms we anticipate is that our expectations are too high. How can we expect rankings to offer no unfair advantage? And how can we expect commercial organisations to draw attention to their conflicts of interest? Our answer would be that just because something is difficult to achieve, doesn’t mean we shouldn’t aspire to it. Some of the sustainable development goals (no poverty, zero hunger) are highly ambitious, but also highly desirable. The beauty of taking a value-led approach, such as that promoted by SCOPE, is that we are driven by what we truly care about, rather than by the art of the possible, or the size of our dataset. If it’s not possible to rank fairly, in accordance with principles developed by the communities being ranked, we would argue that it is the rankings that need to change, not the principles.
We hope this work initiates some reflection on the part of world university ranking organisations. But we also hope it leads to some reflection by those organisations that set so much store by the world rankings: the universities that seek uncritically to climb them; the students and academics that blindly rely on them to decide where to study or work; and the funding organisations that use them as short-cuts to identify quality applicants. This work provides qualitative and quantitative evidence that the world rankings cannot, currently, be relied on for these things. There is no fair, responsible and meaningful university ranking. Not really. Not yet. There are just pockets of good practice that we can perhaps build on if there is the will. Let’s hope there is.
Courtesy of the pestilence currently scourging our planet, I’ve been able to accept four opportunities to speak this Autumn, as I will be doing so from the comfort of my own home office. For anyone interested in tuning in, I’ve provided the details here and will update this with more intel as I have it.
22-Sep-20 08.30 BST: Finnish Ministry of Education & Culture
Bibliometrics: Diversity’s friend or foe? Assessing research performance using bibliometrics alone does not help create a diverse research ecosystem. But can bibliometrics ever be used to support diversity? And if not, how else can we evaluate what we value about research?
07-Oct-20 17.00 BST: NIH Bibliometrics & Research Evaluation Symposium
The Five Habits of Highly-Effective Bibliometric Practitioners Drawing on ten years’ experience supporting bibliometric and research evaluation practitioner communities, this presentation will highlight five habits of highly effective practitioners providing practical hints and tips for those seeking to support their own communities with robust research evaluation.
15-Oct-20 08.15 BST: 25th Nordic Workshop on Bibliometrics and Research Policy
The Research Evaluation Food Chain and how to fix it. Poor research evaluation practices are the root of many problems in the research ecosystem and there is a need to introduce change across the whole of the ‘food chain’. This talk will consider the challenge of lobbying for change to research evaluation activities that are outside your jurisdiction – such as senior managers and rankings (introducing the work of INORMS REWG), vendors and ‘freemium’ citation-based services.
20-Oct-20 15.00 BST: Virginia Tech Open Access Week
Counting What Counts In Recruitment, Promotion & Tenure. What we reward through recruitment, promotion and tenure processes is not always what we actually value about research activity. This talk will explore how we can pursue value-led evaluations – and how we can persuade senior leaders of their benefits.
Lizzie Gadd gets all fancy talking about algorithms, machine learning and artificial intelligence. And how tools using these technologies to make evaluative judgements about publications are making her nervous.
A couple of weeks ago, The Bibliomagician posted an interesting piece by Josh Nicholson introducing scite. scite is a new Artificial Intelligence (AI) enabled tool that seeks to go beyond citation counting to citation assessment, recognising that it’s not necessarily the number of citations that is meaningful, but whether they support or dispute the paper they cite.
scite is one of a range of new citation-based discovery and evaluation tools on the market. Some, like Citation Gecko, Connected Papers and CoCites, use the citation network in creative ways to help identify papers that might not appear in your results list through simple keyword matching. They use techniques like co-citation (where two papers appear together in the same reference list) or bibliographic coupling (where two papers cite the same paper) as indicators of similarity. This enables them to provide “if you like this you might also like that” type services.
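For anyone who likes to see these things spelled out, here is a minimal sketch of the two similarity measures just described. The `references` data and both helper functions are hypothetical; this is not how any of the named tools is actually implemented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical citation data: each citing paper maps to the papers in its reference list.
references = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"Y", "Z"},
}

def co_citation_counts(refs):
    """Count how often each pair of papers appears together in a reference list."""
    counts = defaultdict(int)
    for cited in refs.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return dict(counts)

def coupling_strength(refs, p, q):
    """Bibliographic coupling: number of references shared by citing papers p and q."""
    return len(refs[p] & refs[q])

print(co_citation_counts(references))
print(coupling_strength(references, "A", "B"))
```

In this toy data X and Y are co-cited twice (by A and B), while A and B are coupled through two shared references. A paper that nobody cites never appears in any co-citation pair, so it cannot be recommended on that basis.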
I mean, there is an obvious need to understand the nuance of the citation network more fully. The main criticism of citation-based evaluation has always been that citations are wrongly treated as always a good thing. In fact, the Citation Typing Ontology lists 43 different types of citation (including my favourite, ‘is-ridiculed-by’). Although the fact that the majority are positive (<0.6% of citations are negative by scite’s calculations) itself may indicate a skewing of the scholarly record. Why cite work you don’t rate, knowing it will lead to additional glory for that paper? So if we can use new technologies to provide more insight into the nature of citation, this is a positive thing. If it’s reliable. And this is where I have questions. And although I’ve dug into this a bit, I freely admit that some of my questions might be born of ignorance. So feel free to use the comments box liberally to supplement my thinking.
A bit about the technologies
All search engines use algorithms (sets of human encoded instructions) to return the results that match our search terms. Some, like Google Scholar, will use the citedness of papers as one element of its algorithm to sort the results in an order that may give you a better chance of finding the paper you’re looking for. And we already know that this is problematic in that it compounds the Matthew Effect: the more cited a paper is, the more likely it will surface in your search results, thereby increasing its chances of getting read and further cited. And of course, the use of more complex citation network analysis for information discovery can contribute to the same problem: by definition the less cited works are going to be less well-connected and thus returned less often by the algorithm.
But it’s the use of natural language processing (NLP) to ‘read’ the full text of papers and artificial intelligence or machine learning to find patterns in the data that concerns me more. So whereas historically humans might provide a long list of instructions to tell computers how to identify an influential paper, ML works by providing a shed load of examples of what an influential paper might look like, and leaving the AI to learn for itself. When the AI gets it right, it gets rewarded (reinforcement learning) and so it goes on to achieve greater levels of accuracy and sophistication. So much so, that even their developers might not ever really understand what characteristics the AI is identifying in the data as ultimately contributing to the desired outcome.
Can you see why I am twitching?
THE (POTENTIALLY) GOOD
The obvious problem is that the assumptions we draw from these data are inherently limited by the quality of the data themselves. So we know that the literature is already hugely biased towards positive studies over null and negative results and towards journal-based STEM over monograph-based AHSS. So the literature is, in this way, already a biased sample of the scholarship it seeks to represent.
But of course these tools aren’t just indexing the metadata but the full text. So the question I have here is whether Natural Language Processing works equally well on language that isn’t ‘natural’ – i.e., where it’s the second language of the author? And what about cultural differences in the language of scholarship, where religious or cultural beliefs make expressions of confidence in the results less certain, less self-aggrandising? And I’ll bet you a pound that there are disciplinary differences in the way that papers are described when being cited.
So we know that scholarship isn’t fully represented by the literature. The literature isn’t fully representative of the scholars. The scholars don’t all write in the same way. And of course, some of these tools are only based on a subset of the literature anyway.
At best, this seems unreliable, at worst, discriminatory?
Who makes the rules?
Of course, you may well argue that this is a problem we already face with bibliometrics, as recently asserted by Robyn Price. I guess my particular challenge with some of these tools is that they go beyond simply making data and their inter-relationships available for human interpretation, to actually making explicit value judgements about those data themselves. And that’s where I think things start getting sticky because someone has to decide what that value (known as the target variable) looks like. And it’s not always clear who is doing it, and how.
If you think about it, being the one who gets to declare what an influential paper looks like, or what a disruptive citation looks like, is quite a powerful position. Oh not right now maybe, when these services are in start-up and some products are in Beta. But eventually, if they get to be used for evaluative purposes, you might end up with the power over someone’s career trajectory. And what qualifies them to make these decisions? Who appointed them? Who do they answer to? Are they representative of the communities they evaluate? And what leverage do the community have over their decisions?
When I queried scite’s CEO, Josh Nicholson, about all this, he confirmed that a) folks were already challenging their definitions of supportive and disruptive citations; b) these challenges were currently being arbitrated by just two individuals; and c) they currently had no independent body (e.g. an ethics committee) overseeing their decision-making – although they were open to this.
And this is where I find myself unexpectedly getting anxious about the birth of free/mium type services based on open citations/text that we’ve all been calling for. Because at least if a commercial product is bad, no-one need buy it, and if you do, as a paying customer you have some* leverage. But I’m not sure if the community will have the same leverage over open products, because, well, they’re free aren’t they? You take them or leave them. And because they’re free, someone, somewhere, will take them. (Think Google Scholar).
*Admittedly not a lot in my experience.
Are the rules right?
Of course, it’s not just who defines our target variable but how they do it, that matters. What exactly are these algorithms being trained to look for when they seek out ‘influential’, ‘supportive’ or ‘disruptive’ citations? And does the end user know that? More pertinently, does the developer know that? Because by definition, AI is trained by examples of what is being sought, rather than by human-written rules around how to find it. (There are some alarming stories about early AI-based cancer detection algorithms getting near 100% hit rates on identifying cancerous cells, before the developers realised that it was taking the presence of a ruler on the training images – used by doctors to detect the size of tumours – as an indicator that this was a cancerous cell.)
I find myself asking if someone else developed an algorithm to make the same judgement, would it make the same judgement? And when companies like scite talk about their precision statistics (0.8, 0.85, and 0.97 for supporting, contradicting, and mentioning, respectively if you’re interested) to what are they comparing their success rates? Because if it’s the human judgement of the developer, I’m not sure we’re any further forward.
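For anyone wondering what a precision statistic actually measures, here is a minimal sketch in Python. It is not scite’s method: the function name, the labels and the toy data are all invented for illustration. The point it makes is that precision is always calculated against a human-labelled reference set, so the figure can only ever be as trustworthy as whoever produced those reference labels.

```python
from collections import Counter


def per_class_precision(reference, predicted):
    """Precision per label: of the citations the classifier assigned to a
    label, the fraction where the human-labelled reference set agrees."""
    assigned = Counter(predicted)
    correct = Counter(p for r, p in zip(reference, predicted) if r == p)
    return {label: correct[label] / assigned[label] for label in assigned}


# Invented toy data: six citation statements, labelled once by a human
# annotator (the reference) and once by a hypothetical classifier.
reference = ["supporting", "mentioning", "contradicting",
             "mentioning", "supporting", "mentioning"]
predicted = ["supporting", "mentioning", "mentioning",
             "mentioning", "supporting", "contradicting"]

scores = per_class_precision(reference, predicted)
for label, score in sorted(scores.items()):
    print(f"{label}: {score:.2f}")
```

Swap in a different annotator’s reference labels and the same classifier would report different precision, which is exactly why it matters who is doing the labelling.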
I also wonder whether these products are in danger of obscuring the fact that papers can be ‘influential’ in ways that are not documented by the citation network, or whether these indicators will become the sole proxy for influence – just as the Journal Impact Factor became the sole proxy for impact? And what role should developers play in highlighting this important point – especially when it’s not really in their interests to do so?
Who do the rules discriminate against?
The reason these algorithms need to be right, as I say, is that researcher careers are at stake. If you’ve only published one paper, and its citing papers are wrongly classified as disputing that paper, this could have a significant impact on your reputation. The reverse is true of course – if you’re lauded as a highly cited academic but all your citations dispute your work, surfacing this would be seen as a service to scholarship.
What I’m not clear on is how much of a risk is the former and whether the risk falls disproportionately on members of particular groups. We’ve established that the scientific system is biased against participation by some groups, and that the literature is biased against representation of some groups. So, if those groups (women, AHSS, Global South, EASL-authors) are under-represented in the training data that identifies what an ‘influential’ paper looks like, or what a ‘supporting’ citation looks like, it seems to me that there’s a pretty strong chance they are going to be further disenfranchised by these systems. This really matters.
I’m pretty confident that any such biases would not be deliberately introduced into these systems, but the fear of course, is that systems which inadvertently discriminate against certain groups might be used to legitimise their deliberate discrimination. One group that are feeling particularly nervous at the moment, with the apparent lack of value placed on their work, are the Arts and Humanities. Citation counting tools already discriminate against these disciplines due to the lack of coverage of their outputs and the relative scarcity of citations in their fields. However, we also know that citations are more likely to be used to dispute than to support a cited work in these fields. I can imagine a scenario where an ignorant third-party seeking evidence to support financial cuts to these disciplines could use the apparently high levels of disputing papers to justify their actions.
But it doesn’t stop here. In their excellent paper, Big Data’s Disparate Impact, Barocas and Selbst discuss the phenomenon of masking, where features used to define a target group (say less influential articles) also define another group with protected characteristics (e.g., sex). And of course, the scenario I envisage is a good example of this, as the Arts & Humanities are dominated by women. Discriminate against one and you discriminate against the other.
The thin end of the wedge.
All this may sound a bit melodramatic at the moment. After all these are pretty fledgling services, and what harm can they possibly do if no-one’s even heard of them? I guess my point is that the Journal Impact Factor and the h-index were also fledgling once. And if we’d taken the time as a community to think through the possible implications of these developments at the outset, then we might not be in the position we are in now, trying to extract each mention of the JIF and the h-index from the policies, practices and psyches of every living academic.
Indeed, the misuse of the JIF is particularly pertinent to these cases. Because this was a ‘technology’ designed with good intentions – to help identify journals for inclusion in the Science Citation Index – just as scite and Semantic Scholar are designed to aid discovery and citation sentiment. But it was a very small step between the development of that technology and its ultimate use for evaluation purposes. We just can’t help ourselves. And we are naïve to think that just because a tool was designed for one purpose, that it won’t be used for another.
This is why the INORMS SCOPE model insists that evaluation approaches ‘Probe deeply’ for unintended consequences, gaming possibilities and discriminatory effects. It’s critical. And it’s so easy to gloss over when we as evaluation ‘designers’ know that our intentions are good. I’ve heard that scite are now moving on to provide supporting and disputing citation counts for journals, which we’ll no doubt see on journal marketing materials soon. How long before these citations start getting aggregated at the level of the individual?
Of course, the other thing that AI is frequently used for, once it has been trained to accurately identify a target variable, is to then go on to predict where that variable might occur in future. Indeed we are already starting to see this with AI-driven tools like Meta Bibliometric Intelligence and UNSILO Evaluate, where they are using the citation graph to predict which papers may go on to be highly cited and therefore a good choice for a particular journal. To me, this is hugely problematic and a further example of the Matthew Effect seeking to reward science that looks like existing science rather than ground-breaking new topics, written by previously unknowns. Do AI-based discovery and evaluation tools have the potential to go the same way, predicting based on past performance, the more influential scholars of the future?
I don’t want to be a hand-wringing nay-sayer, like an old horse-and-cart driver declaring the automobile the end of all that is holy. But I’m not alone in my handwringing. Big AI developer, DeepMind, are taking this all very seriously. A key element of their work is around Ethics & Society including a pledge to use their technologies for good. They were one of the co-founders of the Partnership on AI initiative where those involved in developing AI have an open discussion forum, including members of the public, around the potential impacts of AI and how to ensure they have positive effects. The Edinburgh Futures Institute have identified Data & AI Ethics as a key concern and are running free short courses in Data Ethics, AI & Responsible Research & Innovation. There are also initiatives such as Explainable AI which recognise the need for humans to understand the process and outcomes of AI developments.
I’ve no doubt that AI can do enormous good in the world, and equally in the world of information discovery and evaluation. I feel we just need to have conversations now about how we want this to pan out, fully cognisant of how it might pan out if left unsupervised. It strikes me that we might do well to develop a community agreed voluntary Code of Practice for working with AI and citation data. This would ensure that we get to extract all the benefits from these new technologies without finding them being over-relied upon for inappropriate purposes. And whilst such services are still in their infancy I think it might be a good time to have this conversation. What do you think?
I am grateful to Rachel Miles, Josh Nicholson, David Pride for conversations and input to this piece, and especially thankful to Aaron Tay who indulged in a long and helpful exchange that made this a much better offering.
Elizabeth Gadd is the Research Policy Manager (Publications) at Loughborough University. She is the chair of the Lis-Bibliometrics Forum and co-Champions the ARMA Research Evaluation Special Interest Group. She also chairs the INORMS International Research Evaluation Working Group.
This blog post by Lizzie Gadd was first published on the WonkHE Blog on 2 July 2020.
Among all the recent research-related news, we now know that UK universities will be making their submissions to the Research Excellence Framework on 31 March 2021.
And a series of proposals are in place to mitigate against the worst effects of COVID-19 on research productivity. This has led to lots of huffing and puffing from research administrators about the additional burden and another round of ‘What’s the point?’ Tweets from exasperated academics. And it has led me to reflect dreamily again about alternatives to the REF and whether there could be a better way. Something that UKRI are already starting to think about.
One of the research evaluation approaches I’ve often admired is that of the Dutch Standard Evaluation Protocol (SEP). So when I saw that the Dutch had published the next iteration of their national research evaluation guidance, I was eager to take a look. Are there lessons here for the UK research community?
I think so.
The first thing to say of course, is that unlike REF, the Dutch system is not linked to funding. This makes a huge difference. And the resulting freedom from feeling like one false move could plummet your institution into financial and reputational ruin is devoutly to be wished. There have been many claims – particularly at the advent of COVID-19 – that the REF should be abandoned and some kind of FTE-based or citation-based alternative used to distribute funds. Of course the argument was quickly made that REF is not just about gold, it’s about glory, and many other things besides. Now I’m no expert on research funding, and this piece is not primarily about that. But I can’t help thinking, what if REF WAS just about gold? What if it was just a functional mechanism for distributing research funds and the other purposes of REF (of which there are five) were dealt with in another way? It seems to me that this might be to everybody’s advantage.
And the immediate way the advantage would be felt perhaps, would be through a reduction in the volume and weight of guidance. The SEP is only 46 pages long (including appendices) and, perhaps with a nod to their general levity about the whole thing, is decorated with flowers and watering cans. The REF guidance on the other hand, runs to 260 pages. (124 pages for the Guidance on Submissions plus a further 108 pages for the Panel Criteria and Working methods and 28 pages for the Code of Practice – much of which cross-refers and overlaps).
And if that’s not enough to send research administrators into raptures, the SEP was published one year prior to the start of the assessment period. Compare this to the REF where the first iteration of the Guidance on Submissions was published five years into the assessment period, and where fortnightly guidance in the form of FAQs continues to be published, and where we are still yet to receive some of it months before the deadline.
Of course, I understand why the production of REF guidance is such an industry: it’s because they are enormously consultative, and they are enormously consultative because they want to get it right, and they want to get it right because there is a cash prize. And that, I guess, is my point.
But it’s not just the length of course, it’s the content. If you want to read more about the SEP, you can check out their guidance here. It won’t take you long – did I say it’s only 46 pages? But in a nutshell: SEP runs on a six-yearly cycle and seeks to evaluate research units in light of their own aims to show they are worthy of public funding and to help them do research better. It asks them to complete a self-evaluation that reflects on past performance as well as future strategy, supported by evidence of their choosing. An independent assessment committee then performs a site visit and has a conversation with the unit about their performance and plans, and provides recommendations. That’s it.
Measure by mission
The thing I love most about the new SEP is that whilst the ‘S’ used to stand for ‘Standard’, it now stands for ‘Strategy’. So unlike REF where everyone is held to the same standard (we are all expected to care 60% about our outputs, 15% about our research environment and 25% about real-world impact), the SEP seeks to assess units in accordance with their own research priorities and goals. It recognises that universities are unique and accepts that whilst we all love to benchmark, no two HEIs are truly comparable. All good research evaluation guidance begs evaluators to start with the mission and values of the entity under assessment. The SEP makes good on this.
And of course the benefit of mission-led evaluation is that it takes all the competition out of it. There are no university-level SEP League tables, for example, because they seem to have grasped that you can’t rank apples and pears. If we really prize a diverse ecosystem of higher education institutions, why on earth are we measuring them all with the same template?
Realistic units of assessment
In fact, I’m using the term ‘institutions’ but unlike the REF, the SEP at no time seeks to assess at institutional level. They seek only to assess research at the level that it is performed: the research unit. And the SEP rules are very clear that “the research unit should be known as an entity in its own right both within and outside of the institution, with its own clearly defined aims and strategy.”
So no more shoe-horning folks from across the university into units with other folks they’ve probably never even met, and attempting to create a good narrative about their joined-up contribution, simply because you want to avoid tipping an existing unit into the next Impact Case Study threshold. (You know what I’m talking about). These are meaningful units of assessment and the outcomes can be usefully applied to, and owned by, those units.
Evaluate with the evaluated
And ownership is so important when it comes to assessment. One of the big issues with the REF is that academics feel like the evaluation is done to them, rather than with them. They feel like the rules are made up a long way from their door, and then taken and wielded sledge-hammer-like by “the University”, AKA the poor sods in some professional service whose job it is to make the submission in order to keep the research lights on for the unsurprisingly ungrateful academic cohort. It doesn’t make for an easy relationship between research administrators and research practitioners.
Imagine then if we could say to academic staff, we’re not going to evaluate you any more, you’re going to evaluate yourselves. Here’s the guidance (only 46 pages – did I say?) off you go. Imagine the ownership you’d engender. Imagine the deep wells of intrinsic motivation you’d be drawing on. Indeed, motivational theory tells us that intrinsic motivation eats extrinsic motivation for breakfast. And that humans are only ever really motivated by three things: autonomy, belonging and competence. To my mind, the SEP taps into them all:
Autonomy: you set your own goals, you choose your own indicators, and you self-assess. Yes, there’s some guidance, but it’s a framework and not a straight-jacket and if you want to go off-piste, go right ahead. Yes, you’ll need to answer for your choices, but they are still your choices.
Belonging: the research unit being assessed is the one to which you truly belong. You want it to do well because you are a part of this group. Its success and its future is your success and your future.
Competence: You are the expert on you and we trust that you’re competent enough to assess your own performance, to choose your own reviewers, and to act on the outcomes.
The truth will set you free
One of the great benefits of being able to discuss your progress and plans in private, face-to-face, with a group of independent experts that you have a hand in choosing, is that you can be honest. Indeed, Sweden’s Sigridur Beck from Gothenburg University confirmed this when talking about their institution-led research assessment at a recent E-ARMA webinar. She accepted that getting buy-in from academics was a challenge when there was nothing to win, but that they were far more likely to be honest about their weaknesses when there was nothing to lose. And of course, with the SEP you have to come literally face-to-face with your assessors (and they can choose to interview whoever they like) so there really is nowhere to hide.
The problem with REF is that so much is at stake it forces institutions to put their best face on, to create environment and impact narratives that may or may not reflect reality. It doesn’t engender cold, hard, critical self-assessment which is the basis for all growth. With REF you have to spin it to win it. And it’s not just institutions that feel this way. I’ve lost count of the number of times I’ve heard it said that REF UoA panels are unlikely to score too harshly as it will ultimately reflect badly on the state of their discipline. This concerns me. Papering over the cracks is surely never a good building technique?
Formative not summative
Of course the biggest win from a SEP-style process rather than a REF-style one is that you end up with a forward-looking report and not a backward-looking score. It’s often struck me as ironic that the REF prides itself on being “a process of expert review” but actually leaves institutions with nothing more than a spreadsheet full of numbers and about three lines of written commentary. Peer review in, scores out. And whilst scores might motivate improvement, they give the assessed absolutely zero guidance as to how to make that improvement. It’s summative, not formative.
The SEP feels truer to itself: expert peer review in, expert peer review out. And not only that but “The result of the assessment must be a text that outlines in clear language and in a robust manner the reflections of the committee both on positive issues and – very distinctly, yet constructively – on weaknesses” with “sharp, discerning texts and clear arguments”. Bliss.
Proof of the pudding
I could go on about the way the SEP insists on having ECRs and PhD students on the assessment committee; and about the way units have to state how they’re addressing important policy areas like academic culture and open research; and the fact that viability is one of the three main pillars of their approach. But you’ll just have to read the 46-page guidance.
The proof of the pudding, of course, is in the eating. So how is this loosey-goosey, touchy feely approach to research evaluation actually serving our laid-back low-country neighbours?
Pretty well actually.
The efficiency of research funding in the Netherlands is top drawer. And whichever way you cut the citation data, the Netherlands significantly outperforms the UK. According to SciVal, research authored by those in the Netherlands (2017-2019) achieved a Field Weighted Citation Impact of 1.76 (where 1 is world average). The UK comes in at 1.55. And as far as I can see, the only countries that can hold a candle to them are Denmark, Sweden and Switzerland – none of which have a national research assessment system.
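If the Field Weighted Citation Impact is new to you, the idea can be sketched in a few lines of Python. This is a simplified illustration and not SciVal’s implementation: each paper’s citation count is divided by the citations expected for papers of the same field, document type and publication year, and the ratios are then averaged. The function name and the numbers below are invented.

```python
def field_weighted_citation_impact(papers):
    """Mean of actual/expected citation ratios; 1.0 means world average."""
    ratios = [paper["citations"] / paper["expected"] for paper in papers]
    return sum(ratios) / len(ratios)


# Invented toy portfolio. "expected" stands in for the world-average
# citations of comparable papers (same field, type and year).
papers = [
    {"citations": 12, "expected": 8.0},   # ratio 1.5, above world average
    {"citations": 3, "expected": 6.0},    # ratio 0.5, below world average
    {"citations": 10, "expected": 10.0},  # ratio 1.0, at world average
]

fwci = field_weighted_citation_impact(papers)
print(f"FWCI: {fwci:.2f}")
```

Read this way, a score of 1.76 means the research was cited 76% more than the world average for comparable work.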
It seems to me that we have so much to gain from adopting a SEP-style approach to research evaluation. In a post-COVID-19 world there is going to be little point looking back at this time in our research lives and expecting it to compare in any way with what’s gone before. It’s time to pay a lot less attention to judging our historical performance, and start thinking creatively about how we position ourselves for future performance.
We need to stop locking our experts up in dimly lit rooms scoring documentation. We need to get them out into our universities to meet with our people, to engage with our challenges, to breathe our research air, and to collectively help us all to be the best that we can be – whatever ‘best’ may look like for us. I believe that this sort of approach would not only dramatically reduce the burden (I’m not sure if I said, but the SEP is only 46 pages long), but it would significantly increase buy-in and result in properly context-sensitive evaluations and clear road-maps for ever-stronger research-led institutions in the future.
Frankly, I don’t want to come out of REF 2027 with another bloody spreadsheet, I want us to come out energised having engaged with the best in our fields, and positioned for the next six years of world-changing research activity.