The challenge of measuring open research data

This post was originally published on The Bibliomagician Blog on 24 March 2021

Lizzie Gadd & Gareth Cole discuss the practical challenges of monitoring progress towards institutional open research data ambitions.

Loughborough University has recently introduced a new Open Research Position Statement which sets out some clear ambitions for open access, open data and open methods. As part of this work we’re looking at how we can monitor our progress against those ambitions. Of course, for open access, we’re all over it. We have to count how many of our outputs are available openly in accordance with various funder policies anyway. But there are no equivalent demands for data. OK, all data resulting from some funded projects need to be made available openly, but no-one’s really counting – yet. And anyway, our ambitions go beyond that – we’d like to encourage all data to be made available openly and ‘FAIRly’ where possible.

So how do we measure that? Well, with difficulty it would seem. And here’s why: 

Equivalence

In the world of journal articles, although there are disciplinary differences as to the number of articles that are produced, every article is roughly equivalent in size and significance to another. Data are not like that. What qualifies as a single unit of data, thus receiving its own metadata record, might be a photograph or a five-terabyte dataset. So it would be a bit unfair to compare the volume of these. And there is currently no agreement as to ‘how much data’ (in size, effort, or complexity) there needs to be to qualify for a unique identifier.

But it’s not just what counts as a unit of data that differs; it’s also what counts as a citable unit of data. A deposit of twenty files could have one DOI/identifier or twenty DOIs depending on how it is split up. This means there could potentially be citation advantages or disadvantages for those who deposit their data in aggregate rather than individually, although this would depend entirely on how the citer chooses to cite it.

Source

For journal articles, full-text versions are duplicated all over the place. The same article might be available on multiple repositories, pre-print servers and the publisher’s site. In fact, whilst there are concerns about version control, there are many benefits to such duplicates in terms of discovery and archiving (Lots of Copies Keeps Stuff Safe [LOCKSS] and all that). But for data, it’s not good practice to deposit duplicates in different repositories. This is both for versioning reasons (if you update the data in one place, you then have to update them everywhere else) and for DOI reasons (two instances usually means two DOIs, so any data citations you get will be split across two sources).

So if we wanted to identify all the data produced by Loughborough academics, we’d have a pretty difficult job doing it. Some will be on our repository, but other data will be spread across many different archives. Data citation and linking services such as DataCite and Scholix may ultimately offer a solution here, of course, but as others have noted, these have a long way to go before they are truly useful for monitoring purposes. DataCite only indexes data with a DataCite DOI. And Scholix only surfaces links between articles and datasets, not the independent existence of data. Some services, such as Dimensions, only index records that have an item type of “Dataset”. This means that data taking other forms, such as “media” or “figure”, won’t appear in Dimensions, thus disenfranchising those researchers who produce “data” but not “datasets”.
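
To make this concrete, here is a rough sketch of the sort of query an institution might run against DataCite’s public REST API to find datasets claiming a given affiliation. The endpoint and parameters reflect our reading of the public API documentation rather than anything we run in production, and the limitations above apply: anything without a DataCite DOI, or typed as “Figure” or “Media” rather than “Dataset”, simply won’t come back.

```python
# A rough sketch, not production code: query DataCite's public REST API for
# DOIs that look like datasets from a given institution. The endpoint and
# parameters reflect our reading of the public API docs; anything without a
# DataCite DOI, or registered as "Figure"/"Media" rather than "Dataset",
# will simply not appear in the results.
import requests

DATACITE_API = "https://api.datacite.org/dois"

def find_datasets(affiliation: str, page_size: int = 100) -> list[dict]:
    """Return basic metadata for dataset DOIs whose creators claim the given affiliation."""
    params = {
        "query": f'creators.affiliation.name:"{affiliation}"',
        "resource-type-id": "dataset",  # excludes item types such as "figure" or "media"
        "page[size]": page_size,
    }
    response = requests.get(DATACITE_API, params=params, timeout=30)
    response.raise_for_status()
    return [
        {
            "doi": record["id"],
            "title": (record["attributes"].get("titles") or [{}])[0].get("title"),
            "publisher": record["attributes"].get("publisher"),
            "year": record["attributes"].get("publicationYear"),
        }
        for record in response.json().get("data", [])
    ]

if __name__ == "__main__":
    for item in find_datasets("Loughborough University"):
        print(item["doi"], "-", item["title"])
```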

But the biggest problem these services face is that they rely on consistent high-quality metadata collection, curation and sharing by all the different data repositories. And we’re just not there yet. (Although all repositories which mint DataCite DOIs will need to comply with the minimum/mandatory requirements to mint the DOI in the first place). And a particular problem for institutions seeking to measure their own dispersed data output is that few repositories expose author affiliation data even where they do comprehensively collect it. And this leads us on to our third point.
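
As a small illustration of the scale of that gap, one could audit whatever records can be harvested and simply count how many expose a creator affiliation at all. The sketch below assumes the raw record objects returned by the DataCite API call above (the response’s `data` list, before any simplification); it is a completeness check on the metadata, not a quality check.

```python
# A minimal sketch: given raw records in the JSON shape returned by the
# DataCite API above, count what fraction expose any creator affiliation.
# This is a completeness check on the metadata, not a quality check.
def affiliation_coverage(records: list[dict]) -> float:
    """Fraction of records where at least one creator carries an affiliation."""
    if not records:
        return 0.0
    with_affiliation = sum(
        1
        for record in records
        if any(
            creator.get("affiliation")
            for creator in record.get("attributes", {}).get("creators", [])
        )
    )
    return with_affiliation / len(records)
```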

Authorship

The authorship of journal articles is increasingly controlled and subject to much guidance. Learned societies provide guidance, journals provide guidance, institutions sometimes have their own guidance. The CRediT taxonomy (whilst not without problems) was introduced to make it absolutely explicit as to who did what on a journal article. The same is not usually true of data.

Of course, data are created rather than authored, as the DataCite schema makes clear. But there is no way of ensuring that all the data creators have been added to the metadata record, even supposing the depositors always know who they are. And whilst there is no real glory associated with data ownership, this problem isn’t going to be resolved quickly. As with journal articles, the list of contributors is often very long, so there needs to be some incentive to record it carefully and well.
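
For what it’s worth, the DataCite schema does at least provide a place to record this: creators are mandatory, and optional contributors can be typed with roles such as DataCollector or DataCurator. The fragment below is a purely hypothetical illustration of those fields, not a real deposit.

```python
# A purely hypothetical DataCite-style record fragment (illustrative names and
# a placeholder ORCID), showing the schema's distinction between mandatory
# "creators" and optional, typed "contributors".
example_record = {
    "creators": [
        {
            "name": "Researcher, A.",
            "nameIdentifiers": [
                {
                    "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",
                    "nameIdentifierScheme": "ORCID",
                }
            ],
        }
    ],
    "contributors": [
        {"name": "Assistant, B.", "contributorType": "DataCollector"},
        {"name": "Curator, C.", "contributorType": "DataCurator"},
    ],
}
```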

This is where we butt up against Professor James Frew’s two laws of metadata:

  1. Scientists don’t write metadata;
  2. Any scientist can be forced to write bad metadata.

There seems to be scope for some CRediT-type contributor guidance for data to ensure all the right people get the right credit. (Maybe a Research Data Alliance Working Group?) And then there needs to be some motivation for depositors to stick to it.

Quality assurance

Although the standard of journal peer review is variable and hotly contested as a mechanism to signify quality, at least all journal articles are subject to some form of it prior to publication. Data are not currently peer reviewed (unless submitted as a data paper, or where the dataset is provided as supplementary information to a journal submission). And although data can be cited, this still appears to be comparatively rare. This may partly be due to the challenge of ‘counting’ data citations, given huge variations in citation quality, in whether data are formally cited at all (or merely listed in a data availability statement), and in disciplinary practice. And there is a big difference between a data citation which simply states ‘this data is available and relevant to my study’ and one which signifies that ‘the data in question has actually been re-analysed or repurposed in my study’. But a data citation doesn’t currently differentiate between the two.

Of course, the accepted standard for data quality is the FAIR principles: data should be Findable, Accessible, Interoperable and Reusable. But despite many investigations into the best way of assessing how FAIR data are (see https://fairsharing.org/, https://www.fairsfair.eu/ and https://github.com/FAIRMetrics/Metrics), the average institutional data repository has no easy way of quickly identifying this.
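
To illustrate why this is hard to automate, the sketch below applies a deliberately crude checklist to a single, hypothetical repository record. It is not an implementation of any published FAIR metric (see the links above for those); it simply shows the kinds of fields an automated assessment would need to find, and how quickly ‘FAIRness’ collapses into judgement calls.

```python
# A deliberately crude, FAIR-ish checklist over a single hypothetical repository
# record. This is NOT an implementation of any published FAIR metric; it only
# illustrates the kinds of fields an automated assessment would need to find.
def crude_fair_check(record: dict) -> dict[str, bool]:
    return {
        # Findable: a persistent identifier and a title
        "findable": bool(record.get("doi")) and bool(record.get("title")),
        # Accessible: a resolvable landing page or download URL
        "accessible": bool(record.get("url")),
        # Interoperable: at least some files in open, documented formats (assumed list)
        "interoperable": any(
            name.lower().endswith((".csv", ".json", ".xml", ".txt"))
            for name in record.get("files", [])
        ),
        # Reusable: an explicit licence and some descriptive documentation
        "reusable": bool(record.get("licence")) and bool(record.get("description")),
    }

# Hypothetical example record:
print(crude_fair_check({
    "doi": "10.1234/example",
    "title": "Example dataset",
    "url": "https://repository.example.ac.uk/record/1",
    "files": ["readings.csv", "codebook.txt"],
    "licence": "CC-BY-4.0",
    "description": "Sensor readings collected during ...",
}))
```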

There is also the challenge that FAIR data may not be open data, and vice versa. Some data can never be open data for confidentiality reasons. So in our attempt to pursue open research practices, and given a choice, what do we count? Open data that may not be FAIR? FAIR data that may not be open? Or only data that are both? And if so, how fair is that?

Summary

So where does this leave us at Loughborough? Well, in a less than satisfactory situation, to be honest. We could look at the number of data deposits (or depositors) in our Research Repository per School over time to give us an idea of growth. But this will only give us a very partial picture. We could do a similar count of research projects on Researchfish with associated data, or e-theses with related data records, but again, this will only give us a small window onto our research data activity. Going forward we might look at engagement with the data management planning tool DMPonline over time, but again this is likely to shine more light on disciplines that have to provide DMPs as part of funding applications and PhD studies.
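
For completeness, the bookkeeping side of that first option is trivial; the sketch below assumes a hypothetical CSV export from the repository with one row per deposit and ‘school’ and ‘year’ columns, and simply tallies growth over time. The hard part, as the rest of this post argues, is everything such counts leave out.

```python
# A minimal sketch, assuming a hypothetical CSV export from the repository with
# one row per deposit and "school" and "year" columns. It tallies deposits per
# School per year; everything such a count leaves out is discussed above.
import csv
from collections import Counter

def deposits_per_school_per_year(csv_path: str) -> Counter:
    counts: Counter = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            counts[(row["school"], row["year"])] += 1
    return counts

if __name__ == "__main__":
    for (school, year), n in sorted(deposits_per_school_per_year("deposits.csv").items()):
        print(f"{school}\t{year}\t{n}")
```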

So, whilst we can encourage individuals to deposit data, and require narrative descriptions of their engagement with this important practice for annual appraisals, promotions and recruitment, we have no meaningful way of monitoring this engagement at department or University level. And as for benchmarking our engagement against what is happening elsewhere, that currently feels a very long way off.

The big fear, of course, is that this is where commercial players rock up and offer to do this all for us – for a price. (Already happening.) And given that data are a research output not currently controlled by such outfits, it would be a very great shame to have to pay someone else to tell us about the activities of our own staff. In our view.

We’re really hoping that some clever, community-minded data management folk are able to help with this.
