Lack of backups threatens the scholarly record

A quarter of online journal articles are not digitally archived, warns Martin Eve

The past 20 years have seen academic journals move almost entirely from print to digital. It is a change that has enhanced the reach of academic research and spurred the open-access movement to demand the full, free delivery of intellectual work to interested readers.

However, this digital conversion has also come with a challenge for longevity. The internet is, after all, a paradoxical space where things disappear and links degrade, even while it may be impossible to remove something embarrassing or untrue.

We certainly have the technologies to preserve digital materials, even if their security has never been tested in anger.

The options include specialist digital archives such as Lockss (Lots of Copies Keep Stuff Safe), Clockss (Controlled Lockss) and Portico, each of which is run by an alliance of research libraries and publishers. They are known as dark archives, which release their content only when the original disappears.

Alongside these, the Internet Archive continues its venerable efforts to preserve digital culture, while digital legal deposit at the British Library and other institutions offers comfort.

How much is actually saved, though, has been unknown. In my role as principal R&D developer at the digital infrastructure organisation Crossref, I recently assessed the state of preservation in digital journals. The findings are alarming.

I analysed almost 7.5 million randomly chosen journal articles and ascertained where they were preserved. While 58 per cent were in at least one of the eight archives on which I focused, almost 28 per cent appeared to have no preservation at all.

This seriously jeopardises long-term access for a significant proportion of the scholarly record. Indeed, the Digital Object Identifier system, intended to provide a permanent link for scholarly material, relies on publishers retaining archival copies, to which a DOI can be redirected if the publisher fails.

Shifting responsibility

Part of the problem seems to stem from a confusion. In the print era, preservation was the responsibility of collecting libraries. At that time, preservation meant keeping materials in a usable condition through temperature- and humidity-controlled archives.

In the digital age, this responsibility has shifted to publishers, but not all of them seem to realise this. Indeed, Crossref’s terms of membership include a commitment for publishers to make best efforts to preserve registered material in a third-party archive.

Hence, as part of this research, I sought to understand what distinguishes the publishers doing a good job of digital preservation from those that are lax in their duties. The results were perhaps unsurprising.

The largest and wealthiest publishers—Elsevier, Informa, Wiley and so forth—tend to have robust preservation, with almost all of their material deposited in three or more dark archives.

Meanwhile, smaller publishers with less revenue tend to have less robust preservation cultures. This is not universally true; some smaller, even scholar-led, publishers do an admirable job of digital preservation. On the whole, though, much more of this material is at risk.

This suggests that the uneven distribution of wealth in the scholarly publishing industry, with a few very wealthy publishers, many impoverished ones and few in between, is placing material at risk of disappearing.

I also looked at whether publishers based in different places performed differently, but the available metadata on location and country did not lead to any solid conclusions. The size of a publisher, judged by revenue, is a far better indicator of preservation procedures than where it is based.

Educating publishers

This analysis is not a complete picture. I used only a subset of archives, only tracked articles with DOIs and did not investigate institutional repositories. Nonetheless, as an initial attempt to gauge the landscape, the results are a canary in the coalmine. And the canary is definitely dead.

Where do we go from here? Clearly, much more education is needed among publishers on the important matter of digital preservation. If a footnote cites an item that nobody can read because it has gone offline forever, then the entire epistemology of contemporary research is at risk.

We also need, though, to think about the resourcing for digital preservation and find ways to level the playing field.

There is, of course, always a cost in preserving material indefinitely. The uneven revenue distribution in scholarly communications should not be a barrier to the robust, long-term sustainability of digital material. Crossref is planning a project to help less well-resourced publishers safeguard more of the scholarly record.

Martin Eve is principal R&D developer at Crossref and professor of literature, technology and publishing at Birkbeck, University of London

This article also appeared in Research Fortnight