Digital decay and the archival cloud

Throughout human history, the documentation of events and thoughts usually required a good deal of time and effort. Somebody had to sit down with a stylus or a pen or, later, a typewriter or a tape recorder, and make a deliberate recording. That happened only rarely. Most events and thoughts vanished from memory, individual and collective, soon after they occurred. If they were described or discussed at all, it was usually in conversation, face to face or over a phone line, and the words evaporated as they were spoken.

That’s all changed now. Thanks to digital tools, media, and networks, recording is easy, cheap, and often automatic. Hard drives, flash drives, CDs, DVDs, and other storage devices brim with audio, video, photographic, and textual recordings. Evidence of even the most trivial of events and thoughts, communicated through texts, posts, status updates, and tweets, is retained in the data centers of the companies that operate popular Internet sites and services.

We live, it seems, in a golden age of documentation. But that’s not quite true. The problem with making a task cheap and effortless is that the results of that task come to be taken for granted. You care about work that’s difficult and expensive, and you want to preserve its product; you don’t pay much attention to the things that happen automatically and at little or no cost. In “Avoiding a Digital Dark Age,” an article appearing in the new edition of American Scientist, Kurt Bollacker, of the Long Now Foundation, expertly describes the conundrum of digital recording: everything’s documented, but the documents don’t last. The problem stems from the fact that, with digital recordings, we have to preserve not only the data itself but also the devices and techniques used to read the data and output it in a form we can understand. As Bollacker writes:

With most analog technologies such as photographic prints and paper text documents, one can look directly at the medium to access the information. With all digital media, a machine and software are required to read and translate the data into a human-observable and comprehensible form. If the machine or software is lost, the data are likely to be unavailable or, effectively, lost as well.

The problem is magnified by the speed with which old digital media and recording techniques, including devices and software, are replaced by new ones. It’s further magnified by the fact that even modest damage to a digital recording can render that recording useless (as anyone who has scratched a CD or DVD knows). In contrast, damage to an analog recording – a scratch in a vinyl record, a torn page in a book – may be troublesome and annoying, but it rarely renders the recording useless. You can still listen to a scratched record, and you can still read a book with a missing page. Analog recordings are generally more robust than digital ones. As Bollacker explains, history reveals a clear and continuing trend: “new media types tend to have shorter lifespans than older ones, and digital types have shorter lifespans than analog ones.” The lifespan of a stone tablet was measured in centuries or millennia; the lifespan of a magnetic tape or a hard drive is measured in years or, if you’re very lucky, decades.

After describing the problem, Bollacker goes on to provide a series of suggestions for how digital recordings could be made more robust. The suggestions include applying better error correction algorithms when recording data and being more thoughtful about the digital formats and recording techniques we use. None of the recommendations would be particularly difficult to carry out. What’s required more than anything else is that people come to care about the problem. Apathy remains the biggest challenge in combating digital decay.
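To make the idea of error correction concrete, here is a minimal sketch of the simplest possible scheme, a repetition code with majority voting. It is an illustration only, not one of Bollacker’s specific recommendations; the function names `encode` and `decode` are hypothetical, and real storage systems use far more space-efficient codes.

```python
from collections import Counter


def encode(data: bytes, copies: int = 3) -> list[bytes]:
    """Store the same payload several times (a simple repetition code)."""
    return [bytes(data) for _ in range(copies)]


def decode(stored: list[bytes]) -> bytes:
    """Recover each byte by majority vote across the stored copies.

    Assumes every copy has the same length, i.e. damage flips bytes
    but does not insert or delete them.
    """
    recovered = bytearray()
    for column in zip(*stored):
        # Pick the byte value that appears in the most copies.
        value, _count = Counter(column).most_common(1)[0]
        recovered.append(value)
    return bytes(recovered)


# Usage: damage one byte in one of three copies, then recover the original.
stored = encode(b"archive")
stored[1] = b"arXhive"
restored = decode(stored)
```

With three copies, the vote recovers any byte that is damaged in only one of them; damage to the same position in two copies would defeat it.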

But there’s a new wrinkle to this story, and it’s one that Bollacker doesn’t address in his article: the cloud. Up to now, there has been one characteristic of digital recordings that has provided an important counterweight to the fragility of digital media – it’s what Bollacker refers to as “data promiscuity.” Because it’s easy to make copies of digital files, we’ve tended to make a lot of them. The proliferation of perfect digital copies has provided an important safeguard against the loss of data. An MP3 of even a moderately popular song will, for instance, exist on many thousands of computer hard drives as well as on many thousands of iPods, CDs, and other media. The more copies that are made of a recording, and the more widely the copies are dispersed, the more durable that recording becomes.
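The arithmetic behind that last claim can be sketched in a few lines. The sketch assumes, purely for illustration, that each copy is lost independently with the same probability over some period; the function name `loss_probability` and the figures used are hypothetical, not measurements.

```python
def loss_probability(p_single: float, copies: int) -> float:
    """Chance that every copy is lost, assuming independent failures.

    p_single: assumed probability that any one copy is lost.
    copies:   number of copies in existence (at least 1).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single ** copies


# Usage: with an assumed 50% chance of losing any single copy,
# compare one copy against three.
one_copy = loss_probability(0.5, 1)
three_copies = loss_probability(0.5, 3)
```

Under the independence assumption, each additional copy multiplies the chance of total loss by the single-copy loss probability. The assumption holds best when copies are widely dispersed; copies that share a building, an owner, or a format can fail together.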

By centralizing the storage of digital information, cloud computing promises to dramatically reduce data promiscuity. When all of us are able to, in effect, share a copy of a digital file, whether a song or a video or a book, then we don’t need to make our own copies of that file. Cloud computing replaces the download with the stream, and that means that, as people come to use the cloud as their default data store, we’ll have fewer copies of files and hence less of the protection that multiple copies provide. Indeed, in the ultimate form of cloud computing, you’d need only a single copy of any digital recording.

Apple’s new iPad, which arrived with much fanfare over the weekend, provides a good example of where computing is heading. The iPad is much more of a player than a recorder. It has a much smaller storage capacity than traditional desktops and laptops, because it’s designed on the assumption that more and more of what we do with computers will involve streaming data over the Net rather than storing it on our devices. The iPad manifests a large and rapidly accelerating trend away from local, redundant storage and toward central storage. In fact, I’d bet that if you charted the average disk size of personal computers, including smartphones, netbooks and tablets as well as laptops and desktops, you would discover that in recent years it has shrunk, marking a sea change in the history of personal computing. An enormous amount of digital copying and local storage still goes on, of course, but the trend is clear. Streaming will continue to replace downloading, and the number of copies of digital recordings will decline.

The big cloud computing companies take the safeguarding of data very seriously, of course. For them, loss of data means loss of business, and catastrophic data loss means catastrophic business loss. A company like Google stores copies of its files in many locations, and it takes elaborate steps to protect its data centers and systems. Nevertheless, one can envision a future scenario (unlikely but not impossible) involving a catastrophe – natural, technological, political, or even commercial – that obliterates a cloud operator’s store of data. More prosaically, companies go out of business, change hands, and alter their strategies and priorities. They may not always care that much about data that once seemed very important, particularly data that has lost its commercial value. A business exists to make money, not to run an archive in perpetuity. Seen in this light, our embrace of the cloud may have the unintended effect of making digital recordings even more fragile, especially over the long run.

As digital recordings displace physical ones, the risks expand. Think about books. Google’s effort to scan every physical book ever published into its database has been compared to the creation of the great library of Alexandria. Should Google (or another organization) succeed in creating an easy-to-use, universally available store of digital books, we might well become dependent on that store – and take it for granted. We would stream books as we today stream videos. In time, we would find fewer and fewer reasons to maintain our own digital copies of books inside our devices; we would keep our e-books in the cloud. We would also find it increasingly hard to justify the cost of keeping physical copies of books, particularly old ones, on shelves, either in our homes or in libraries.

At that point, if we hadn’t been very, very careful in how we developed and maintained our great cloud library, we would be left with few safeguards in the event that, for whatever foreseeable or unforeseeable reason, that library was compromised or ceased to function. We all know what happened to the library of Alexandria.

11 Responses to Digital decay and the archival cloud

  1. Michael Verni

    An interesting perspective, as always.

    One thing that crossed my mind while reading: if there are thousands of copies of a book (or any other media) distributed all over the world, any intrusive later editing requires serious effort.

    On the other hand, if everybody is reading a streaming, temporary version of a single centralized book, a mere cloud cache that will disappear from the client in a week or a month – who’s to say that 10 years later the master copy won’t silently start saying we’re at war with Eurasia?

  2. Thanks for the thought-provoking read… One of the biggest concerns for me with regard to streaming replacing downloading is the small question of who ends up ‘owning’ the content. Google doesn’t own a lot of the content that comes up in their search results (although they may want to), but they do have control over what content you see first and as a result they have the power to influence a lot of people. Doesn’t this same danger exist if we talk about creating the ultimate cloud? What if that library was not compromised and didn’t cease to function, but rather we were held to ransom for it? Do you see the world of streaming operating in the same way the traditional web operates? Will I still browse whatever I want, but just stream it and not download it?

    Apologies if my questions seem obvious, but it’s bugging me enough now that I need to ask.

  3. You remind me of a famous quote from Linus Torvalds, “Real men don’t use backups, they post their stuff on a public ftp server and let the rest of the world make copies.” I always thought that was a very poor solution.

    Your essay hints at one of my pet peeves: the archival problems of the digital world are creeping into analog media, as digital techniques of production are being adapted to those media. You mention photography and that is the issue at hand. Current inkjet printing methods aim for the “archivality” (permanence) of conventional photo processes. Unfortunately, these methods use such infinitesimal amounts of pigment compared to conventional analog techniques that even the most permanent inkjet prints have longevity estimated in years, while photographic prints’ longevities are estimated in centuries. I often tell inkjet users that in a hundred years or so, their prints will be almost blank sheets of paper, their images faded to nothingness, while images produced ten years ago with analog methods will still be pristine. It is like an entire generation of digital photography has opted to remove itself from the future of art history.

    In response, fine art inkjet users often say, it’s a digital print, just issue a CDROM of the print as a backup, and if the print fades, make a new one. It should be obvious what is problematic about the idea of digital backups on CDs intended to outlive printed media.

    Art history is currently a lottery: the few artworks that managed to survive war, fire, flood, disaster, discarding, and decomposition are the only examples we have to study. Reducing the odds of a physical print’s survival by handicapping its permanence seems to be a big step backwards. I don’t see digital data archiving as a solution to this problem.

  4. I’m not sure I agree that this is a device or sharing or duplication issue. The issue is similar to the one stated so well in Harry Nilsson’s “The Point” in the early ’70s: having a point in every direction is the same as having no point at all.

    The simple fact that there was a barrier to content creation in times past meant that, in general, if someone surmounted the barrier then there was a greater likelihood that the content was interesting (when it was discovered years later).

    When all content is always available it will require even greater editorial skills to discern a given society’s truly interesting and defining content. Anthropologists will use different content to understand our culture… and I propose that the most interesting content will be the non-digital artifacts we leave behind.

  5. The point with regard to photo print permanence is only partially correct. B&W silver prints, properly rinsed, have lasted for a century so far, and will likely last for a few more. Color prints and slides are another matter. Only Kodachrome slides are known to be largely fade-free. All other methods last about a decade without fading, and Kodachrome is essentially no more (some rolls and one processor existed last I looked). Conventional color prints don’t last. Dye transfer color prints appear to be very durable.

  6. Kelly Roberts

    There was a really interesting article in the NYT about the preservation of the “born-digital” files of writers, including Rushdie and Updike (the latter sent a number of floppy disks to Harvard right before he died). They don’t really know what to do with all this stuff yet, other than store it in a cool place. Fingers crossed!

    Full article is here:

    http://www.nytimes.com/2010/03/16/books/16archive.html?pagewanted=1&ref=books

  7. Tom Lord

    Nick, some notes:

    First about libraries and then about tech. and tech business politics:

    People who do the main work of making libraries function (librarians, non-library executives, and library paraprofessionals) have wrestled with these issues for years in multiple ways, sometimes coming up with contradictory answers.

    One perennial concern has been the cataloging of on-line resources, in spite of the fact that many interesting resources have no stable identifiers or archivable content. As an example, a library may “hold” (so to speak) in its collection an on-line only academic journal or research database. What this often actually means is that the library (or its host organization) owns an authentication key that permits access to a remote archive at some URL. So the library’s catalog will record (for example) “Journal of Gobbledygook” with an annotation “on-line only” and a URL. The problem, of course, is that what the library actually holds in its archived collection is something like a password or a contract with the publisher – not the journal itself. There are precious few guarantees that the cataloged resource will remain accessible in the same way that a book in the stacks will remain accessible. It’s a bit like saying you own a tuxedo when all you actually possess is a gift certificate, with an expiration date, from a particular tux rental place. (In the case of non-commercial on-line resources that are cataloged, you don’t even have the gift certificate – just a URL and a description of what could be found there the last time anyone updating the catalog happened to check.)

    Sometimes libraries make an effort to collect and archive local copies of digital materials. For the most part, the efforts to protect and preserve the local archive are quite weak compared to those for physical materials in the stacks. And, as you say, there are deep difficulties ensuring that the software to display these archived materials will persist in the future. Additional problems include the lack of automation software to aid in such archival tasks – it often comes down to the grunt work of maintaining a “list of things we need to download” and a “list of things we’ve downloaded” – all by hand. And, unlike books, those free-floating digital forms lack widely used stable identifiers like official titles and edition numbers or ISBNs. So it’s a big, ad hoc, labor-intensive mess.

    You say that perhaps over time we’ll find less and less reason to maintain physical archives. I think you know but I’ll mention that that’s already started. There is a sub-faction within the library field who are eager to remove stacks from certain library spaces and replace them with computer terminals. It can be very cost-effective and enabling – short-term – in many situations.

    That said, about tech and tech politics: It would *not* be easy other than conceptually to build out the system software infrastructure for wide-spread publication of digital materials in archive-suitable formats, with materials generally having stable, location-independent identifiers akin to titles and editions, etc. It’s a problem that today’s big businesses have little incentive to address because it implies pushing out all of your valuable content in a de-facto non-rival form that is *more useful* than the centralized-streaming form you currently use. In the world of software freedom advocacy quite a few of us have been puzzling for a few years about how to make the conceptually simple distributed and decentralized form of publication viable – consistently getting stuck on points where we know what needs to be built but can’t raise the capital investment to build it.

    Finally, if you want to look into one of the darker aspects of this circumstance, consider that the problem of archival is one that can be solved in large part – even against today’s tech – with sufficient investment. And then wonder how much of that investment will be closely and privately held. That is, today’s circumstance leads to a situation where a few elite know history much better than everyone else. And you don’t even get to know that they know. It can get a little weird when, for example, a big powerful entity remembers far more about your life than do you, yourself.

  8. “drcodd,” I’m speaking of fine art processes, and dye transfer is the minimum standard. I work in an obscure color process with no known upper bound on longevity; the prints will last as long as the paper they are printed on. I’ve seen 150-year-old images of this type that are exceptionally vivid.

    At this point, even the worst of the fine-art analog photo printing methods will outlast even the best digital printing methods. Even the best inkjet prints can only hope to last as long as the fragile Talbotypes I’ve seen.

  9. Mike Snow

    I’m continually amazed at how good sci-fi writers are at predicting the future. The whole idea (along with many others) of copies, back-ups etc. is dealt with wonderfully by Charles Stross in “Accelerando”.

    Want to see the future? The objectives are being documented in sci-fi literature right now.

  10. AOL’s announcement that they may shut down Bebo is a perfect example of digital/cloud decay. All those users’ profiles, full of content, could disappear if the company doesn’t find a buyer. Just like that.

  11. The problem of changing digital formats and their archival value is already visible in some specialised fields.

    Take, for instance, digital sound recordings made on tape. I first read about this issue in an interview with Steve Albini back in 1997. Even at that time there was apparently a substantial number of master recordings made on deprecated proprietary tape formats. The machines were no longer being made and thus the recordings were effectively unplayable (unless one could locate a working machine and solve potential tracking problems).

    Digital recording has shifted onto hard disks and now uses common formats, but that only shifts the decay horizon to maybe a couple of decades (as opposed to several years, which was the case in the scenario described by Albini).