Linkrot poses serious threat to Web archives

NET RESULTS: Like many people who get an email account, I was intrigued by the chunk of server space for hosting a personal website that came with it. Five megabytes of website real estate seemed an awful lot to play around with if, like me, you're a fan of small, text-based websites rather than the ones with lots of fussy graphical elements, such as Flash animations.

So I had a think about what I'd put on a site - something, I hoped, that might serve as a modest resource for other people - learned a little HTML, the language in which Web pages are written, and went to work. Anyone who has created a website knows what comes next - many late nights trying to make the text align the way you want, or desperately emailing a friend halfway around the world to beg some advice on how to resolve your HTML glitch from hell.

After a while I tried out various Web-design software programs that made this whole process a lot easier. Even in their earliest versions, most were fairly impressive for a newcomer to website creation - you could do some of the more complex layout tricks just by clicking buttons or dragging and dropping. If what you wanted was a mostly textual website with a few images thrown in, and the possibility of something a bit more fancy when you felt the urge to explore, the programs were ideal.

Before long I had an expanding library of links that had anything (and I do mean anything) to do with the broad categories of technology and culture, for a website called - surprise! - Technoculture (http://indigo.ie/~karlin).

However, a problem began to emerge. In the first flush of the Web's development, the Net seemed like it would be an ideal, always-accessible archive for information. Create a site and there it would be, occupying its own little bit of Web property from now into the future. But that very quickly proved not to be the case.

People began to abandon their sites. Maybe their business closed, or they changed their Web address, or they moved to a new university and shut down some interesting online art project they used to curate. Whatever the reason, my own site gradually ended up as a headache of dead and broken links. This was particularly true of the oddball sites I included on Technoculture. The dotcom crash took out a whole lot more.

You can download little software tools to help you find the dead links, but this wasn't very high on my agenda. So I just mothballed Technoculture. I left the site up, but didn't update it for almost two years - until the past week, when I decided I really had to go in and clean it up. A few people with well-trafficked sites had linked to Technoculture and that meant it was getting visitors again and also showing up in search engines.

I ran one of those dead-link programs and was taken aback at how much of the Web that I had really liked was now dead and gone - perhaps a fifth of the sites. Many of the casualties were predictable - quirky, labour-of-love sites that were clearly run by designers working for annihilated dotcoms. However, some of the missing-in-action sites seemed odd - big, sprawling sites like the much-admired literary site The Libyrinth, which had gone through many incarnations over the years and clearly had extremely dedicated creators.
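
For the curious, a dead-link program of this sort can be sketched in a few lines of Python. This is only a minimal illustration, not the tool used on Technoculture: the names `extract_links`, `fetch_status` and `find_dead_links` are made up for the example, and it assumes that any link answering with an HTTP status of 400 or higher, or not answering at all, counts as dead.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the absolute http(s) address of every <a href> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip mailto:, relative and empty links; keep only Web addresses.
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(html):
    """Return the list of Web links found in an HTML string, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if nothing answers.

    Assumption: a HEAD request is enough. Some servers refuse HEAD,
    so a real tool might retry with GET before giving up.
    """
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code  # the server answered, but with an error such as 404
    except (URLError, OSError):
        return None  # no such host, connection refused, timeout and so on


def find_dead_links(html, fetch=fetch_status):
    """Return (url, status) pairs for every link that looks dead."""
    dead = []
    for url in extract_links(html):
        status = fetch(url)
        if status is None or status >= 400:
            dead.append((url, status))
    return dead
```

Calling `find_dead_links` on the HTML of a links page returns the addresses that failed, each with its status code, or `None` where no server answered. Because `urlopen` follows redirects, a site that forwards visitors to its new home is not reported. A site that has simply moved without leaving a forwarding address still shows up as dead.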

That's when I started using the search engine Google.com to double-check such sites. And lo and behold, many turned out to be living somewhere else on the Web, not gone and forgotten but simply transplanted. I happily reinstated those places with their new URLs (the Libyrinth is now at www.themodernword.com/). But the experience got me thinking about one of the real and as yet unresolved problems of this particular form of "publication" - linkrot, the increasing prevalence of mouldy old useless links.

Linkrot is not a minor issue - just look at the uproar recently when the newspaper the San Jose Mercury News changed its entire archiving system for its web pages. Over the years, many websites had linked to particular stories, particularly those of the Merc's well-known tech columnist Dan Gillmor. All over the Web, those links went dead. People weighed in to discussion groups and weblogs (web diaries) across the net, voicing their frustrations with this trend.

I've experienced this problem myself, when the Guardian decided to update its archives and in the process eliminated or reclassified many, many articles. Unfortunately that included nearly all the work I'd done for them over a two-year period. Ouch! In many cases I no longer had print versions and, thus, I lost a whole body of my published stories.

Such wholesale reorganisations of the Web actually remake the historical record in a way that simply doesn't happen with other forms of publication. It's as if your local library decided one day to reclassify and relocate whole shelves of its books, and throw out random copies, but without telling anyone where to look for them. Or as if a newspaper just dumped its print archives, trusting that anyone who wanted a copy of a particular story had saved it when it was first printed.

The problem of dead links has serious implications for our ability to maintain and archive our collective, published past. The Web is indeed a new and increasingly mainstream form of publication - for example, scholars now have agreed-upon ways of citing an item published on the Web. But without any standardised way of automatically forwarding people to updated links, we stand to lose large pieces of what once would have been the print record.

All is not entirely lost. Some pages from that lost Web of the past end up cached (stored as a copy) in search engines such as Google.com. Then there's the wonderfully ambitious Internet Archive (www.archive.org) project, which is an attempt to store all of the Web, as it has ever existed, in a searchable form.

But linkrot is a very big and very worrying problem. We need to find a way of creating and maintaining a permanent library of the fragile, ephemeral electronic mesh of information that has become, and will increasingly be, one of our major forms of publication.

Karlin Lillington

Karlin Lillington, a contributor to The Irish Times, writes about technology