Wednesday, February 24, 2010

IBM leans on Hadoop for BigSheets data analysis

IBM is expected to announce Thursday that it is working with the British Library on a project that will preserve and analyze terabytes of information on the Web before it is lost forever. Recent research estimates the average life expectancy of a Web site is 44 to 75 days. Every six months, roughly 10 percent of Web pages on the U.K. domain are lost.

The new analytics software project, called IBM BigSheets, helps extract, annotate, and visually analyze vast amounts of Web information using a Web browser. The British Library is using a prototype of the software to archive and preserve massive numbers of Web pages to ensure the data doesn't disappear over time.

Beyond its physical collections, the British Library has been archiving selected Web pages from the U.K. domain since 2004. According to David Boloker, CTO of Emerging Technologies at IBM, BigSheets will let future users of the library access vast archives of historic Web sites, research and analyze their queries, and visualize the search results.

For most personal sites, the disappearance of pages is no big loss, but for organizations attempting to archive and chronicle elections, news, media, and video, this data loss presents massive challenges. And even if you have the data, the question remains whether it will be usable, or even in a recognizable format.

Boloker explained that BigSheets is a private cloud service running parallel MapReduce jobs on all of the library's machines. And while it's a private cloud (take note--private cloud spotted in the wild), the British Library will make the data and services available for people to access.
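For readers unfamiliar with the MapReduce pattern mentioned above, here is a minimal single-process sketch in Python. It is not BigSheets or Hadoop code, and the function names and sample URLs are hypothetical, chosen only for illustration. It counts archived pages per host: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step combines each group. On a real cluster, the map and reduce steps would run in parallel across many machines.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_page(url):
    """Map step: emit one (host, 1) pair for an archived URL."""
    host = urlparse(url).netloc.lower()
    yield host, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(host, counts):
    """Reduce step: sum the values collected for one host."""
    return host, sum(counts)


def pages_per_host(urls):
    """Run map, shuffle, and reduce in sequence (one process, no cluster)."""
    pairs = (pair for url in urls for pair in map_page(url))
    grouped = shuffle(pairs)
    return dict(reduce_counts(host, counts) for host, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample of archived URLs.
    sample = [
        "http://example.co.uk/a",
        "http://example.co.uk/b",
        "http://Other.org.uk/",
    ]
    print(pages_per_host(sample))
```

Because each URL is mapped independently and each host is reduced independently, the work can in principle be split across as many machines as are available, which is the property that makes this pattern a fit for terabyte-scale archives.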

There is no shortage of data to analyze these days, and more government agencies and large corporations will find themselves in search of these types of solutions. What's nice to see is that open source software, and perhaps more importantly Apache-licensed open source software such as Hadoop, is what next-generation analytics tools are being built on.

 