Text 17164, 141 lines
Written 2007-03-29 18:13:52 by mike (1:379/45)
Subject: Linux to help the Library of Congress save American history
===================================================================
From: mike <mike@barkto.com>
http://www.linux.com/article.pl?sid=07/03/26/1157212
===
The Library of Congress, where thousands of rare public domain documents
relating to America's history are stored and slowly decaying, is about to begin
an ambitious project to digitize these fragile documents using Linux-based
systems and publish the results online in multiple formats.
Thanks to a $2 million grant from the Sloan Foundation, "Digitizing American
Imprints at the Library of Congress" will begin the task of digitizing these
rare materials -- including Civil War and genealogical documents, technical and
artistic works concerning photography, scores of books, and the 850 titles
written, printed, edited, or published by Benjamin Franklin. According to
Brewster Kahle of the Internet Archive, which developed the digitizing
technology, open source software will play an "absolutely critical" role in
getting the job done.
The main component is Scribe, a combination of hardware and free software.
"Scribe is a book-scanning system that takes high-quality images of books and
then does a set of manipulations, gets them in optical character recognition
and compressed, so you can get beautiful, printable versions of the book that
are also searchable," says Kahle.
While previous versions were written for both Linux and Windows, the Internet
Archive has migrated Scribe entirely to Linux, and Windows support has been
dropped. Kahle says the project uses Ubuntu now.
When asked why the Library of Congress chose Scribe for this project, Dr.
Jeremy E. A. Adamson, the library's director for collections and services,
replies that the Internet Archive has already demonstrated "the efficient
production of high-quality images" with it.
Kahle says that a Linux-based Scribe workstation at the Library of Congress
will hold the material to be scanned in a V-shaped cradle -- it doesn't crack
books all the way open -- while two cameras take images of it. A human operator
performs quality assurance, then Scribe sends the digital images across the
breadth of the country to the Internet Archive in San Francisco, where they are
processed and eventually posted online in various formats. Free software is
used almost every step of the way.
"[It's a] Linux-based station out there in the field. It rsyncs the files up to
the servers, [and then] it goes and does the processing on a Linux cluster of
over 1,000 machines, and then posts it online -- also on Linux machines," Kahle
says.
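
As a rough illustration of the upload step Kahle describes, the sketch below builds an rsync command line in Python. The article does not say which rsync options the Internet Archive uses; the host name, the paths, and the choice of flags here are hypothetical and chosen only for illustration.

```python
import shlex


def build_rsync_command(local_dir, remote_host, remote_dir):
    """Return an rsync argument list that copies one scanned book to a server.

    The host, paths, and flag choices are illustrative assumptions,
    not the Internet Archive's actual configuration.
    """
    # A trailing slash on the source copies the directory's contents
    # rather than the directory itself.
    source = local_dir.rstrip("/") + "/"
    destination = f"{remote_host}:{remote_dir.rstrip('/')}/"
    return [
        "rsync",
        "--archive",   # recurse and preserve timestamps and permissions
        "--compress",  # compress data in transit
        "--partial",   # keep partially transferred files so a retry can resume
        source,
        destination,
    ]


if __name__ == "__main__":
    cmd = build_rsync_command(
        "/scans/book-0001", "upload.example.org", "/incoming/book-0001"
    )
    print(shlex.join(cmd))
```

The function only assembles the command; passing the list to subprocess.run(cmd, check=True) would perform the transfer on a machine that has rsync installed and access to the remote host.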
Image processing for an average book takes about 10 hours on the cluster, and
while the project still uses proprietary optical character recognition (OCR)
software, Kahle says that many open source applications come into play,
including the netpbm utilities and ImageMagick, and the software performs "a
lot of image manipulation, cropping, deskewing, correcting color to normalize
it -- [it] does compression, optical character recognition, and packaging into
a searchable, downloadable PDF; searchable, downloadable DjVu files; and an
on-screen representation we call the Flip Book."
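
To give a sense of what the deskewing, cropping, and color normalization Kahle mentions can look like with ImageMagick, here is a minimal sketch that assembles a command line for a single page image. The article does not describe the actual Scribe processing scripts, so the option values, the ordering, and the file names below are assumptions.

```python
def build_cleanup_command(src, dst, deskew_threshold="40%", jpeg_quality=85):
    """Return an ImageMagick argument list that cleans up one page image.

    This is an illustrative guess at a page cleanup step, not the
    Internet Archive's pipeline. "convert" is the ImageMagick 6 command
    name; ImageMagick 7 installs usually call it "magick".
    """
    return [
        "convert",
        src,
        "-deskew", deskew_threshold,   # straighten a slightly rotated page
        "-trim", "+repage",            # crop the uniform border, reset canvas offsets
        "-normalize",                  # stretch contrast across the full range
        "-quality", str(jpeg_quality), # compression level for lossy output formats
        dst,
    ]


if __name__ == "__main__":
    print(" ".join(build_cleanup_command("page_0001.tif", "page_0001.jpg")))
```

OCR and the packaging into PDF and DjVu are left out, since the article says only that the OCR software is proprietary and does not name the packaging tools.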
The Flip Book is used at The Open Library, a charmingly retro Web interface for
online books that mimics old technologies (clicking "Details" for a title
brings up a yellowed card catalog entry), which the Internet Archive says was
"inspired by a British Library kiosk."
The books are stored in the PetaBox, which is the Internet Archive's massive
million-gigabyte storage system -- a system that Kahle says is "all built on
open source software."
Caring for brittle books
A good number of the historic materials in question are old, fragile, and in
such rough shape that placing them in Scribe's cradle, or even attempting to
read them, could irreparably damage them. Adamson says that some of the books,
for example, have pages "that have become brittle with age." He says the
materials are in a broad range of conditions that limit their physical
handling, and he uses the general term "brittle books" to describe them. No
list of such brittle materials at the Library of Congress has been made, but
Adamson says that "they comprise a percentage of virtually every collection."
The project's objectives, he says, are to develop a more formal classification
and description of these "brittle" materials and to "establish digitization
workflows based on that classification of condition."
If scanning the brittle materials demands new software and digitization
techniques, the Library of Congress will work in conjunction with the Internet
Archive to make the innovations available to the public. But there's no way to
know at this point what they may be, because the project is only getting
underway.
"The project proposal calls for months of planning before any scanning or
engineering is to begin," Adamson says. And the planning, he says, is
"significant": "Space needs to be prepared to accommodate the physical scanning
of books, server storage allocated, project plans need to be written, project
team members briefed, along with myriad other details required for a project of
this magnitude and complexity."
Eventually, Adamson says, when the scanning and processing of materials have
been completed, the high-quality digitized versions of these historic documents
(and metadata associated with them, such as indices and contents) will be
freely accessible online -- which Kahle says is a "huge step" in broadening the
reach of the ever-too-small public domain.
"There may be public domain books that are sitting on shelves, but if you can't
get access to [something], what good does it do to be in the public domain?"
says Kahle. "The Library of Congress is dedicated to keeping [these digitized
holdings] public domain, which I think is a great step that's not being
followed by everybody else."
The program is part of larger efforts at both institutions: the Library of
Congress is working to preserve old media and records, and the Internet Archive
is already scanning public domain materials with its Open Content Alliance, a
consortium of about 40 libraries. Kahle says that the alliance is presently operating in
five cities, using the Scribe software, at a brisk clip of 12,000 books a
month.
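
For scale, the alliance's figures can be turned into rough averages. The short calculation below uses the 12,000 books a month and five cities quoted above; the 30-day month and the even split between cities are simplifying assumptions, not figures from the article.

```python
def average_rates(books_per_month=12_000, cities=5, days_per_month=30):
    """Return rough averages derived from the quoted monthly scanning rate.

    The even split across cities and the 30-day month are assumptions
    made only to get order-of-magnitude numbers.
    """
    return {
        "books_per_city_per_month": books_per_month / cities,
        "books_per_day": books_per_month / days_per_month,
    }


if __name__ == "__main__":
    rates = average_rates()
    print(rates["books_per_city_per_month"])  # 2400.0
    print(rates["books_per_day"])             # 400.0
```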
"We're part of the 'open world' through and through -- we use open source
software, we generate open source software, we generate open content," says
Kahle. "We're trying to take this open source idea to the next level, which is
open content and open access to cultural materials, which means 'publicly
downloadable in bulk.' I think we're really seeing the next level up of this
whole movement -- we had the open network, then open source software, now we're
starting to see open source content."
Links
"Library of Congress" - http://loc.gov/
"Sloan Foundation" - http://www.sloan.org/
"previous versions" - http://sourceforge.net/projects/scribesw/
"Ubuntu" - http://ubuntu.com/
"Internet Archive" - http://archive.org/
"netpbm utilities" - http://netpbm.sourceforge.net/
"ImageMagick" - http://applications.linux.com/article.pl?sid=05/03/29/1525217&tid=39
"The Open Library" - http://www.openlibrary.org/
"PetaBox" - http://www.archive.org/web/petabox.php
"preserve old media and records" - http://www.digitalpreservation.gov/
"Open Content Alliance" - http://www.opencontentalliance.org/
===
/m
--- BBBS/NT v4.01 Flag-5
* Origin: Barktopia BBS Site http://HarborWebs.com:8081 (1:379/45)