[-empyre-] Archives, metadata and searching
I'll respond primarily to komninos's question about searching and
metadata.
This is a real issue and one that does cause us some headaches at the
NLA. As a library we invest much of our effort in creating metadata, in
the form of the MARC records that make up the NLA's own online catalogue
and the National Bibliographic Database. As the librarians who do this
work, we would like to think people use these catalogues to find resources.
The reality, as we know, is that people are more likely to come to a web
portal or search engine to locate web publications. Unfortunately, and
frustratingly, we have found it very difficult to find a search engine
that we can afford and that can usefully index the PANDORA archive and
return relevant, ranked search results. To provide full text indexing we
would currently need to index around 15 million files (this is a rough
guess). Even if we achieved this, problems would remain in providing
useful search results. We archive many sites regularly, and each time
this involves capturing the whole site again, so the same files appear
many times. That raises the question of which version should be
presented in search results (all of them? the first harvested copy? the
last?).
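To make the version question concrete, here is a minimal sketch in
Python. It is purely illustrative and does not describe any software
used for PANDORA: the Capture record, its field names and the
select_versions function are all hypothetical. It assumes each search
hit carries the original URL and a harvest date, and collapses repeated
captures of the same URL according to a chosen policy.

```python
from collections import defaultdict
from typing import NamedTuple


class Capture(NamedTuple):
    """One harvested copy of a file (hypothetical record layout)."""
    url: str        # original URL of the archived file
    harvested: str  # harvest date in ISO form, so strings sort by date


def select_versions(hits, policy="last"):
    """Collapse repeated captures of the same URL.

    policy is "all", "first" (earliest harvest) or "last" (latest).
    """
    if policy not in {"all", "first", "last"}:
        raise ValueError(f"unknown policy: {policy!r}")

    # Group the hits by original URL, keeping first-seen URL order.
    by_url = defaultdict(list)
    for hit in hits:
        by_url[hit.url].append(hit)

    results = []
    for captures in by_url.values():
        ordered = sorted(captures, key=lambda c: c.harvested)
        if policy == "all":
            results.extend(ordered)
        elif policy == "first":
            results.append(ordered[0])
        else:
            results.append(ordered[-1])
    return results


# Made-up example data: one page harvested twice, another once.
hits = [
    Capture("http://example.org/a", "2003-05-01"),
    Capture("http://example.org/a", "2004-11-20"),
    Capture("http://example.org/b", "2004-01-15"),
]
latest_only = select_versions(hits, policy="last")
```

With the "last" policy each URL appears once, represented by its most
recent harvest. Whether that is the right default for an archive, where
the earlier copies are the whole point, is exactly the open question.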
We have in fact taken the decision to remove the search engine
temporarily from PANDORA, as we found it caused more frustration than
joy. We are looking at a temporary solution that may provide a very
minimal sort of search, really just at the title level. To some degree
this is already achieved by Google. For example, if you search for
Flight of Ducks and add 'pandora' to the query, the link to the
PANDORA Archive appears at the top of the list. Assuming you don't know
about PANDORA but are interested in an archived version, you could do the
same search replacing 'pandora' with 'archive', and the link to the
PANDORA version appears second on the list. However, we only allow
Google to index the Archive to the 'title entry page' level, not deep
into its contents.
On komninos's other point, the resources in the PANDORA Archive are
maintained at the NLA although we have partners selecting resources and
undertaking the archiving in other agencies (e.g. all the mainland state
libraries). The reason the NLA maintains the archived resources is that
other agencies are not currently in a position to do so. But it would
certainly be desirable to share the burden. The PANDORA Archive does not
cover Tasmanian resources because these are archived by the State
Library of Tasmania through Our Digital Island
(http://odi.statelibrary.tas.gov.au/). Currently PANDORA and ODI only
provide top level links to each other, but the idea of providing access
to resources in both archives through a common portal has certainly been
discussed.
If you are going to have distributed archives, which I think is the way
we are necessarily heading, the archives need to have a commitment to
long-term preservation (and all that that entails!). This may come from
an interest in maintaining the work of their own staff, as with
institutional repositories in universities. Anyway, the vision of the
International Internet Preservation Consortium
(http://www.netpreserve.org) as articulated by the then IIPC coordinator
Julien Masanes at the Archiving Web Resources international conference
last November (at the NLA) was certainly one of an interconnected grid
of archives.
Komninos mentioned 'digitised' holdings being held at various locations.
This is a bit outside my own focus, which is born-digital publications
collected and maintained in an archive, but an example of distributed
digitised holdings accessible through a search facility is Picture
Australia (http://www.pictureaustralia.org/).
Paul
Paul Koerbin
Supervisor
Digital Archiving Section
National Library of Australia
(02) 6262 1411
pkoerbin@nla.gov.au