Re: [-empyre-] A last post from Internet Archive



Dear fellow digital archivists,

It is only now, after a month of travel in February, that I have been able to catch up and join this list! Scanning through the 100+ emails generated on this topic was a daunting task, but I was excited to see all of the enthusiasm around the issues of archiving the web and other digital content. In the three years I have held this position at the Internet Archive, the increase in activity, interest, and resources for this type of work has been energizing.

I want to thank Paul Koerbin for his thoughtful and accurate comments on my behalf about the activities of IA. I also want to thank Yvette (with whom I spent the first week of February) for providing me with a "cheat sheet" of the dialogue that had transpired on email.

I will give a quick summary of what the Archive does for anyone still interested, and would welcome any direct emails to me for further clarification (I have no travel scheduled for at least two weeks!). I gave a presentation on the Archive's web archiving activities and philosophy at the NLA conference in December. You can find a copy of my presentation at
http://www.nla.gov.au/webarchiving/program.html


IA has been taking snapshots of the publicly available web since 1996. The snapshot is done by Alexa (a company now owned by Amazon), and the material collected is donated to IA. A snapshot is done every 8 weeks. In 1996 the total amount of data collected was 2 TB for the year; the size of a snapshot today is roughly 100 TB (uncompressed) and over 4 billion pages. Alexa's crawl is prioritized by where Alexa users visit most frequently: the crawler starts from a list of websites ranked by Alexa user traffic and continues to discover URLs from that path over the next 8 weeks. It does not do a focused site-by-site crawl; it is a broad crawl, which means that beyond the initial seed list of sites, it is allowed to capture whatever URLs it discovers, in no particular order. This is why you may find only the first few pages of lesser-known sites.
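For readers curious about the mechanics, the broad-crawl behavior described above can be sketched in a few lines. This is a minimal illustration only, assuming an in-memory link graph stands in for the live web and a page budget stands in for the 8-week window; the actual Alexa crawler is far more elaborate, and all names here are hypothetical.

```python
from collections import deque

def broad_crawl(seed_list, link_graph, page_budget):
    """Breadth-first broad crawl: start from a traffic-ranked seed list,
    enqueue every newly discovered URL, and stop when the page budget
    (standing in for the crawl window) is exhausted. No site is crawled
    in depth on purpose; per-site coverage depends on discovery order."""
    queue = deque(seed_list)          # frontier of URLs waiting to be fetched
    seen = set(seed_list)             # avoid re-capturing the same URL
    captured = []                     # pages actually archived, in order
    while queue and len(captured) < page_budget:
        url = queue.popleft()
        captured.append(url)          # "fetch and archive" the page
        for outlink in link_graph.get(url, []):
            if outlink not in seen:   # capture whatever URLs are discovered
                seen.add(outlink)
                queue.append(outlink)
    return captured
```

Because the budget runs out before the frontier does, a lesser-known site reached late in the crawl contributes only its front pages, which is exactly the effect noted above.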

Over the last two years, IA has developed the capability to do harvesting internally, both to supplement the crawl donated by Alexa and to provide web harvesting services to our partners. The crawler was developed in cooperation with the Nordic country national libraries and to specifications developed within the IIPC framework (netpreserve.org). This crawler is open source and available for use by anyone; you can find more information and the crawler source code at crawler.archive.org. IA has crawling projects with several national libraries, archives, and universities. In these instances, we are given a set of web sites, a country domain, or a topic area to harvest, and we run the crawler internally to capture the defined "web sphere" in depth. The content is then delivered to the partner institution and/or hosted by IA. An example of this type of partnership can be found at www.webharvest.gov; this harvest was done recently for the US National Archives.
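The key difference between this focused harvesting and the broad Alexa crawl is a scope rule: links are followed only while they remain inside the defined "web sphere", and that sphere is exhausted in depth. The real crawler is a configurable Java application and nothing like this toy; the following sketch, with hypothetical names and an in-memory link graph, only illustrates the scoping idea.

```python
def scoped_crawl(seeds, link_graph, in_scope):
    """Focused harvest: follow links only while they satisfy the scope
    predicate (e.g. membership in a country domain or a site list),
    and keep going until the whole in-scope sphere is captured."""
    frontier = list(seeds)
    seen = set(seeds)
    captured = []
    while frontier:
        url = frontier.pop()
        captured.append(url)              # "fetch and archive" the page
        for outlink in link_graph.get(url, []):
            if outlink not in seen and in_scope(outlink):
                seen.add(outlink)         # out-of-scope links are dropped
                frontier.append(outlink)
    return captured
```

With a scope predicate such as `lambda url: ".gov" in url`, every reachable in-scope page is archived and everything else is ignored, which is the opposite trade-off from the broad crawl.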

I will sign off now, but please feel free to email me directly if you would like more details.

Sincerely,
Michele Kimpton
Internet Archive
On Feb 28, 2005, at 12:07 AM, Melinda Rackham wrote:

Dear -empyre-
This month's topic of Preserving Our Online Heritage has
explored a variety of issues surrounding the collection and
archiving of digital records, whether they be world wide web
research or cultural documents, or one of the many varieties
of internet art. Acquisition and conservation measures are
more than just copying files onto terabyte storage disks
for later retrieval, migration or emulation. As our guests
have vigorously discussed, they involve many ethical,
political, taxonomical and technical issues.

I would especially like to thank these guests, who are
leaders in the preservation of web resources: Margaret
Phillips, Paul Koerbin and Gerard Clifton from the PANDORA
internet archive at the National Library of Australia; Nancy
McGovern, Digital Preservation Officer at Cornell University
Library; Sharmin (Tinni) Choudhury, software engineer
for the PANIC digital preservation project; metadata
standards expert Dr Simon Pockley from Deakin University;
and Luciana Duranti (UBC), Yvette Hackett and Jim Suderman
from InterPARES 2, Canada.

Thank you enormously to Doran Golan from
ComputerFineArts.com and Valerie leBlanc from Facts and
Artifacts for representing the collector and artist
perspectives in this discussion. Unfortunately both Graham
Crawford from Tool and Kevin McGarry from Rhizome could not
join us due to illness.

Several of the issues which arose in our discussion,
including the ephemerality of net art and what we should
and shouldn't archive, are also being discussed
simultaneously on CRUMB:
http://www.jiscmail.ac.uk/lists/new-media-curating.html.
Please check out their archives as well.

That brings us to the end of a great month. Please join
us for next month's topic of Interactive Video.

Thank you all once again.

Melinda Rackham
artist | curator
www.subtle.net


_______________________________________________
empyre forum
empyre@lists.cofa.unsw.edu.au
http://www.subtle.net/empyre
