Re: [-empyre-] A last post from Internet Archive

Dear fellow digital archivists,

I have only now been able to catch up and join this list after a month of travel in February! It was a daunting task to scan through the 100+ emails generated on this topic. However, I was excited to see all of the enthusiasm around the issues of archiving the web and other digital content. In the 3 years I have been in this position at Internet Archive, the increase in activity, interest and resources for this type of work has been energizing.

I want to thank Paul Koerbin for his thoughtful and accurate comments on my behalf about the activities of IA. I also want to thank Yvette (with whom I spent the first week of February) for providing me with a "cheat sheet" of the dialogue that had transpired over email.

I will give a quick summary of what the Archive does for anyone still interested, and would welcome any direct emails to me for further clarification (I have no travel scheduled for at least 2 weeks!). I gave a presentation on the Archive's web archiving activities and philosophy at the NLA conference in December. You can find a copy of my presentation at

IA has been taking snapshots of the publicly available web since 1996. Each snapshot is taken by Alexa (a company now owned by Amazon), and the material collected is donated to IA. A snapshot is taken every 8 weeks. In 1996 the total amount of data collected was 2 TB for the year. The size of a snapshot today is roughly 100 TB (uncompressed) and over 4 billion pages. The Alexa crawler is prioritized based on where Alexa users visit most frequently: it starts from a list of websites based on Alexa user traffic and continues to discover URLs from those starting points over the next 8 weeks. It does not do a focused, site-by-site crawl. It is a broad crawl, which means that beyond the initial seed list of sites it is allowed to capture whatever URLs it discovers, in no particular order. This is why you may find only the first few pages of a lesser-known website.
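To make the broad-crawl behaviour concrete, here is a minimal sketch in Python. It is not the Alexa crawler. The link graph, the example URLs, the page budget (standing in for the 8-week window) and the breadth-first ordering are all assumptions chosen for illustration. It shows how a crawl that starts from popular seeds and follows whatever it discovers can end up holding only the front page of a smaller site.

```python
from collections import deque
from urllib.parse import urlparse


def broad_crawl(seeds, links, budget):
    """Capture up to `budget` pages, starting from `seeds`.

    seeds  : URLs ordered by priority (most-visited first).
    links  : dict mapping a URL to the URLs it links to; a hypothetical
             stand-in for fetching a page and extracting its links.
    budget : maximum number of pages to capture (assumed stand-in for
             the fixed crawl window).
    """
    frontier = deque(seeds)
    seen = set(seeds)
    captured = []
    while frontier and len(captured) < budget:
        url = frontier.popleft()
        captured.append(url)
        # No per-site scoping: any newly discovered URL joins the queue.
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return captured


def pages_per_site(captured):
    """Count captured pages per host."""
    counts = {}
    for url in captured:
        host = urlparse(url).netloc
        counts[host] = counts.get(host, 0) + 1
    return counts


# Hypothetical link graph: one popular site that links to one small site.
LINKS = {
    "http://popular.example/": [
        "http://popular.example/a",
        "http://popular.example/b",
        "http://small.example/",
    ],
    "http://popular.example/a": ["http://popular.example/c"],
    "http://small.example/": [
        "http://small.example/1",
        "http://small.example/2",
    ],
}

captured = broad_crawl(["http://popular.example/"], LINKS, budget=5)
print(pages_per_site(captured))
```

With a budget of 5 pages, the small site's front page is captured but its two inner pages are still waiting in the queue when the crawl stops.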

Over the last 2 years, IA has developed the capability to do harvesting internally, both to supplement the crawl donated by Alexa and to provide web harvesting services to our partners. The crawler was developed in cooperation with the national libraries of the Nordic countries and through specifications developed within the IIPC framework ( This crawler is open source and available for use by anyone. You can find more information and the crawler source code at IA has crawling projects with several national libraries, archives and universities. In these instances, we are given a set of web sites, a country domain, or a topic area to harvest, and we run the crawler internally to capture the defined "web sphere" in depth. This content is then delivered to the partner institution and/or hosted by IA. An example of this type of partnership can be found at This harvest was done for the US National Archives recently.
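By contrast with a broad crawl, a focused harvest follows links only while they stay inside an agreed scope. The sketch below is an illustration only and does not use the API of the crawler mentioned above. The function names, the scope rules (an exact host list or a domain suffix such as a country domain) and the example URLs are all assumptions.

```python
from collections import deque
from urllib.parse import urlparse


def in_scope(url, hosts=(), domain_suffixes=()):
    """Return True if `url` belongs to a listed host or domain suffix."""
    host = urlparse(url).hostname or ""
    if host in hosts:
        return True
    return any(host == s or host.endswith("." + s) for s in domain_suffixes)


def focused_crawl(seeds, links, hosts=(), domain_suffixes=()):
    """Capture every reachable page that stays inside the scope.

    links is a dict mapping a URL to the URLs it links to; a hypothetical
    stand-in for fetching a page and extracting its links.
    """
    frontier = deque(u for u in seeds if in_scope(u, hosts, domain_suffixes))
    seen = set(frontier)
    captured = []
    while frontier:
        url = frontier.popleft()
        captured.append(url)
        for target in links.get(url, []):
            # Out-of-scope links are dropped rather than followed.
            if target not in seen and in_scope(target, hosts, domain_suffixes):
                seen.add(target)
                frontier.append(target)
    return captured


# Hypothetical partner site with one link that leaves the scope.
SITE_LINKS = {
    "http://agency.example/": [
        "http://agency.example/reports",
        "http://elsewhere.example/",
    ],
    "http://agency.example/reports": ["http://agency.example/reports/2004"],
}

harvest = focused_crawl(
    ["http://agency.example/"], SITE_LINKS, hosts={"agency.example"}
)
print(harvest)
```

There is no page budget here: the crawl runs until the in-scope frontier is empty, which is what lets it capture a site in depth.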

I will sign off now, but please feel free to email me directly if you would like more details.

Michele Kimpton
Internet Archive
On Feb 28, 2005, at 12:07 AM, Melinda Rackham wrote:

Dear -empyre-
This month's topic of Preserving Our Online Heritage has
explored a variety of issues surrounding the collection and
archiving of digital records, whether they be world wide web
research or cultural documents, or one of the many varieties
of internet art. Acquisition and conservation measures are
more than just copying files onto terabyte storage disks
for later retrieval, migration or emulation. As our guests
have vigorously discussed, they involve many ethical,
political, taxonomical and technical issues.

I would especially like to thank these guests who are
leaders in preservation of web resources: Margaret
Phillips, Paul Koerbin and Gerard Clifton from the PANDORA
internet archive at the National Library of Australia; Nancy
McGovern, Digital Preservation Officer at Cornell University
Library; Sharmin (Tinni) Choudhury, the software engineer
for the PANIC digital preservation project; metadata standards
expert Dr Simon Pockley from Deakin University; and Luciana
Duranti (UBC), Yvette Hackett and Jim Suderman from
InterPARES 2, Canada.

Thank you enormously to Doran Golan from
and Valerie leBlanc from Facts and Artifacts for
representing the collector and artist perspective in this
discussion. Unfortunately both Graham Crawford from Tool,
and Kevin McGarry from Rhizome could not join us due to

Several of the issues which arose in our discussion,
including the ephemerality of net art, and what we should
and shouldn't archive, are also being discussed
on CRUMB
Please check out their archives as well.

That brings us to the end of a great month, and please join
us for next month's topic of Interactive video.

Thank you all once again.

Melinda Rackham
artist | curator
