Crawler

You are probably reading this page because you found the link in one of your web server’s access logs. This page explains what is going on.

Your site has been visited (or is still being visited) by a crawler run by the information processing and analytics research group. The crawler gathers regular snapshots of the websites of German universities and research institutes as part of the REGIO and Unknown Data projects and the L3S Web Observatory. The software used is Heritrix, a sophisticated crawler developed and used by the Internet Archive. In particular, the crawler strictly adheres to the robots exclusion standard, which means you can easily restrict its access yourself.
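
As a rough illustration of what such a restriction looks like, the sketch below uses Python's standard urllib.robotparser module to check a set of robots.txt rules before you deploy them. The user-agent token "example-crawler", the host, and the paths are placeholders chosen for this example, not the crawler's real values; the token to use is the one that appears in your access log.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules. The token "example-crawler" and the paths
# are assumptions for illustration; use the user-agent string from your log.
ROBOTS_TXT = """\
User-agent: example-crawler
Disallow: /calendar/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.org/calendar/2022/12/",
        "https://www.example.org/research/",
    ):
        print(url, is_allowed(ROBOTS_TXT, "example-crawler", url))
```

With these placeholder rules, everything under /calendar/ is closed to the named crawler while the rest of the site stays open, and all other crawlers are unaffected.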

Although there are some heuristics to avoid so-called spider traps, the crawler is sometimes caught by a website that generates infinitely many links (calendar pages are a frequent example). If you detect such an access pattern, we would be very grateful if you could let us know.
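
If you want to check your own logs for such a pattern, one simple heuristic is to collapse every run of digits in the requested URLs and count how many distinct URLs share the same shape. The sketch below is only an illustration under that assumption; it is not how Heritrix detects traps, and the function name, the sample URLs, and the threshold of 100 are arbitrary choices for this example.

```python
import re
from collections import defaultdict


def suspicious_patterns(urls, threshold=100):
    """Group URLs by shape (digit runs collapsed to "N") and return the
    shapes that cover at least `threshold` distinct URLs."""
    groups = defaultdict(set)
    for url in urls:
        shape = re.sub(r"\d+", "N", url)
        groups[shape].add(url)
    return {
        shape: len(members)
        for shape, members in groups.items()
        if len(members) >= threshold
    }


if __name__ == "__main__":
    # Made-up request paths: a calendar that links to every month forever,
    # plus two ordinary pages.
    requested = [
        f"/calendar?year={year}&month={month}"
        for year in range(2000, 2020)
        for month in range(1, 13)
    ]
    requested += ["/", "/research/"]
    print(suspicious_patterns(requested))
```

A shape that accounts for hundreds of distinct URLs, such as a calendar with ever-changing year and month parameters, is a likely candidate for a trap and worth reporting.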

If you have any questions, comments, or complaints, please contact Robert Jäschke by e-mail or phone (+49 (0)30 2093-70960). Please include the host or URLs that were crawled and the time of access.

The latest crawl started on December 6, 2022, and is expected to finish on December 20, 2022. Some statistics on prior crawls can be found on the German Academic Web page.

[Figure: crawl progress]