
Penn Web Search Technical Information
The Penn Web Search indexes the content on about 375 web servers across Penn's campus. Most of these are administered directly by schools and departments of the University. The central web server, www.upenn.edu, provides web sites for University of Pennsylvania schools, departments, centers, and institutes that do not have access to a web server within their own school or department. Information about hosting a web site on www.upenn.edu is available.
The www.upenn.edu service is hosted on a pair of fully redundant Compaq AlphaServer
DS20E systems running Tru64 UNIX 5.1. Each has dual Alpha 21264 667
MHz processors and two gigabytes of memory, and is connected to a Compaq
StorageWorks EMA12000 Fibre Channel RAID System. The configuration is
designed to survive the failure of any component, including the loss of an
entire machine room, power grid, local network, or Internet connection.
This highly redundant, survivable configuration is hosted in two data
centers approximately five blocks apart. Each location houses one of the
hosts and one side of the fully mirrored storage array, is served by its
own power grid, and includes fully redundant power and HVAC systems.
Data is replicated in real time between the two locations.
Apache 1.3.26 is the web server daemon.
You can search Penn's web using two different tools: the Simple Search,
based on Google's index of the Penn Web, or the Advanced Search, which uses
AltaVista Search INTRANET 2.3A.
The Advanced Penn Web Search starts indexing from the central web server
homepage and follows links down until it covers all of the web servers running within
the upenn.edu domain. The index is replaced every two weeks. For the installation date of the current index, see http://www.upenn.edu/search/updates.html.
We have provided information to help you maintain your documents so that
they are better indexed by Penn Web Search and other search engines.
A site is automatically included in the central Advanced Search index if there is a link to the site's top-level page
somewhere within the hierarchy of links that starts at the central web server homepage. Approximately 250,000 pages are currently indexed. If your site (or server) is not being indexed by the central index, it may not be linked from anywhere in the hierarchy, and we will have to include it explicitly. Please ask your web administrator to send mail to webmaster@upenn.edu
giving us the starting URL of your web site.
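The inclusion rule above amounts to a reachability check: a page is indexed only if some chain of links leads to it from the starting homepage. The following is a minimal Python sketch of that idea, not the AltaVista indexer's actual algorithm. The link graph is an in-memory dictionary, and the host names dept-a.upenn.edu and orphan.upenn.edu are hypothetical, chosen only for illustration.

```python
from collections import deque


def reachable_pages(start, links):
    """Return the set of pages reachable from `start` by following links.

    `links` maps each page URL to the list of URLs it links to.
    This is a plain breadth-first traversal over that mapping.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical link graph: one department site is linked from the
# central homepage, the other is not linked from anywhere.
links = {
    "http://www.upenn.edu/": ["http://dept-a.upenn.edu/"],
    "http://dept-a.upenn.edu/": ["http://dept-a.upenn.edu/people.html"],
    "http://orphan.upenn.edu/": ["http://orphan.upenn.edu/about.html"],
}

indexed = reachable_pages("http://www.upenn.edu/", links)
```

In this sketch the orphan site never enters the result because no page in the hierarchy links to it, which is the situation in which a starting URL has to be supplied by hand.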
If there are documents on your particular site that should not be indexed, your web administrator will need to maintain
a robots.txt file that will prevent the central index from indexing these documents.
Example of excluding with a robots.txt file
If you had a directory called "test" where you stored pages under development and you didn't want these documents indexed,
your server administrator would create a file called robots.txt in the document root directory that looks like this:
User-agent: *
Disallow: /test/
For more detailed information about excluding web pages from being indexed, please see the
Standard for Robot Exclusion.
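Before publishing a robots.txt file, it can be useful to confirm that the rules exclude what you expect. One way to do this, sketched here with Python's standard urllib.robotparser module, is to parse the two example rules from a string and test a few URLs against them. The host www.example.edu and the page names are assumptions made for illustration, and a given crawler may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example: every robot is asked to
# stay out of the /test/ directory.
rules = """\
User-agent: *
Disallow: /test/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical URLs: one inside the excluded directory, one outside it.
draft_ok = parser.can_fetch("*", "http://www.example.edu/test/draft.html")
home_ok = parser.can_fetch("*", "http://www.example.edu/index.html")

print("draft page may be fetched:", draft_ok)
print("home page may be fetched:", home_ok)
```

Here can_fetch should report that the page under /test/ is off limits while the home page remains fetchable.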