
Penn Web Search Technical Information
The Penn Web Search indexes the content on about 375 web servers across Penn's campus. Most of these are administered directly by schools and departments of the University. The central web server, www.upenn.edu, hosts web sites for University of Pennsylvania schools, departments, centers, and institutes that do not have access to a web server within their own school or department. Information about housing a web site on www.upenn.edu is available.
The www.upenn.edu service is hosted on a pair of fully redundant
Sun 40Z systems running Linux. Each has four AMD Opteron 2.6 GHz processors and 16 gigabytes of memory, and is connected to a Sun StorEdge 9970 Storage Area Network. The configuration is designed to survive the failure of any component, including the loss of an entire machine room, power grid, local network, or Internet connection.
This highly redundant, survivable configuration is hosted in two data
centers approximately five blocks apart. Each location houses one of the
hosts and one side of the fully mirrored storage array, is served by
separate power grids, and includes fully redundant power and HVAC systems.
Data is replicated in real time between the two locations.
In addition to the redundant hardware, we use Akamai caching, so pages that Akamai has already cached can still be served if both servers fail.
Apache 1.3.x is the web server daemon.
You can search Penn's web using
Google's index of the Penn Web.
We have provided information to help you to maintain
your documents so that they are better indexed by Penn Web Search and other search
engines.
If there are documents on your site that should not be indexed, your web administrator will need to maintain
a robots.txt file that tells the central indexer to skip these documents.
Example of excluding with a robots.txt file
If you had a directory called "test" where you stored pages under development and you didn't want these documents indexed,
your server administrator would create a file called robots.txt in the document root directory that looks like this:
User-agent: *
Disallow: /test/
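Before publishing a robots.txt file, it can be useful to confirm that the rules exclude what you expect. The following is a minimal sketch, not part of the Penn Web Search service, that uses Python's standard urllib.robotparser module to test the two-line example above. The helper name is_allowed, the crawler name "ExampleBot", and the sample paths are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this page: block every crawler from /test/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /test/
"""


def is_allowed(path, user_agent="ExampleBot"):
    """Return True if the rules above let user_agent fetch path.

    "ExampleBot" is a hypothetical crawler name; because the rules
    use "User-agent: *", any name gives the same answer.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths: one inside the excluded directory, two outside.
    for path in ("/test/draft.html", "/index.html", "/testing/page.html"):
        status = "allowed" if is_allowed(path) else "excluded"
        print(path, status)
```

Note that the trailing slash in "Disallow: /test/" matters: the rule matches by path prefix, so it excludes everything under the test directory but not a sibling directory such as /testing/.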
For more detailed information about excluding web pages from being indexed, please see the
Standard for Robot Exclusion.