Penn Computing

Excluding Directories from AltaVista Search

This document provides instructions on how to create a robots.txt file to prevent AltaVista from indexing a directory on your web site. Presently, AltaVista does not support the META ROBOTS tag, but is expected to do so in the future. Note: Web Administrators who want to exclude robots from indexing information on their Web servers should consult the Search Announcement for Web Administrators for instructions.

Search engines gather information on the Web using programs called Robots. Robots (also called Web Crawlers, Spiders, Worms, Web Wanderers, and Scooters) automatically gather and index information from around the Web, and then put that information into databases. (Note that a Robot will index a page, and then follow the links on that page as a source for new URLs to index.) Users can then construct queries to search these databases to find the information they want. You can prevent any directory from being indexed by creating a robots.txt file for that directory.
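The crawl behavior described above (index a page, then follow its links to find new URLs) can be sketched in a few lines of Python. This is only an illustration: the link table and page names are hypothetical, the breadth-first visiting order is an assumption, and a real Robot fetches pages over the network rather than reading them from a dictionary.

```python
from collections import deque


def crawl(links, start):
    """Return pages in the order a simple breadth-first robot would index them.

    `links` is a hypothetical in-memory table mapping each page to the
    pages it links to; it stands in for fetching and parsing real pages.
    """
    indexed = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        indexed.append(page)  # "index" the page
        for target in links.get(page, []):
            if target not in seen:  # follow each link only once
                seen.add(target)
                queue.append(target)
    return indexed
```

A page that no indexed page links to is never reached by this loop, which is why removing links to a page matters when you want it left out of an index.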

Creating a Robots.txt File

The Penn Web Team has made available to Web developers a method that prevents AltaVista from indexing directories. This method involves creating an empty robots.txt file and uploading it to the directory that you do not want indexed. Note that using an empty robots.txt file is a customized method used at Penn. (A *real* robots.txt file isn't empty. It contains the names of robots and the directory paths to be excluded from indexing. See Robots.txt Format Example for an example of how *real* robots.txt files are formatted.)
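To illustrate what a non-empty robots.txt file contains, the sketch below parses a two-line file with Python's standard urllib.robotparser module and asks whether a robot may fetch a URL. The file contents, the robot name "ExampleBot", and the paths are assumptions chosen for illustration; they are not taken from Penn's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical contents of a standard (non-empty) robots.txt file.
# "User-agent: *" applies the rules to every robot; each "Disallow"
# line names a path prefix that robots are asked not to index.
ROBOTS_TXT = """\
User-agent: *
Disallow: /books/sample/author/
"""


def is_allowed(robots_txt, user_agent, url_path):
    """Return True if the rules in robots_txt let user_agent fetch url_path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url_path)
```

With these rules, is_allowed(ROBOTS_TXT, "ExampleBot", "/books/sample/author/maxell.html") is False, while a path outside the excluded directory, such as "/books/sample/index.html", is allowed.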

To create a robots.txt file, do the following:

  1. Create an empty text file in a word processor or HTML editor. This is a blank file -- don't type anything in it.
  2. Name and save the file as robots.txt. The file name may be in uppercase or lowercase letters.
  3. Upload the robots.txt file to the directory that you do not want indexed using your FTP software.

    For example, suppose you want to prevent all the files in the "author" directory from being indexed. The path to this directory is books/sample/author. To prevent AltaVista from indexing the files or documents in the "author" directory, upload your robots.txt file to the "author" directory.

  4. All files in that directory will no longer be available for indexing.
  5. If you decide that you want the directory indexed, just remove the robots.txt file from that directory.
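If you have direct file access to the Web directory, the steps above can also be scripted. The following is a minimal sketch, assuming the directory is reachable as a local path; the function names are hypothetical, and the FTP upload in step 3 is not covered.

```python
from pathlib import Path


def exclude_directory(directory):
    """Create an empty robots.txt marker file in `directory` (steps 1-3)."""
    marker = Path(directory) / "robots.txt"
    marker.write_text("")  # the marker must stay blank
    return marker


def include_directory(directory):
    """Remove the marker so the directory can be indexed again (step 5)."""
    marker = Path(directory) / "robots.txt"
    if marker.exists():
        marker.unlink()
```

For the example above, exclude_directory("books/sample/author") would place the empty marker in the "author" directory, and include_directory("books/sample/author") would remove it.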

Excluding Individual Pages within Directories

If specific documents or files in a directory should not be indexed, do the following to prevent AltaVista from indexing them:

  1. Create another directory.
  2. Move the documents or files that you do not want indexed to the directory you just created.
  3. Remove from other documents any links to these pages.
  4. Create a robots.txt file for the new directory.

For example, the "author" directory contains the maxell.html, snacks.html, futures.html, and pepper.html files. The maxell.html file should not be indexed. Move the maxell.html file to another directory and create a robots.txt file for this directory.
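The move-and-mark procedure for a single file can be sketched as follows. This assumes local file access; the function name and the destination directory are hypothetical, and links to the moved page in other documents still have to be removed by hand.

```python
import shutil
from pathlib import Path


def move_to_excluded(file_path, excluded_dir):
    """Move one file into a separate directory marked with an empty robots.txt."""
    excluded = Path(excluded_dir)
    excluded.mkdir(parents=True, exist_ok=True)  # create the new directory
    (excluded / "robots.txt").write_text("")     # empty marker file
    destination = excluded / Path(file_path).name
    shutil.move(str(file_path), str(destination))
    return destination
```

For the example above, move_to_excluded("books/sample/author/maxell.html", "books/sample/author-private") would leave snacks.html, futures.html, and pepper.html in place; "author-private" is an illustrative name only.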

Administrative Housekeeping Services

A weekly housekeeping program will look for robots.txt files and add the directories that contain these files to the main robots.txt file.
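The housekeeping program itself is not described here, so the following is only a guess at its general shape, not Penn's actual implementation. It walks a Web root, finds directories that contain a robots.txt marker (in any letter case), and builds the text of a main robots.txt that excludes them. The function name, the "User-agent: *" line, and the output layout are all assumptions.

```python
import os


def build_main_robots(web_root):
    """Return main robots.txt text that disallows every marked directory."""
    disallowed = []
    for dirpath, dirnames, filenames in os.walk(web_root):
        if dirpath == web_root:
            continue  # the root holds the main robots.txt, not a marker
        if any(name.lower() == "robots.txt" for name in filenames):
            relative = os.path.relpath(dirpath, web_root).replace(os.sep, "/")
            disallowed.append("/" + relative + "/")
    lines = ["User-agent: *"]
    lines += ["Disallow: " + path for path in sorted(disallowed)]
    return "\n".join(lines) + "\n"
```

If only books/sample/author under the Web root contains a marker file, the result is a "User-agent: *" line followed by "Disallow: /books/sample/author/".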

Information Systems and Computing
University of Pennsylvania