|May 1996 Volume 12:6
By Patricia Renfro and William Garrity
Today the best approach to working with the Web's amorphous and chaotic mass of information is to mine it with a good Web search tool. Usually these are search engines that query large, descriptive indexes of Web pages. Each search engine typically searches its own index of page descriptions, not the Web itself. Some engines add thousands of entries a day to their databases.
Although most search tools encourage you to register your site with them, these vast indexes are primarily built by robots. Robots, also called bots, spiders, wanderers, worms, and crawlers, are programs that live on the search services' host computers. Despite the names, spiders and such don't actually traverse the Web. They stay on a host computer and retrieve information using the usual Web protocols such as http (hypertext transfer protocol). The information retrieved is then used to build the indexes. Effective search engines revisit sites periodically to update their databases.
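The robot-and-index idea can be sketched in a few lines: a program starts from a seed page, follows the links it finds, and records which words appear on which pages. In this minimal sketch the miniature "Web" is a hard-coded dictionary standing in for real http retrieval; the URLs and page text are invented for illustration.

```python
import re
from collections import deque

# Invented stand-in for the Web: URL -> page text. A real robot would
# fetch each page over http instead of reading from this dictionary.
PAGES = {
    "http://example.edu/": "Welcome to the library. See <a href='http://example.edu/hours'>hours</a>.",
    "http://example.edu/hours": "Library hours: open daily.",
}

LINK_RE = re.compile(r"href='([^']+)'")
WORD_RE = re.compile(r"[a-z]+")

def crawl(seed):
    """Visit pages breadth-first, building an inverted index: word -> set of URLs."""
    index = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text = PAGES.get(url, "")           # stand-in for an http GET
        for word in WORD_RE.findall(text.lower()):
            index.setdefault(word, set()).add(url)
        for link in LINK_RE.findall(text):  # queue links to unvisited pages
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl("http://example.edu/")
```

A query against the resulting index is then just a dictionary lookup, which is why the engines answer so quickly: the slow retrieval work was done by the robot in advance.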
Meta-indexes

So - what search engine should you use to find your way through the Net? One good option is to try a meta-index such as MetaCrawler. Simultaneously querying eight search engines - Open Text, Lycos, WebCrawler, InfoSeek, Excite, Inktomi, Alta Vista, and Galaxy - MetaCrawler sends back an integrated set of results. Search options include phrase, all words (Boolean AND), or any words (Boolean OR), and searches can be limited by region (country, domain) and by site type, for example, .edu or .com. No single search engine can give comprehensive coverage of Internet sites, nor are all engines equally effective for all types of searches. With a meta-index you don't need to learn a variety of interfaces or know the relative strengths of each index, and you're closer to getting a comprehensive search.
Savvy Search is a new meta-index that reduces network traffic and claims to be sensitive to server loads and types of searches. This experimental system constructs a search plan for each query by ranking and grouping 19 search engines and directing the search to run simultaneously on the engines in the top-ranked group. The searcher can opt to continue the search through the next, second-ranked group of indexes. We found that Savvy Search's top choices produced variable results - at times it was hard to believe that the best engines really had been selected.
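The meta-index idea - send one query to several engines at once and merge what comes back - can be sketched as follows. The two "engines" here are invented stand-ins for real search services; a real meta-index would query each service over the network in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented stand-ins for individual search engines: each maps a query
# to a ranked list of result URLs.
def engine_a(query):
    return ["http://a.example/1", "http://shared.example/"]

def engine_b(query):
    return ["http://shared.example/", "http://b.example/2"]

def metasearch(query, engines):
    """Query every engine concurrently, then merge results, dropping duplicates."""
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:      # the same page often comes back from
                seen.add(url)        # more than one engine; keep one copy
                merged.append(url)
    return merged

hits = metasearch("maastricht treaty", [engine_a, engine_b])
```

The integration step is what makes a meta-index more than a convenience: duplicates are collapsed, so the searcher sees one combined list rather than eight overlapping ones.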
But - at some point you'll want to get to know some specific indexes. Here are a few things to consider when choosing a Web search engine:
WebCrawler

Bigger isn't necessarily better! If you're in a hurry and you're looking for a known item, e.g., the Daily Pennsylvanian home page, the text of the Communications Decency Act, or Penn's notable African Studies page, you may want to try a non-comprehensive, economical search engine. Fast and simple, WebCrawler was one of the first search engines on the Web.
Originally developed at the University of Washington, it's been maintained by America Online since June 1995. The easy-to-use interface presents a limited number of options: all or any words, and a choice of 10, 25, or 100 results (with an option to continue with more). WebCrawler searches the full text of all pages and presents very brief results - the title of the Web page only. A useful relevance number at the left of the result line indicates the relative relevance ranking of each result. WebCrawler doesn't waste your time retrieving all the sub-pages for a site - it generally takes you straight to the top level if that's what you're after.
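A relevance number like WebCrawler's could be computed many ways; one minimal sketch, assuming nothing about WebCrawler's actual algorithm, is to count how often the query words occur in each page and scale every score against the best match. The page titles and text below are invented.

```python
def relevance(query, pages):
    """Score each page 0-100 by query-word frequency, relative to the best match."""
    words = query.lower().split()
    scores = {}
    for title, text in pages.items():
        body = text.lower().split()
        scores[title] = sum(body.count(w) for w in words)
    best = max(scores.values()) or 1   # avoid dividing by zero when nothing matches
    return {title: round(100 * s / best) for title, s in scores.items()}

pages = {
    "African Studies at Penn": "african studies resources african languages",
    "Campus News": "news about campus events",
}
ranks = relevance("african studies", pages)
```

Real engines weigh much more than raw frequency (word position, rarity, and so on), but the idea is the same: the number tells you how each result compares to the best one found, not how good it is in any absolute sense.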
Magellan

You can waste a lot of time retrieving pages only to find that they weren't worth the wait! Search engines that provide useful descriptions can save you time, so a product like The McKinley Group's Magellan is welcome. The Magellan result sets come with text written by real people! Readable and grammatical, they're a nice change from the more commonly found computer-generated text. For example, a Magellan search for the text of the Maastricht Treaty presents the official European Union page as the top-ranked document with a useful accompanying blurb: "this site provides visitors with the full text of the Maastricht Treaty, the treaty which is the backbone of the European Union..." Good to know this isn't a personal home page with a glancing mention of the Treaty.
Magellan bears watching. It's pretty new, ambitious, and different. With a highly informed editorial staff, it rates 40,000 of its 2 million indexed sites on a four-star scale for depth, accuracy and timeliness, organization and ease of access, and net appeal. Magellan itself gets a low ranking from us on timeliness - at press time it still listed the Penn Library Gopher as alive and well even though we killed it over a year ago!
Lycos and Alta Vista

But if you're wondering whether your best friend from first grade has a personal home page yet, or you need to find the text of an obscure poem or a quote from Oscar Wilde, you'll want to try one of the search engines that doesn't even consider selectivity useful! For comprehensiveness it's hard to beat Lycos, the engine that claims unabashedly to be the biggest catalog on the planet. Lycos' most recent count (February/March 1996) indicates that it indexes 11.6 million Web pages (adding at a rate of 50,000 a day!).
Fighting it out for biggest and best is Digital Equipment's Alta Vista, which also claims to give access to all 8 billion words found in over 16 million Web pages! Alta Vista is worth knowing about because it indexes both the Web and Usenet newsgroups - you choose which you want to search up front. But watch out for systems like this. If you're looking for the DP home page, the Alta Vista simple search won't do it - it finds 9,000 matches. The advanced search works fine - but you need to read the instructions before you use it.
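The 9,000-match problem comes down to Boolean semantics: a simple "any words" search accepts a page containing any one of the query words, while "all words" and phrase matching are far more selective. A minimal sketch of the three behaviors, with an invented page of text:

```python
def any_words(query, text):
    """Boolean OR: a page matches if it contains at least one query word."""
    return any(w in text.lower().split() for w in query.lower().split())

def all_words(query, text):
    """Boolean AND: a page matches only if it contains every query word."""
    return all(w in text.lower().split() for w in query.lower().split())

def phrase(query, text):
    """Phrase search: the query words must appear together, in order."""
    return query.lower() in text.lower()

page = "the daily news covers campus events"
# "daily pennsylvanian" matches this page under OR (it contains "daily")
# but not under AND, and not as a phrase.
```

Under OR semantics every page mentioning just "daily" gets swept in, which is exactly how a two-word query balloons into thousands of hits.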
What's ahead

Unless there are some significant changes, chances are that it'll be increasingly difficult to find exactly what you need on the Web. Sun Microsystems' Java programming language for the Web is generating all kinds of commotion - could Java applets make it easier to find information on the Web? While it's still much too early to tell, it's certainly conceivable that Java programs could do automatic, unattended database-oriented searches across the Web and return the results to you. Clearly systems like Alta Vista, with the power to search on a URL, title, links, specific newsgroups, etc., will be increasingly valuable - but the effectiveness of Web search tools will still be limited by the format of the data they're indexing.
The downside of the "everyone a publisher" environment of the Web is the lack of standards. In contrast to the highly controlled library model of MARC standards (machine readable cataloging), which have resulted in the consistency and effectiveness of online library catalogs, Web document creators follow few rules. The National Center for Supercomputing Applications (NCSA) and the Online Computer Library Center (OCLC) are sponsoring promising efforts to identify a set of metadata elements (The Dublin Core) that could become the required standards that Web page creators would use to describe their documents (subject, author, title, date, type of file, genre of work, etc.). Whether the free-wheeling democracy of the Web will respond to such an initiative is an open question, but it's possible that serious information providers would conform. Meantime, stay tuned! You can be sure that none of this will stay the same!
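If metadata elements like the Dublin Core were embedded in pages, an indexer could read a document's self-description instead of guessing from its full text. The sketch below assumes a hypothetical encoding of such elements as HTML meta tags with invented "DC." names - an illustration of the idea, not the workshop's actual syntax - and shows how simply a robot could harvest them.

```python
from html.parser import HTMLParser

# Hypothetical page whose author describes it with Dublin Core-style
# meta tags; the element names and values here are invented.
PAGE = """
<html><head>
<meta name="DC.title" content="Maastricht Treaty">
<meta name="DC.creator" content="European Union">
<meta name="DC.date" content="1992">
</head><body>...</body></html>
"""

class MetaReader(HTMLParser):
    """Collect Dublin Core-style elements into a record an indexer could use."""
    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").startswith("DC."):
            # strip the "DC." prefix: "DC.title" -> "title"
            self.record[a["name"][3:]] = a.get("content", "")

reader = MetaReader()
reader.feed(PAGE)
```

A catalog built from author-supplied fields like these would let a searcher ask for documents whose *title* is "Maastricht Treaty" - the kind of precision MARC records give library catalogs today.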
PATRICIA RENFRO is Director of Public Services for the University of Pennsylvania Libraries; WILLIAM GARRITY is Associate Director for Information Services at the Biomedical Library.
References

Baer, William M.; Courtois, Martin; Stark, Marcella. "Cool Tools on the Web." Online (November/December 1995), pp. 14-32.
Kimmel, Stacey. "Bot-Generated Databases on the World Wide Web." DATABASE (February/March 1996), pp. 41-49.
Scoville, Richard. "Find It on the Net." PC World (January 1996), pp. 125-130.
Selberg, Erik; Etzioni, Oren. "Multi-Service Search and Comparison Using the MetaCrawler." Proceedings of the Fourth International World Wide Web Conference (December 1995).
Weibel, Stuart; Godby, Jean; Miller, Eric. OCLC/NCSA Metadata Workshop Report.