dacs.doc electric

You Can Find it on the Web!

Stealthy search engines save your sanity

by April Miller

 

THE NET may be the coolest thing since Pop Rocks, but it's also completely confusing: In a world where the Charles Schwab mutual fund database is just a mouse click or two away from the Church of Satan sign-up sheet, it's all too easy to lose your bearings. You know there's valuable information out there. The question is how do you find it?

The problem with searching for information on the Internet is twofold. First, there's too much of it. You've got the World Wide Web (the interconnected system of sites and their pages that you access with a browser like Mosaic, Netscape, or Internet Explorer), Gopher and FTP sites (for finding and downloading files), and zillions of subject-specific newsgroups and mailing lists. Second, all this stuff is utterly unorganized: There's no centralized catalog of its resources, no single place you can go to find what you need.

On the Internet, fortunately, where there's a need, there's an eager programmer (or, increasingly, an entrepreneur) waiting to fill it. And some very bright folks have set up World Wide Web sites that help you zero in on the information you need. I took a critical look at their offerings to find out which of these tools are best at helping you find what you're looking for quickly and easily.

Directory assistance

One type of page--a directory--is great if you're simply interested in a general topic--the Civil War, say, or online finance--and want to find some relevant spots on the Net. Directories contain lists of such sites, organized by topic. My favorite is Yahoo!, which lists some 80,000 Net sites (including Web pages, Gophers, FTP sites, and Usenet newsgroups), divided among 14 top-level categories: Arts, Computers, Health, Recreation, and so on. Click on one, and you get a list of subtopics. Keep drilling down until you find the stuff you want. In addition to general-interest directories like Yahoo!, subject-specific directories cover everything from Antiques to Youth Workers.

Needle in a haystack

While directories are helpful when you're trying to find out what the Web has to offer, they become less so as your questions become more specific. To find answers to such questions, you need a search engine. These are Web pages containing forms into which you type a text string you want to search for. Click a button, wait a bit, and the engine spits out a list of Web sites that match your search criteria. In a recent sweep of the Web, I found some 60 such pages, 10 of which I'd consider useful tools. The rest are either moribund or of interest only to computer-science grad students.

Behind every search engine stands a database, in which are collected the URLs (Uniform Resource Locators, or specially formatted Internet addresses) of Web pages and other Net resources. Most of these databases are created by crawlers (also known as "robots" or "spiders"), software programs that roam the Web looking for new sites by following links from page to page. When a spider finds a new page, it adds it to the database.
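To make the spider idea concrete, here is a minimal sketch in Python. It does not touch the network: the "Web" is a made-up dictionary (TOY_WEB, with invented example.com addresses) that maps each page to the links found on it. The crawl function is an illustration of link-following in general, not the code behind any engine mentioned here.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

def crawl(start_url, web):
    """Follow links breadth-first; return every URL reached, in discovery order."""
    database = []              # stands in for the engine's URL database
    seen = {start_url}         # pages already queued, so loops don't trap the spider
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        database.append(url)   # "When a spider finds a new page, it adds it"
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database
```

Calling crawl("http://example.com/", TOY_WEB) visits all four pages once each, even though page b links back to the start.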

These databases store from a few thousand to more than a million Web pages, with the leading engines adding new pages daily. Of the major general-interest engines, Lycos and Excite have the broadest coverage. These two databases each claim to have 1.5 million fully indexed Web pages. Open Text Index, which says it has 1.3 million, is a close follower.

The size of an engine's database has a big impact on the success of your search. For example, I queried each engine with the string "recipe wheat beer." (Don't ask. I was thirsty.) The massive Lycos database gave me 437 hits (matched pages) in return. InfoSeek and Open Text Index gave me around 200 each; others, fewer than 100. In several cases, I didn't get any hits at all. Generally, the smaller the database, the fewer hits I received.

Most engines restrict themselves to indexing the Web itself. InfoSeek and Excite go a few steps further than the rest by also indexing Usenet newsgroups. For a fee, InfoSeek will also let you search a bunch of handy non-Internet databases.

It's all in the index

Web spiders do more than just collect URLs. They also collect information about each page. The search engine's back-end software uses this information to create an index, which is what you're actually searching when you submit a query. Not surprisingly, indexing techniques vary from engine to engine.

Every engine indexes a page's URL and title. Most engines also index the headers that start each section. Others record the most frequently mentioned words or the first few lines of text. Open Text Index actually indexes every word on the page--including words like "the" and "that," which other engines ignore. As a result, it was the only engine that returned any hits on the query "to be or not to be." Granted, some of these were odd matches--the first hit was a Welsh language primer--but hit #10 (http://www.hamlet.edmonton.ab.ca/scenetxt.htm) was indeed the text of the play. Excite's concept-based indexing can find relevant pages even if they don't contain your specific keywords.
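The difference between indexing every word and skipping the common ones can be sketched in a few lines of Python. Everything here is assumed for illustration: the STOP_WORDS list, the two sample pages, and the function names are invented, and no real engine's index is this simple.

```python
import re

# Assumed list of common words that a stop-word engine would ignore.
STOP_WORDS = {"the", "that", "to", "be", "or", "not", "a", "of", "and", "is"}

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages, skip_stop_words):
    """Map each word to the set of pages that contain it."""
    index = {}
    for url, text in pages.items():
        for word in tokenize(text):
            if skip_stop_words and word in STOP_WORDS:
                continue
            index.setdefault(word, set()).add(url)
    return index

def search_all(index, query, skip_stop_words):
    """Return the pages that contain every (non-ignored) query word."""
    terms = [w for w in tokenize(query)
             if not (skip_stop_words and w in STOP_WORDS)]
    if not terms:
        return set()           # nothing left to search for
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical two-page "database".
PAGES = {
    "hamlet.txt": "To be, or not to be, that is the question",
    "beer.txt": "A recipe for wheat beer",
}
```

With the full-text index, "to be or not to be" finds hamlet.txt. With stop words skipped, every word in that query is thrown away and the search comes back empty, which is the behavior described above.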

While the size of the database determines the number of hits it delivers, the quality of the indexing is a major factor in determining how many of those hits are relevant to your search. For example, I ran the query "real estate North Carolina Triangle" through each search engine, then counted the number of hits that actually had something to do with finding real estate in the Chapel Hill area. WebCrawler returned 19 hits, compared with the more than 200 hits I got from InfoSeek. But nine of those 19 were exactly on target. Most of InfoSeek's hits related to real estate, but many had nothing to do with North Carolina.

The right tools for the job

No matter how big the database, or how sophisticated the indexing, a search engine is only as good as the query you give it.

Sometimes it's just a matter of phrasing. For example, the query "homebrew wheat beer" wasn't nearly as successful at finding beer recipes as "recipe wheat beer." Not all engines treat your phrases the same way. InfoSeek "stems" words, seeking matches with parts of the whole: Ask for impressionism, for example, and you'll also get matches for impression. Lycos, on the other hand, treats your search term as a stem--so the word metal matches metallic.
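The two matching styles can be sketched in Python. This is only a rough illustration: the suffix list is a tiny made-up one, and neither function is claimed to be how InfoSeek or Lycos actually does it.

```python
# Assumed, deliberately tiny suffix list; real stemmers use far more rules.
SUFFIXES = ("ism", "ist", "ing", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_match(query, word):
    """Stemming style: match when both words reduce to the same stem."""
    return crude_stem(query.lower()) == crude_stem(word.lower())

def prefix_match(query, word):
    """Query-as-stem style: match any word that begins with the query."""
    return word.lower().startswith(query.lower())
```

Here stem_match("impressionism", "impression") is true because both reduce to "impression," and prefix_match("metal", "metallic") is true because "metallic" starts with "metal." Swap the functions and neither pair matches, which is why the same query can behave so differently from one engine to the next.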

Several engines let you search for whole phrases. Instead of just searching for the individual words in your query string, they look for occurrences of them together. Some, like Aliweb, let you use wild cards (* and ?) to find variations on a phrase.

In other cases, it's a question of using the available tools. Some engines let you refine your queries with special operators. At the most basic level, this means you can, as with Lycos, search for sites that contain either any or all of your search terms. Others let you use more formal Boolean terms (AND, OR, and sometimes NOT). InfoSeek and Open Text Index are the only engines that give you proximal operators, which let you search for terms that appear near or next to each other.
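Boolean operators map neatly onto set operations, as this Python sketch shows. The INDEX dictionary and its page names are invented for the example; the point is only to show what AND, OR, and NOT do to a list of hits.

```python
# Hypothetical index: each word maps to the set of pages containing it.
INDEX = {
    "metal": {"steel.html", "band.html", "alloys.html"},
    "heavy": {"band.html", "trucks.html"},
    "music": {"band.html", "opera.html"},
}

def lookup(term):
    """Pages containing one term (empty set if the term is unknown)."""
    return set(INDEX.get(term, set()))

def match_all(terms):
    """AND: pages containing every term."""
    if not terms:
        return set()
    hits = lookup(terms[0])
    for term in terms[1:]:
        hits &= lookup(term)
    return hits

def match_any(terms):
    """OR: pages containing at least one term."""
    hits = set()
    for term in terms:
        hits |= lookup(term)
    return hits

def exclude(hits, terms):
    """NOT: drop pages containing any of the given terms."""
    return hits - match_any(terms)
```

For instance, exclude(lookup("metal"), ["heavy", "music"]) keeps steel.html and alloys.html and drops band.html.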

Using all the tools available can increase the quality of hits dramatically. For example, when I ran that "recipe wheat beer" query past Open Text Index's "simple" search page, I got 90 hits, few of which had anything to do with brewing wheat beer (most were concerned with drinking it). But when I switched to the "power" page and ran the query "recipe (near) wheat (followed by) beer," I got six hits, three of which were exactly on target.
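A proximity test needs to know where each word sits on the page, not just whether it appears. The Python sketch below illustrates the general idea. The five-word window for "near" is my assumption, since I don't know what distance Open Text Index actually uses, and the function names are invented.

```python
import re

def words_of(text):
    """Lowercase the text and split it into a list of words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(words, term):
    """Every position at which the term occurs."""
    return [i for i, w in enumerate(words) if w == term]

def near(words, a, b, window=5):
    """True if a and b occur within `window` words of each other (assumed size)."""
    return any(abs(i - j) <= window
               for i in positions(words, a)
               for j in positions(words, b))

def followed_by(words, a, b):
    """True if b appears immediately after a somewhere in the text."""
    return any(j == i + 1
               for i in positions(words, a)
               for j in positions(words, b))

brewing = words_of("A simple recipe for brewing wheat beer at home")
drinking = words_of("I drank a wheat beer and later lost the recipe")
```

Both sample sentences contain all three words, so a plain keyword search would return both. Only the first passes near(brewing, "recipe", "wheat") together with followed_by(brewing, "wheat", "beer"); in the second, "recipe" sits six words away from "wheat."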

Wheat and Chaff

Your search has only begun when you receive your list of hits: You still have to sort through all those sites to find the ones you really want. Most engines help by showing you, at the top of the results page, the words they actually searched for. You might have asked for "The Good, the Bad, and the Ugly," but the search engine will tell you it actually looked for "Good, Bad, and Ugly." Remember, you can tell many search engines to look for whole phrases instead of just keywords.

Most engines return hits in order of relevance. That way, even if you get 200-plus hits, you don't have to worry about wading through all 200 of them--the top ten will probably do. Different search engines use different methods to calculate relevance. InfoSeek ranks hits according to how frequently your search terms appear in the page relative to their frequency in the entire database. Lycos ranks them based on the number of terms found on the page, their proximity to one another, and their position on the page.
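Here is one way a frequency-based score of the kind described for InfoSeek could be computed, sketched in Python. The exact formula is my guess for illustration only (how often a term appears on the page, divided by how often it appears in the whole database), and the three sample pages are invented.

```python
from collections import Counter

def rank(pages, query_terms):
    """Order pages by how unusually often they use the query terms."""
    page_counts = {url: Counter(text.lower().split())
                   for url, text in pages.items()}
    db_counts = Counter()
    for counts in page_counts.values():
        db_counts.update(counts)
    db_total = sum(db_counts.values())

    scores = {}
    for url, counts in page_counts.items():
        page_total = sum(counts.values())
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue                              # term not on this page
            page_rate = counts[term] / page_total     # frequency on the page
            db_rate = db_counts[term] / db_total      # frequency in the database
            score += page_rate / db_rate
        if score > 0:
            scores[url] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical three-page database.
PAGES = {
    "brew.html": "wheat beer recipe wheat beer",
    "pub.html": "beer beer beer list of pubs",
    "bread.html": "wheat bread recipe flour water salt",
}
```

Ranking these pages for recipe, wheat, and beer puts brew.html first. Notice that pub.html lands last even though it says "beer" more often than any other page: a single repeated word counts for less than a page that covers all three terms.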

Most engines also give you some kind of description of the hits. Lycos does this best, giving you a relevance rating, a page description, and a brief abstract of its text. Read the abstract, and you'll have a good idea whether that hit is what you're looking for.

Metasearching

You don't feel like hopping from one search engine to another to find what you want? Then you should check out a metasearch site. These are pages from which you can use several search engines to launch queries.

Two of these pages (Savvy Search at http://www.cs.colostate.edu/~dreiling/smartform.html and MetaCrawler at http://www.metacrawler.com/) launch your query to several engines at the same time (including most of the engines I looked at individually). Savvy Search also covers ArchiePlex (for searching FTP sites) and DejaNews (for searching newsgroups). The only problem with these parallel searchers is that you don't get full access to each engine's query tools--the Boolean and proximity operators, for example--so your searches will be less accurate than if you used the real thing.
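The parallel approach can be sketched in Python with two stand-in engines. The functions engine_a and engine_b and their example.com results are invented; a real metasearch page would send the query over the network and parse each engine's results page, which this sketch does not attempt.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical engines: each takes a query string and returns a list of URLs.
def engine_a(query):
    return ["http://example.com/brew", "http://example.com/ale"]

def engine_b(query):
    return ["http://example.com/ale", "http://example.com/hops"]

def metasearch(query, engines):
    """Send one query to every engine at once and merge the hits."""
    if not engines:
        return []
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # pool.map returns results in the same order as the engines list.
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    merged, seen = [], set()
    for hits in result_lists:
        for url in hits:
            if url not in seen:        # drop duplicates across engines
                seen.add(url)
                merged.append(url)
    return merged
```

Because the only thing passed to each engine is a plain query string, there is no room for engine-specific operators, which is the trade-off described above.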

Other metasearch sites let you search the major engines one at a time. You fill in the form for the engine you want to use and send it off. Again, you lose some query tools, but these pages can be handy to keep on file for quick queries.

Still surfin' after all these queries

So, which of these tools should you add to your list of favorites? I'd pick three search engines: Excite, InfoSeek, and Lycos. All three give accurate results from easy-to-use interfaces. I'd also add one directory: Yahoo! It's the most complete directory I found, and it makes an excellent default home page.

Fortunately, these tools are constantly evolving. InfoSeek has announced plans to incorporate a directory into its search site. And Open Text Index has announced plans to team up with Yahoo! to form what could be a formidable combination.

But no matter which of them you end up using, these tools make the Web more than a playground for propeller heads. They make it a place where serious people can do real work--which is exactly what I'm going to do as soon as I check out that recipe for Peach Wheat Ale and visit a couple of houses near Raleigh.

Search tips

A search engine's database is simply an index of words and phrases associated with URLs. Your job is to come up with words that match this index. Here are a few general rules of thumb that will maximize your success.

Read the instructions: Most search engines provide their own set of operators, delimiters, and rules to help you search efficiently. Use them!

Choose the unusual word: The more distinctive a word, the more useful it will be for sharpening your search. For instance, you'll get a more targeted search with "cercopithecus aethiops" than with "African green monkey." And try to pick words that really define your idea: That home-brew query didn't really fizz until I thought to include the word "recipe."

Watch your spelling: If your search query asks for "astronut," you'll get Web pages for the orthographically challenged. By the same token, remember to search for legitimate variations: If you're looking for fly-fishing, try "flyfishing" and "fly fishing" as well.

Think about synonyms: Remember that you're probably searching for a concept, not just a word. If you're looking for backpacking sites, include the terms hiking, trekking, backpacking, and camping in your query.

Forget natural language: Some sites support natural language queries, which let you ask questions the way you would in conversation. Don't. Instead, focus on the key terms and phrases that identify your concept, then enter them as a list.

Repeat yourself: After the first pass, go to some of the most promising-looking hits, and jot down other terms that you can use to sharpen or widen your search.

Don't forget about NOT: Some search engines support the NOT operator, which lets you exclude terms. Thus, with a search like "metal NOT heavy NOT music," you can hit sites dealing with industrial metals and avoid those devoted to heavy metal bands.

Use more than one search engine: I found surprisingly little overlap in the results from a single query performed on several different search engines. So to make sure that you've got the best results, be sure to try your search with numerous sites.

Try specialized sites: If you're looking for government-sponsored Web sites, check out Infomine at http://lib-www.ucr.edu/Main.html. If you want to search Usenet newsgroups, you can use InfoSeek or DejaNews. Finally, if you're looking for downloadable files on the Internet, ArchiePlex at http://flosun.salk.edu/archieplex.html should be able to help.
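If you'd rather not type every spelling variation by hand, a few lines of Python can generate them. This is a small sketch of the "Watch your spelling" tip; the function name is invented, and it handles only the hyphenated, closed-up, and spaced forms.

```python
def spelling_variants(term):
    """Return hyphenated, closed-up, and spaced forms of a compound term."""
    parts = term.replace("-", " ").split()
    if len(parts) < 2:
        return [term]                  # single word: nothing to vary
    return ["-".join(parts), "".join(parts), " ".join(parts)]
```

Feed it "fly-fishing" and you get back "fly-fishing," "flyfishing," and "fly fishing," ready to paste into a search form one at a time.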


April Miller, currently director of PC Applications at the Computer Education Institute in Chesterton, Indiana, has been teaching computer topics since 1985. April welcomes feedback to her articles at userfriendly@niia.net.

