Seek and Ye Shall Find...Some More:
Deep Searching

by John Mather

Copyright © 1996 John Mather. All rights reserved.

In the last issue of WWWiz, we looked at how to use the various search engines to perform fairly routine searches and, indeed, for 90% of your search requirements, the techniques outlined in that article are sufficient.

Sometimes, however, especially if you are performing a search of thousands of academic papers or, say, university libraries, more powerful techniques are required, if only to reduce search times to acceptable limits. In this issue, we look at Deep Searching.

The first key to deep searching is the use of Boolean Logic (no, don't go away quite yet!). George Boole was a mid-1800s English schoolteacher who, literally as a hobby, invented a new branch of mathematics in which what he called "concepts" could be expressed and solved in the same way as an algebraic equation. At the time, nobody paid much attention and it wasn't until computers came along that it was realized that Boole's discovery was really the key to the entire field of computer design—hardware and software.

While to really understand the work of Boole you need to study Set Theory and other arcane concepts, to use his techniques for Web searching, all you really need to know are the Boolean Operators, AND, OR and NOT.

When you perform a simple search using a search engine, as I described in the last issue, you will usually retrieve the information you are looking for, but often with some complications: either the search turns up too many references or, sometimes, too few. This is where Boole comes in.

If you retrieve too many documents, you need to narrow the search; the solution is probably to use the AND operator. The form in which this is expressed depends somewhat, as usual, on the search engine you are using, but is almost always in some variation of:

<Search term 1> AND <Search term 2>

This narrows the search because only documents that contain both terms will be retrieved. As an example, let's say you want to search for information on red wine. As previously described, go to your favorite search engine and enter "wine." Let's say the result retrieves several hundred (or thousand) documents. You can browse through them one at a time if you want, but in this situation it obviously would pay to narrow the search. Modify your search parameters to "wine AND red."

This immediately limits the search to only those documents which contain both "wine" and "red." This may still represent a large number of retrieved documents and you may want to further narrow the search by (for instance) searching for "wine AND red AND California" thus retrieving only documents about California red wines.
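
For readers who like to see the mechanics, here is a minimal sketch of what AND does, written in Python. It is not how any particular engine works internally; the three-document collection and the function names (docs_containing, search_and) are invented purely for illustration. Each term is treated as the set of documents containing it, and AND is the intersection of those sets.

```python
# Hypothetical three-document collection; the texts are invented for illustration.
DOCS = {
    1: "red wine from california vineyards",
    2: "white wine and cheese pairings",
    3: "red wine from france",
}

def docs_containing(term):
    """Return the ids of documents that contain `term` as a whole word."""
    return {doc_id for doc_id, text in DOCS.items()
            if term.lower() in text.split()}

def search_and(*terms):
    """AND: keep only the documents that contain every one of the terms."""
    result = set(DOCS)
    for term in terms:
        result &= docs_containing(term)  # set intersection narrows the result
    return result

wine = search_and("wine")
red_wine = search_and("wine", "red")
california_red = search_and("wine", "red", "California")
```

Each extra AND term can only shrink the result (or leave it unchanged), which is why it is the tool to reach for when a search returns too much.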

Alternatively, having performed a search, you may find that not enough documents were retrieved to meet your need for information. For instance, let's suppose you want to research the relationship between J.R.R. Tolkien and C.S. Lewis. Your first inclination is probably to use both names together with the AND operator to narrow the search to the desired subject. Unfortunately, this is likely to turn up only a few documents, most of which may give only a short mention of their relationship. To obtain information which really reveals the two men, you need to widen the search by substituting the OR operator for the AND operator. Searching for "Tolkien OR Lewis" retrieves all documents which contain either name, thus providing a great deal of supporting material as well as mentions of their collaboration.
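
The difference between the two operators can be sketched in a few lines of Python. This is only an illustration under invented data: the four documents and the helper name (matches) are assumptions of mine, not part of any real engine. AND corresponds to a set intersection and OR to a set union.

```python
# Hypothetical documents, invented for illustration.
DOCS = {
    "a": "Tolkien and Lewis met at Oxford",
    "b": "Tolkien wrote The Hobbit",
    "c": "Lewis wrote the Narnia books",
    "d": "A history of Oxford colleges",
}

def matches(term):
    """Ids of documents containing `term` as a whole word, ignoring case."""
    return {doc_id for doc_id, text in DOCS.items()
            if term.lower() in text.lower().split()}

both = matches("Tolkien") & matches("Lewis")    # AND: intersection
either = matches("Tolkien") | matches("Lewis")  # OR: union
```

In this toy collection, AND finds only the one document that mentions both men, while OR also brings back the documents about each man separately.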

As you widen your search, you may well be aware of subject matter which your search will retrieve, but which you know you don't want. An example might be research into angels where you know you don't need information about the California Angels baseball team. To exclude this information, use the NOT operator to eliminate documents with words which might indicate the baseball team. For a start, you might try "Angels NOT California." However, some documents might call them the "Anaheim Angels." To truly eliminate most false hits, you might end up with "Angels NOT California NOT Anaheim NOT baseball." Note that some engines call this operator BUT NOT, as in "Angels BUT NOT California BUT NOT Anaheim BUT NOT baseball."
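
In the same spirit, NOT can be pictured as set subtraction. The Python sketch below assumes a made-up collection of four documents and hypothetical function names (matches, search_not); real engines differ in syntax and in how they handle exclusion.

```python
# Hypothetical documents, invented for illustration.
DOCS = {
    1: "angels in medieval art",
    2: "california angels baseball schedule",
    3: "anaheim angels win at home",
    4: "guardian angels in folklore",
}

def matches(term):
    """Ids of documents containing `term` as a whole word, ignoring case."""
    return {doc_id for doc_id, text in DOCS.items()
            if term.lower() in text.lower().split()}

def search_not(wanted, *unwanted):
    """Start from documents matching `wanted`, then remove any that match an unwanted term."""
    result = matches(wanted)
    for term in unwanted:
        result -= matches(term)  # set difference removes the false hits
    return result

first_try = search_not("Angels", "California")
refined = search_not("Angels", "California", "Anaheim", "baseball")
```

The first attempt still lets the "Anaheim" document through; only the longer exclusion list removes both baseball pages.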

Some search engines have their own specialized syntax and some have additional operators. For example, several search engines have an operator NEAR, which allows you to specify that two terms must be close to each other (the default is usually 100 words) while some also use FOLLOWED BY, which indicates that the succeeding parameter must immediately follow the preceding one. These operators are called Proximity Operators to distinguish them from true Boolean Operators.
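
A rough Python sketch of the two proximity ideas follows. The function names (near, followed_by) and the word-counting approach are my own assumptions for illustration; each engine defines its own rules for what counts as "close." The default of 100 words simply mirrors the figure mentioned above.

```python
def near(text, term_a, term_b, max_distance=100):
    """True if term_a and term_b occur within max_distance words of each other."""
    words = text.lower().split()
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= max_distance
               for a in positions_a for b in positions_b)

def followed_by(text, first, second):
    """True if `second` appears immediately after `first` somewhere in the text."""
    words = text.lower().split()
    return any(words[i] == first.lower() and words[i + 1] == second.lower()
               for i in range(len(words) - 1))

sample = "red wine from california"
```

With this sample, near(sample, "red", "california", max_distance=3) is true because the words are three positions apart, while followed_by(sample, "wine", "red") is false because the order matters.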

Let's look at some other techniques which can affect deep searches. Some search engines, for instance, OpenText, allow you to specify whether you want to search URLs (the Web address, such as http://wwwiz.com), titles, main headings or the entire document. To narrow a search to a usable number of documents, you might want to restrict a search to titles or headings on the grounds that this will limit the search to documents which include your search parameter as a main topic.
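
To see why restricting the field narrows a search, consider this small Python sketch. The two pages, the example.com addresses and the function name (search_field) are all invented for illustration, and the simple substring test is an assumption rather than a description of OpenText or any other engine.

```python
# Hypothetical pages, invented for illustration.
PAGES = [
    {"url": "http://example.com/wine", "title": "Red Wine Guide",
     "body": "notes on tasting"},
    {"url": "http://example.com/food", "title": "Cheese Plates",
     "body": "serve with red wine"},
]

def search_field(term, field=None):
    """Return URLs of pages where `term` appears in `field`.

    If `field` is None, the URL, title and body are all searched.
    """
    term = term.lower()
    fields = [field] if field else ["url", "title", "body"]
    return [page["url"] for page in PAGES
            if any(term in page[f].lower() for f in fields)]

anywhere = search_field("wine")
titles_only = search_field("wine", "title")
```

Searching everywhere finds both pages, but the title-only search keeps just the page whose main topic is wine.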

Another technique is truncation, although this one can backfire on you. Several engines allow you to type a partial word, so that (for instance) "advert" finds "advert," "advertisement" and "advertising." While this can be incredibly useful, it also has a drawback: an engine may not know whether you are truncating, and may therefore include "carrots" in your search for information on automobiles using the search parameter "car." For those engines which assume truncation, you may wish to turn truncation off, when typing a full word, by ending the word with a space. Other engines won't allow truncation, so typing a partial word will retrieve nothing (unless the partial word itself appears in the document somehow).
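
The two behaviors can be compared in a short Python sketch. The word list and the function name (search) are hypothetical, and the truncate switch merely stands in for whatever mechanism a given engine offers, such as the trailing space mentioned above.

```python
# Hypothetical index of words, invented for illustration.
WORDS = ["advert", "advertisement", "advertising", "car", "carrots", "cart"]

def search(term, truncate=True):
    """With truncation, match any word that starts with `term`; otherwise match exactly."""
    if truncate:
        return [w for w in WORDS if w.startswith(term)]
    return [w for w in WORDS if w == term]

adverts = search("advert")
cars_truncated = search("car")
cars_exact = search("car", truncate=False)
```

Truncation is helpful for "advert" but drags "carrots" and "cart" into the search for "car"; with truncation off, only the exact word is found, and a partial word such as "adver" finds nothing.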

Finally, some engines allow you to "weight" certain parameters as being more important than others.
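
Engines do not generally publish exactly how weighting is applied, so the following Python sketch shows only one plausible scheme, which I am assuming for illustration: each occurrence of a term adds that term's weight to the document's score, and documents are listed from highest score to lowest. The documents, weights and function names (score, rank) are invented.

```python
# Hypothetical documents, invented for illustration.
DOCS = {
    "x": "wine wine wine cellar",
    "y": "california red wine",
}

def score(text, weights):
    """Sum of (weight x number of occurrences) over all weighted terms."""
    words = text.lower().split()
    return sum(weight * words.count(term) for term, weight in weights.items())

def rank(weights):
    """Document ids ordered from highest score to lowest."""
    return sorted(DOCS, key=lambda doc_id: score(DOCS[doc_id], weights),
                  reverse=True)

equal_weights = rank({"wine": 1, "california": 1})
california_first = rank({"wine": 1, "california": 5})
```

With equal weights, the document that repeats "wine" comes first; raising the weight on "california" moves the California document to the top.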

As I'm sure you are beginning to suspect, each engine has its own methods of searching and may well retrieve substantially different documents based on the same search parameters. As you become more proficient in deep searching you will probably develop preferences for one or more engines. Nevertheless, read the guides (often reached through links with names like "Search Tips") which most engines provide, in order to understand how each engine will treat your search. A number of engines will actually pass your search on to other engines if you are not happy with the results obtained so far. Others will allow you to refine your search by starting with your current retrieval list and entering additional search terms; OpenText, for example, offers a box after your search, labeled Improve Your Search. Note also that some engines actually have two different search pages for Simple and Deep Searches while others provide separate buttons, but only one parameter entry field.

The key to Deep Searching is familiarity. If you are not a "Power Searcher" you can probably retrieve almost everything you need using simple searches. If getting detailed information is important to you, spend a little time exploring the main search engines and finding their strengths and weaknesses. Then use the different facilities to find the information you need.

If your goal is a simple search over as many engines as possible, you can do this, too. A number of companies are beginning to offer facilities where your search parameters will be submitted, automatically, to several engines from a single input. An example is Supersearch, which submits your parameters to Lycos, InfoSeek, WebCrawler, Yahoo, Alta Vista, DejaNews, Excite, OpenText and Inktomi, and displays the results in separate frames on a single Web page. The difference in the retrieved documents is quite interesting in its own right but, certainly, if there is data to retrieve, this is likely to find you something to investigate.
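
The idea behind such a service can be sketched in Python as follows. This is not how Supersearch itself is built; the two "engines" are hypothetical stand-in functions that return canned example.com addresses, since I am not assuming anything about how the real engines are queried.

```python
# Stand-in engines: each takes a query and returns a list of addresses.
# Both are hypothetical and return fixed results for illustration only.
def engine_one(query):
    return ["http://example.com/a", "http://example.com/b"]

def engine_two(query):
    return ["http://example.com/b", "http://example.com/c"]

ENGINES = {"engine_one": engine_one, "engine_two": engine_two}

def metasearch(query):
    """Send one query to every engine and keep each engine's results separate."""
    return {name: engine(query) for name, engine in ENGINES.items()}

results = metasearch("red wine")

# Addresses that only some engines returned show how the engines differ.
all_hits = set().union(*results.values())
common_hits = set.intersection(*(set(hits) for hits in results.values()))
```

Keeping the results grouped by engine mirrors the separate frames described above, and comparing the groups shows at a glance which documents only one engine found.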

In the next issue, we'll look at retrieving information from other sources on the Internet and look at ways to find an individual whose email address you don't know.

Good searching!


John Mather is the President of Winformation Software, which markets Appeal, a Windows data base utilizing Winformation's revolutionary AutoRelational technology. Appeal combines the power of Relationality with unprecedented ease of use. Visit Winformation's Web site