Building an archive for WebGlimpse

For an overview of WebGlimpse, please see the WebGlimpse home page.

Building and indexing an archive

If you are configuring a new archive, run confarc [archivedir] [virtualhost]. You will be asked several questions, listed and explained below. It's probably a good idea to have this page in front of you when you answer the questions.
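
For example, with a placeholder archive directory and virtual host name (substitute your own):

    confarc /home/httpd/htdocs/myarchive www.example.com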

Directory where the index and other WebGlimpse generated files will reside:

It's probably a good idea to use the root directory of the archive (where the main pages reside). All index files will be built in this directory. You cannot use the same directory for more than one archive.

Archive title

Name the archive any way you want, but we suggest putting some thought into it: people may want to find you later using this description.

Full path to HTTP server config file

Normally this path is stored in a site configuration file; if it was not found there, you will be prompted for it. Enter the full path (starting with /) to the file containing DocumentRoot, UserDir and other settings.
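
On a typical Apache installation, for instance, the answer might look something like this (an illustrative path only; your server's configuration may live elsewhere):

    /usr/local/apache/conf/httpd.conf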

What is the DocumentRoot for this web server?

If the setting for DocumentRoot could not be found in the HTTP server configuration file, you will be prompted for it. Enter the full path to the directory where your web pages are stored.
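
For instance (again, an illustrative path only):

    /usr/local/apache/htdocs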

Index by Directory, or Traverse URLs?

Index by Directory is the traditional way most local search engines work. You specify one or more directories, and all files contained in those directories, including their subdirectories, are indexed.

A Traversal-based configuration will automatically scan your archive starting from root URLs (which you will give later) and add all URLs within a certain number of hops (also given later).

Enter D to index by Directory, or T to Traverse URLs


Note: The following questions are asked only if you chose Traverse URLs

Do you allow traversal of remote pages?

WebGlimpse can be used to automatically fetch pages that your local pages link to. The content of these "remote pages" will be copied to your local disk (in the .remote directory) and indexed. (If you are short on space, you can remove the copied content later; the search will still work.) Be careful when you use this option, however: WebGlimpse will traverse the remote sites until either the limit on the number of hops or the limit on the total number of pages is reached, whichever comes first, so you may end up fetching many more pages than you intended.

Follow only *explicitly* defined remote links?

Use this option to prevent out-of-control fetching of remote pages: only URLs that are explicitly listed in one of your local pages will be collected.

Number of allowed hops from each root URL

Only pages that can be reached from one of the root URLs (to be given later) in no more than this number of hops will be collected. The same limit applies to both local and remote pages. For example, with a limit of 2, a page linked directly from a root URL is 1 hop away, a page linked from that page is 2 hops away, and anything farther will not be collected.

Maximum number of local pages

As an added precaution against out-of-control collection, you can set a limit on the total number of pages collected locally.

Maximum number of remote pages

We strongly recommend that you set this limit low and periodically check that you are collecting only relevant pages.

Define a neighborhood by the following number of hops from each page

WebGlimpse allows you to search from each local page while limiting the search to a neighborhood of that page. The neighborhood is computed based on this number of hops.

Add search boxes to pages?

WebGlimpse will automatically add search boxes (taken from .wgbox) to all local pages (unless you use the exclude facilities). If you say "no" here, there will be only one search page for the whole archive -- wgindex.html.
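
As a rough illustration only -- this is not the snippet that confarc actually generates, and the CGI path and field name below are placeholders -- a search box is essentially a small HTML form that submits a query to the WebGlimpse CGI:

    <!-- illustrative sketch only; the real .wgbox produced by confarc will differ -->
    <FORM METHOD="GET" ACTION="/cgi-bin/webglimpse/myarchive">
    Search this archive: <INPUT NAME="query" SIZE="30">
    <INPUT TYPE="SUBMIT" VALUE="Search">
    </FORM>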

confarc now constructs some files and adds them to the archive directory. The most important ones for you are .wgfilter-index and .wgfilter-box. The first (.wgfilter-index) lets you exclude files from being indexed; the rules are similar to the way Harvest excludes files from its collection, and they work by pattern matching on file names. The second (.wgfilter-box) lets you exclude local HTML files from having the WebGlimpse search box added to them; the same rules apply. (Obviously, if a file is not indexed, no search box will be added to it.) Standard versions of both files are installed by default; they are straightforward, and you can change them at any time -- changes take effect the next time you index.
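
As a purely illustrative sketch (the exact rule syntax is whatever the default files installed by confarc use, so consult those rather than copying these lines), a .wgfilter-index might exclude files by name patterns along these lines:

    # hypothetical patterns only -- see the default .wgfilter-index for the real syntax
    \.gif$
    \.jpg$
    \.tar\.gz$
    /private/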

Finally comes the moment you've been waiting for: you now enter the root URLs from which WebGlimpse will do the traversal and indexing. WebGlimpse will follow all links (recursively) from the URLs you give.

Now you will need to enter the URL(s) of the file(s) you would like to traverse. Enter a blank line to exit this portion.

Just enter one URL per line. Chances are that you need only one root URL.
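
For instance, to traverse a single site you might answer the prompt with one URL and then a blank line to finish (the URL below is a placeholder for your own site):

    http://www.example.com/
    (blank line to finish)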

confarc uses two main scripts: makenh (which computes neighborhoods to index) and addsearch (which adds the appropriate search boxes). Each of them can be run separately now or later.
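
If you want to re-run either of them by hand later, the invocations below are an assumption rather than documented usage -- they assume each script takes the archive directory as its argument, so check the scripts themselves for their exact arguments:

    makenh /home/httpd/htdocs/myarchive      # recompute neighborhoods (assumed usage)
    addsearch /home/httpd/htdocs/myarchive   # re-add search boxes (assumed usage)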

To clean an archive, removing everything that WebGlimpse added, run rmarc. You will only be asked for the directory of the archive.

Written by Udi Manber
Updated by Golda Bernstein.
glimpse@cs.arizona.edu