Building an archive for WebGlimpse
For an overview of WebGlimpse, please see the
WebGlimpse home page.
Building and indexing an Archive
If you are configuring a new archive,
run confarc [archivedir] [virtualhost].
You will be asked several questions, listed and explained below.
Keep this page in front of you while you answer them.
-
Directory where the index and other WebGlimpse generated files will reside:
-
We recommend using the root directory of the archive
(where the main pages reside).
All index files will be built in this directory. You cannot use the same
directory for more than one archive.
-
Archive title
-
Name the archive any way you want, but we suggest putting some thought
into it. People may want to find you later using this description.
-
Full path to HTTP server config file
-
Normally this path is stored in a site configuration file, but
if it is not found there you will be prompted for it. Enter the full
path (starting with /) to the file containing DocumentRoot, UserDir,
and other settings.
-
What is the DocumentRoot for this web server?
-
If the setting for DocumentRoot could not be found in the HTTP
server configuration file, you will be prompted for it.
Enter the full path to the directory where your web pages are
stored.
-
Index by Directory, or Traverse URLs?
-
Index by Directory is the traditional way most local search engines work.
You specify one or more directories, and all the files contained in those directories
are indexed. Files in subdirectories will be indexed too.
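To make the Index by Directory behaviour concrete, here is a minimal Python sketch of collecting every file under a set of directories, subdirectories included. It is an illustration only, not WebGlimpse's own code; the function name collect_files is hypothetical.

```python
import os


def collect_files(directories):
    """Return every file under the given directories, including subdirectories."""
    found = []
    for top in directories:
        # os.walk visits the top directory and then each subdirectory in turn.
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Calling collect_files(["/path/to/archive"]) would return the full list of candidate files; which of them actually get indexed is decided by the exclusion rules described later on this page.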
A Traversal-based configuration will automatically scan your archive,
starting from the root URLs (which you will give later) and adding all URLs
within a certain number of hops (also given later).
Enter D to index by Directory, or T to Traverse URLs.
Note: The following questions are asked only if you chose Traverse URLs.
-
Do you allow traversal of remote pages?
-
WebGlimpse can automatically fetch URLs pointed to
from your local pages. The content of these "remote pages"
will be copied to your local disk (in the .remote directory)
and indexed. (If you are short on space, you can remove the
copied content and the search will still work.)
Be careful with this option, however, because WebGlimpse
will traverse the remote sites until either the limit on the
number of hops or the limit on the total number of pages
is reached (whichever comes first).
You may end up fetching far more pages than you intended.
-
Follow only *explicitly* defined remote links?
-
To prevent out-of-control fetching of remote pages, you can use
this option. Only URLs that are explicitly listed in one of
your local pages will be collected.
-
Number of allowed hops from each root URL
-
Only pages that can be reached from any of the root URLs
(to be given later) within no more than this number of hops will
be collected. The same limit is enforced for local and remote
pages.
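The interaction between the hop limit and the page limit can be sketched as a breadth-first traversal. The Python below is a simplified illustration and not WebGlimpse's implementation: the links dictionary is a hypothetical stand-in for fetching and parsing real pages, and a single max_pages value is assumed in place of the separate local and remote limits.

```python
from collections import deque


def traverse(links, roots, max_hops, max_pages):
    """Collect pages reachable from the roots within max_hops, up to max_pages.

    `links` maps each URL to the list of URLs it points to (an assumed,
    in-memory replacement for fetching pages). Returns {url: hops from
    the nearest root}.
    """
    hops = {}
    queue = deque()
    for root in roots:
        if root not in hops and len(hops) < max_pages:
            hops[root] = 0
            queue.append(root)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # following links from here would exceed the hop limit
        for target in links.get(url, []):
            if target in hops:
                continue  # already collected
            if len(hops) >= max_pages:
                return hops  # the page limit was reached first
            hops[target] = hops[url] + 1
            queue.append(target)
    return hops
```

With links = {"/": ["/a", "/b"], "/a": ["/c"], "/c": ["/d"]}, a call to traverse(links, ["/"], 2, 100) collects "/", "/a", "/b" and "/c" but not "/d", which is three hops from the root. Lowering max_pages to 2 stops the traversal after "/" and "/a".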
-
Maximum number of local pages
-
As an added precaution against out-of-control collection, you can set
a limit on the total number of pages collected locally.
-
Maximum number of remote pages
-
We strongly recommend that you set this limit low and that
you periodically check that you are collecting only relevant pages.
-
Define a neighborhood by the following number of hops from each page
-
WebGlimpse lets you search from any local page while
limiting the search to a neighborhood of that page.
The neighborhood is computed from this number of hops.
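One way to picture a neighborhood is as the set of pages within the given number of hops of a starting page, which is then used to narrow a list of search hits. The Python sketch below rests on assumptions not stated on this page: it follows outgoing links only, it represents the site as a hypothetical in-memory links dictionary, and the names neighborhood and search_near are invented for illustration.

```python
from collections import deque


def neighborhood(links, page, hops):
    """Return the set of pages within `hops` links of `page`, including itself."""
    distance = {page: 0}
    queue = deque([page])
    while queue:
        current = queue.popleft()
        if distance[current] == hops:
            continue  # do not expand beyond the neighborhood radius
        for target in links.get(current, []):
            if target not in distance:
                distance[target] = distance[current] + 1
                queue.append(target)
    return set(distance)


def search_near(hits, links, page, hops):
    """Keep only the search hits that fall inside the page's neighborhood."""
    nearby = neighborhood(links, page, hops)
    return [hit for hit in hits if hit in nearby]
```

A larger hop count gives each page a wider neighborhood, so a neighborhood search returns more hits but is less specific to the page it was started from.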
-
Add search boxes to pages?
-
WebGlimpse will automatically add search boxes (taken from .wgbox)
to all local pages (unless you use the exclude facilities).
If you say "no" here, there will be only one search page for the
whole archive -- wgindex.html.
confarc now constructs some files and adds them to the archive
directory. The most important ones for
you are .wgfilter-index and .wgfilter-box.
They allow you to exclude files from being indexed and/or from
being modified with a search box.
The first file (.wgfilter-index) provides a way to exclude files from
being indexed. The rules are similar to the ones
Harvest uses to exclude files from
its collection. The default file is pretty straightforward.
It works by matching patterns against the file names.
The second file (.wgfilter-box) provides a way to keep
the WebGlimpse search box from being added to
local HTML files. It follows the same rules.
(Obviously, if a file is not to be indexed, no search box will be
added to it.)
Standard .wgfilter-index and .wgfilter-box files are added. You can change
them at any time, and the changes will take effect the next time you
index.
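Filtering by pattern matching on file names can be sketched as follows. This Python example assumes a hypothetical rule format of one "Allow pattern" or "Deny pattern" line per rule, with regular-expression patterns and the first matching rule winning; check the default .wgfilter-index in your archive for the syntax WebGlimpse actually uses. All function names here are invented for illustration.

```python
import re


def parse_rules(text):
    """Parse 'Allow <regex>' / 'Deny <regex>' lines (an assumed format)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        action, pattern = line.split(None, 1)
        rules.append((action.lower() == "allow", re.compile(pattern)))
    return rules


def is_allowed(rules, filename, default=True):
    """Apply the first rule whose pattern matches; fall back to `default`."""
    for allow, pattern in rules:
        if pattern.search(filename):
            return allow
    return default


def gets_search_box(index_rules, box_rules, filename):
    """A file receives a search box only if it is indexed and not box-excluded."""
    return is_allowed(index_rules, filename) and is_allowed(box_rules, filename)
```

Under these assumptions, a rule file containing "Deny \.gif$" keeps GIF images out of the index, and because gets_search_box checks the index rules first, a file excluded from indexing never receives a search box either.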
Finally comes the moment you've been waiting for: You now enter
the root URLs from which WebGlimpse will do the traversal and
indexing. WebGlimpse will follow all links (recursively) from
the ones you give.
-
Now you will need to enter the URL(s) of the file(s) you would
like to traverse. Enter a blank line to exit this portion.
-
Just enter one URL per line. Chances are that you need only one root
URL.
confarc uses two main scripts: makenh (which computes neighborhoods
to index) and addsearch (which adds the appropriate search boxes).
Each of them can be run separately now or later.
To clean an archive by removing everything that WebGlimpse
added, run
rmarc.
You will be asked only for the directory of the archive.
Written by Udi Manber
Updated by Golda Bernstein.
glimpse@cs.arizona.edu