next up previous


AntWorld

AntWorld: A Collaborative Web Search Tool

Vladimir Meñkov - David J. Neu - Qin Shi

SCILS, Rutgers University
4 Huntington St, New Brunswick, NJ 08903
vmenkov@cs.indiana.edu, djneu@acm.org
http://aplab.rutgers.edu/ant/

Abstract:

We describe our experiences designing and using AntWorld, a tool developed to make it easier for the members of a common-interest user group to collaborate in searching the web. It is hoped that this report may be of interest to designers of similar systems, as well as to potential users.

   
Introduction

The amount of information currently available on the World Wide Web or in specialized full-text databases (e.g., patents, legal documents) often makes it difficult to find useful documents using only purely automatic techniques, such as the keyword search. Finding information on a particular specialized topic--e.g., ``How does the Canadian tax law treat a Roth IRA?'' or ``How to use unsigned JavaScript in a layer within a signed HTML document?'' may require the human user to apply his or her intelligence in various ways: trying several different queries with various search engines, reading a number of documents found by a search engine to determine if they are relevant, following sequences of links, learning new terms to refine one's queries, etc. Thus, if a number of people with common interests and needs often perform web searches on related topics, much effort could be saved with a system that makes use of the intelligence of all its previous users to guide the current user to documents that are likely to be useful for his goals.

AntWorld [1,7], developed at Rutgers University and publicly available at our web site http://aplab.rutgers.edu/ant/, is the result of our attempt to design a tool which harnesses the expertise of the members of a common-interest group, as displayed by their evaluation of documents encountered while searching. It stores users' judgments about the documents they find, and uses this information to guide other users to pages they may find useful.

Our design criteria included the following:

1.
 Fine context granularity: We need to capture not simply a ``user's opinion'' about pages, but his opinion about usefulness of a page for a particular goal he has in mind. The system should allow the user to specify his goal (quest) as precisely as he desires.

2.
  Dynamic quest matching: The system should continuously provide guidance for each user based on his current quest, as well as all quests (of this and other users) which are similar to the current quest. This selection of similar quests should be refined as the user keeps browsing and the increasing number of judged pages allows the system to ``learn'' more about his goals.

3.
 Transparency: A user's browsing experience with AntWorld should be maximally similar to that in a normal browsing session with his favorite browser. Ideally, we want the user to be able to use all the browser features he is accustomed to and to see all viewed documents in the same way he normally would.

4.
Easy entry:  The system should allow a new user to start using AntWorld without installing special software, reconfiguring their computer, or going through complex sign-up procedures.

In accordance with Criterion 1, the main concept behind AntWorld is that of an information quest: a persistent information-searching process [6], identified by a numerical ID. Each quest is owned by a user, and a user can switch between his quests at will. When the user starts a quest, he provides AntWorld with an explicit description of his goals. This description can be as short as a typical search engine query (just a few words), or it can contain several sentences. As the user keeps browsing the web, he is encouraged to tell AntWorld how well the documents he finds answer his question.

Since the user has to explicitly judge a page in order for his opinion to be recorded, our technique cannot collect as much data as systems that infer the user's opinion based on indirect parameters, such as the time spent viewing a document (e.g. [4]). Motivating users to make judgments, and not merely to use judgments made by other people, may be an important issue [8]. On the other hand, we hope that the data collected via explicit judgments are more reliable.

AntWorld Architecture

AntWorld uses a client-server architecture (Fig. 1). On the client side, as described in Sec. 3, each participating user's web browser collects the user's judgments and provides a variety of user aids: ranked lists of suggested web pages, marked links to suggested pages from the the page the user is currently viewing, etc. Each client talks to the Organizational AntWorld Server (OAWS), which is shared by the members of the user community (``organization'') the user belongs to. The OAWS stores in its database profiles of all AntWorld quests ever run by the members of the community. Every time a new page judgment arrives from a client, the OAWS updates the quest profile, performs quest matching and link scoring (Sec. 5), and finally sends the updated suggestion list back to the client. The OAWS is implemented using Java servlets and a Sybase SQL server database; relevant technical details and installation instructions can be found at the project web site [1].


  
Figure 1: Data flow in AntWorld. LAPS is an optional module (Sec. 4) inserting ``ant marks'' into the document and sending the itinerary to the OAWS. Solid arrows represent real-time communication; dashed arrows represent retrieval of judged pages for indexing purposes, done in a low-priority background thread.
\begin{figure}
\centerline{
\epsfig{file=flow.eps,width=\linewidth,clip=true}
}\end{figure}

Although as of Spring 2000 there is just one Organizational AntWorld Server in active use, we expect that an organization or user group whose members desire to share their web searches will usually want to set up an OAWS of its own, to run on a computer within the organization's computer network (and within the organization's firewall, when applicable).

In a world with multiple organizations using AntWorld, we envisage three-level architecture for the entire system. The administrators of the participating Organizational AntWorld Servers may choose to periodically exchange their quest profile data via one specially designated OAWS: the AntWorld Central, using tools included into the AntWorld server distribution. For privacy reasons, exported quest data will normally be anonymized, so that they won't be identifiable with a particular person or e-mail address.

   
User Interface

To satisfy Criterion 3, AntWorld's user interface is designed as a ``browser assistant''. The user can navigate as usual in one browser window (the document window, seen in the lower part of the screen shot in Fig. 2), while another, smaller, window (the console window, in the upper part of Fig. 2) opens when the user starts (or resumes) a quest, and stays on the screen as the user works, providing two-way communication with the Organizational AntWorld Server. On the one hand, the console allows the user to access the suggestion list, generated by the OAWS based on the information AntWorld has about the user's current quest and all other quests. On the other hand, it lets the user judge (and, optionally, annotate) the page he is viewing in the document window. The user can also use the console to submit the quest description to a search engine as a query, switch to a different quest, start a new quest, view the quest summary (the complete list of judgments and annotations made during the quest), view the list of similar quests known to AntWorld, edit the description of the current quest, or tune the method AntWorld uses for finding similar quests. The front-end console functionality has been implemented using HTML and JavaScript, while most of the actual computation is done on the server side.


  
Figure 2: AntWorld session screen shot. The console window is above, the document window below. The ant icon next to Alta Vista result No. 20 indicates that the page is on the AntWorld suggestion list for this quest.
\begin{figure}
\centerline{
\epsfig{file=ps1.ps,height=8cm,clip=true}
}\end{figure}

Similar ``browser assistant'' programs have been developed for other purposes elsewhere, for example by Chakrabarti et al [3], where the console is implemented as an applet rather than a JavaScript program.

Security.

For security reasons, web browsers normally don't allow JavaScript to perform the kind of operations that our architecture requires. Specifically, the code in the console window normally would not be allowed either to monitor or to control the content of the document window, since the documents in the two windows come from different hosts. Special techniques to override these restriction are, unfortunately, browser-specific. Our implementation utilizes the Netscape security model [5, Chapter21], which involves digitally signing our JavaScript code, and requesting the necessary access privileges from the user.

Implementation options: JavaScript vs. servlets vs. applets.

In earlier versions of the system we experimented with implementing the console as JavaScript-less HTML code generated by a servlet, or as an applet. The main advantage of the first approach is its greater portability: the user does not need a browser that supports JavaScript, and is not troubled by the incompatibilities between different browsers' JavaScript interpreters. The main disadvantage of this approach is that in order for the console window to ``know'' what URL the document window contains, the AntWorld server would need to take complete control of everything the user sees. In order to capture all user's requests, the server in that implementation filtered each document and replaced each link <A HREF="$\alpha$"> with a link back to the AntWorld server:<A HREF="AntWorldServerURL?url=$\alpha$">. Besides the cost and delays involved, this approach was not particularly robust, breaking on documents that contained HTML syntax errors or that dynamically generated URLs using JavaScript.

Implementing the client as a Java applet would have offered us the advantage of a more powerful, compilable language. However, we did not pursue the applet approach beyond our 1998 version, primarily because at the time no mechanism for putting a standard Java Virtual Machine in a user's browser was available.

   
``Ant Marking'' of Links

Web browser designers have long recognized that it is useful to mark (usually, by a changed color) links pointing to documents that the user have already accessed. AntWorld goes one step further, and allows an option for ``ant marking'': marking all links to ``relevant'' documents (i.e., those that appear on the current quest's suggestion list), by inserting a small ant-shaped icon next to them. One such icon is seen in Fig. 2, near Alta Vista search result No. 20. In our experience this feature helps a user not only to collaborate with other people but also to ``collaborate with oneself'', because it lets him quickly identify those of the previously useful links that he actually found useful for this quest or a related one.

To mark links, the system must parse all HTML files requested by the user, and modify them when needed. Doing this at the OAWS is not scalable, and intervening network delays can negatively impact system usability.

Our solution is to supply each user who wants the ant marking functionality with a small client-side application (``Antscape''), whose main component is the Local Ant Proxy Server (LAPS). Antscape works like a wrapper around Netscape. It can be used to modify the Netscape preference file, for the duration of the session, so that the browser will forward all HTTP and FTP request to the LAPS, as shown in Fig. 1. The LAPS analyzes the content type of each document received, and if it is an HTML document, parses it for links.

Since all pages requested by the browser pass through LAPS, using LAPS also allows us log on the organizational server each page the user views, and not only those he explicitly judges. Although these ``user itinerary'' data are not as important for AntWorld as page judgments, they are used in some quest matching methods (See ``Using LAD'' in Sec. 5) and in usability studies.

LAPS needs to have a significant amount of knowledge about the general AntWorld setup (where the OAWS is) and the current quest (the quest ID to log the pages under; the suggested URL list). It receives this information from the OAWS. Fixed-length data are transmitted via additional headers that the OAWS includes in HTTP responses it sends to the browser when the user starts a quest, switches to another quest, or judges a page. Putting the suggestion list into HTTP headers would be impractical because of its size; instead, when the OAWS sends the suggestion list to the browser (as a JavaScript array in an HTML document), it simply indicates that fact in an HTTP header, and the LAPS parses the document body to extract the suggestion list.

Although Antscape is written in Java, which is a highly portable language, it needs to have some idea of how Netscape is installed on particular platform in order to be able to interact with it. The currently available version of Antscape works with UNIX and Microsoft operating systems. It needs no special configuring on systems where Netscape is installed in the most standard fashion, but on other machines some amount of configuring is required, depending on how the local Netscape browser has been installed and configured. In certain cases (e.g., where a local site administrator uses Netscape Mission Control to ``prohibit'' browsers from ever using a proxy server), LAPS cannot be used at all.

In accordance with Criterion 4, using the Antscape client is optional: we want most web users to be able to ``test-drive'' AntWorld without installing additional software on their desktop.

   
Quest Matching

If AntWorld is to be a useful tool for collaboration among members of a common-interest group, it needs an effective mechanism for making the user aware of the quests most similar to his current quest and guiding him to the pages that the owners of those quests found useful. To accomplish these goals, AntWorld uses a framework that consists of two stages: quest matching (QM) and link scoring.

Quest matching means finding quests in the AntWorld database which are ``similar'' to a given quest, and ranking them according to this similarity. This operation requires some quest similarity measure, $\mbox{Sim}(Q_1,Q_2)$, which reflects how relevant quest Q2 may be for a person whose goals and judgment history are encapsulated in quest Q1. Most quest matching methods employed in AntWorld are dynamic: they update the list of similar quests list each time the owner of Q1 makes a judgment, thereby providing AntWorld with additional information about his goals and preferences in this quest.

Although an AntWorld user has direct access to the lists of similar quests, it is also important to provide him with a list of suggested URLs (``links''). This ranked list is available to the user both directly in his AntWorld console and, if the Antscape client is being used, as ant marks (Sec. 4). The suggestion list consists of URLs that were judged as useful in some of the quests that AntWorld believes relevant to the user's current quest. We refer to the process of compiling this ranked list as link scoring. A link scoring method similar to the one currently used by AntWorld has been presented in our earlier report [6].

Vector space framework.

AntWorld implements a variety of quest matching methods. The quest similarity measures $\mbox{Sim}(Q_1,Q_2)$ used in most currently supported methods can be described in the traditional vector space model framework with a cosine similarity measure (see e.g. [10, Chapter 4.4]). Since the vector space models have been developed as a framework for ranking documents (texts stored in the database, which may be quite long) with respect to their relevance to queries (short sequences of terms typed by the user), such models usually distinguish between queries and documents.

For the purposes of a vector model, any query q or documents d can be represented as a vector of term frequencies $\vec{f}(q)$ or $\vec{f}(d) = \{f(d,t_1), f(d,t_2), \dots$, $f(d,t_n)\}$, where n is the number of distinct terms in the document collection, and fi is the term frequency, i.e. the number of occurrences, of term ti in the document or query in question. These vectors are usually very sparse, since every single document contains only a fraction of all n terms from the collection. In a document-ranking method using a vector model with a cosine similarity measure, the relevance of document d for query q is computed as

 \begin{displaymath}
\mbox{Sim}(q,d) =
\frac{
\vec{u}(q) \cdot \vec{v}(d)
}
{
\vert\vert\vec{u}(q)\vert\vert \, \vert\vert\vec{v}(d)\vert\vert
},
\end{displaymath} (1)

where $\vec{a} \cdot \vec{b} = \sum_t{ a(t) b(t) g(t)}$ signifies ag(t)-weighted dot product of two vectors $\vec{a}$ and $\vec{b}$, and the norm is based on the dot product: $\vert\vert\vec{a}\vert\vert =
\sqrt{\vec{a} \cdot \vec{a}}$. The design of the actual document-ranking method defines how vectors $\vec{u}(q)$ and $\vec{v}(d)$ are computed from the raw frequency vectors $\vec{f}(q)$and $\vec{f}(d)$, and what term weighting function g(t) is used. Usually, vectors $\vec{u}$ and $\vec{v}$ are designed to take the document size into account (e.g., by dividing the term frequency in a document by the document length), while function gensures that less common (and, thus, more significant) terms appear in the dot product with a greater weight than the more common ones.

AntWorld quest matching does not distinguish between ``queries'' and ``documents'', since it measures the similarity between entities of the same nature: the current quest Q1 and some other quest Q2. Thus a symmetric similarity measure $\mbox{Sim}(Q_1,Q_2) = \mbox{Sim}(Q_2,Q_1)$ seems more intuitive. Such a symmetric measure can be obtained if the vectors $\vec{u}$ and $\vec{v}$ in (1) are computed by the same formula, resulting in

 \begin{displaymath}
\mbox{Sim}(Q_1,Q_2) =
\frac{
\vec{u}(Q_1) \cdot \vec{u}(Q_2)...
...t\vec{u}(Q_1)\vert\vert \, \vert\vert\vec{u}(Q_2)\vert\vert
}.
\end{displaymath} (2)

A rather natural property of such a symmetric dot product is that the quest most similar to any quest Q1 is Q1 itself:

 \begin{displaymath}
\mbox{Sim}(Q_1,Q_1) = 1 \ge \mbox{Sim}(Q_1,Q_2)
\end{displaymath} (3)

Quest profiles and vectors.

Each AntWorld quest is represented by a quest profile: a set of pairs $(\lambda, d)$, where d is a document (either a user-supplied explicit description of quest goals, or a Web document already judged in this quest), and the weight $\lambda=\lambda(d,Q)$ represents the importance of the document d in the quest Q [6]. AntWorld's quest matching methods use various algorithms to generate the vector $\vec{u}$ (or vectors $\vec{u}$ and $\vec{v}$) for this quest from the quest profile data.

An important vector for several AntWorld's quest matching methods is $\vec{\Psi}(Q)$. This vector has a non-zero component for every term t that appears in any of the documents in the profile of Q (the set of terms we will refer to as ${\mathcal Q}$). The value of the component $\Psi(t,Q)$ component is based on the frequencies of t in the underlying documents and the importance of those documents for the quest. All quest matching methods use the same term weighting function g(t). Both $\Psi$ and g are fully described in our earlier report [6], where we presented a quest matching methods extending the document matching method proposed by Singhal [9].

Some of AntWorld's quest matching methods, identified by their internal names, will be briefly described below.

Using full quest profile vectors.

The quest matching method Relevance02 computes $\mbox{Sim}(Q_1,Q_2)$as a symmetric dot product (2), where the vector $\vec{u}(Q)$ is the entire $\vec{\Psi}(Q)$. Thus the summation in the dot product is performed over ${\mathcal Q}_1 \cap {\mathcal Q}_2$.

The only difference between this method and the one in [6] is the denominator, which was absent in the earlier report. As one might expect, introducing this normalization did improve the quality of quest matching.

This quest matching method is dynamic: $\vec{\Psi}(Q_1)$ and the list of quests similar to Q1 are updated every time the owner of Q1makes a judgment.

Using LAD.

Some of AntWorld's more interesting quest matching method are loosely based on Logical Analysis of Data (LAD) [2]. LAD computes, for two given sets of documents (the positive and negative training sets), the cut set ${\mathcal S}$: a short set of terms that best discriminate between the documents from the two sets. In the AntWorld context, the positive training set for a quest Q consists of the quest's explicit description and the documents that have been judged useful, while the negative set includes those that have been judged useless, or visited but not judged at all. Our current use of LAD is rather simple: we use the quest's cut set ${\mathcal S}(Q)$, which is typically much smaller than the quest profile term set ${\mathcal Q}$ (10-30 terms vs. 102-103), to reduce the cost of computing dot products (1) or (2).

For example, the method Relevance13 uses a symmetric dot product quest similarity measure (2) with $\vec{u}(Q_i) = \vec{\Psi}_{\rm cut}(Q_i)$: the vector $\vec{\Psi}(Q_i)$ restricted to the terms that are in the cut set ${\mathcal S}(Q_i)$. While this greatly reduces the scope of the summation (2)--from ${\mathcal Q}_1 \cap {\mathcal Q}_2$ to ${\mathcal S}(Q_1) \cap {\mathcal S}(Q_2)$, it also reduces the quest matching recall, i.e. it results in some relevant quests not being selected. Another disadvantage of this approach is that making this quest matching method dynamic would require recomputing ${\mathcal S}(Q_1)$ after every judgment--a rather expensive proposition.

A compromise approach, implemented as the method Relevance07, restricts only $\vec{\Psi}(Q_2)$ to the cut set ${\mathcal S}(Q_2)$, resulting in

\begin{displaymath}\mbox{Sim}(Q_1, Q_2) =
\frac{
\vec{\Psi}(Q_1) \cdot \vec{\Psi...
...vert\vert \, \vert\vert \vec{\Psi}_{\rm cut}(Q_2)\vert\vert
}.
\end{displaymath}

Thus, the summation is done over ${\mathcal Q_1} \cap {\mathcal
S}(Q_2)$. This solution facilitates computation as compared to the full-vector Relevance02, without unduly compromising the quality of similar quest selection. The quest matching is dynamic with respect to the current quest, because the cut set of Q1 does not need to be updated after every judgment. Admittedly, the activity of other users (owners of other quests, such as Q2) will not be reflected in the similar quest list produced by this list until the cut sets in question are recomputed. This is currently done only once every 24 hours.

Although the similarity measure used in this method is not truly symmetric, we make it obey (3) by the expedient of always computing $\mbox{Sim}(Q_1,Q_1)$ using full $\vec{\Psi}(Q_1)$ instead of $\vec{\Psi}_{\rm cut}(Q_1)$.

Currently we are also experimenting with quest matching methods which use the LAD-produced cut sets in more complex ways, e.g. by taking into account the ranks of terms in the cut sets.

Usage Prospects

As already mentioned in Sec. 1, the main goal of AntWorld is to provide a finding-sharing tool for members of common-interest user communities. We hope it to be particularly useful for group of several people whose search tasks are both complex and overlapping, and who have a real need to find answers to their questions. We are interested in collaboration with academic or industrial groups who feel that the system may be of interest to them. So far, the main users of the system have been the AntWorld project staff and the students using AntWorld as paid testers or as an assignment for a class [8]. These two categories of users accounted for 4,200 judgments and 1,500 judgments, respectively, out of the 6,000 judgment recorded in AntWorld database from June 1998 to February 2000.

While current AntWorld users do benefit from discovering one another's quests, we have also observed the system's suitability for a number of other uses:

Acknowledgments

This work is supported by the Defense Advanced Research Projects Agency (DARPA) Contract Number N66001-97-C-8537.

Most of the ideas behind AntWorld were developed by the project's Principal Investigators: Paul B. Kantor, Endre Boros, and Benjamin Melamed. Bracha Shapira and Insuk Oh conducted user studies.

Bibliography

1
The AntWorld web site, http://aplab.rutgers.edu/ant/, 1998-2000.

2
Endre Boros, Paul B. Kantor, and David J. Neu.
Pheromonic representation of user quests by digital structures.
In Proceedings of the ASIS Annual Meeting, volume 36, 1999.
Available at http://aplab.rutgers.edu/ant/papers/.

3
Soumen Chakrabarti, David A. Gibson, and Kevin S. McCurley.
Surfing the web backwards.
In The Eighth International World Wide Web Conference, May 1999.
Available at http://www8.org/w8-papers/5b-hypertext-media/surfing/surfing.html.

4
The Direct Hit web site, http://www.directhit.com, 1999.

5
David Flanagan.
JavaScript: the Definitive Guide.
O'Reilly, Sebastopol, CA, 3rd edition, 1998.

6
Paul B. Kantor, Endre Boros, Benjamin Melamed, and Vladimir Meñkov.
The information quest: A dynamic model of user's information needs.
In Proceedings of the ASIS Annual Meeting, volume 36, 1999.
Available at http://aplab.rutgers.edu/ant/papers/.

7
Paul B. Kantor, Benjamin Melamed, Endre Boros, Vladimir Meñkov, Dave J. Neu, Myung-Ho Kim, and Qin Shi.
Ant World.
In SIGIR'99: 22nd Annual International ACM SIGIR Conference, 1999.
Available at http://aplab.rutgers.edu/ant/papers/.

8
Bracha Shapira, Paul B. Kantor, and Benjamin Melamed.
Preliminary study of the effect of motivation on user behavior in a collaborative information finding system.
In SIGIR'2000: 23nd Annual International ACM SIGIR Conference, 2000.
(Submitted for publication).

9
Amit Singhal.
AT&T at TREC-6.
In NIST Special Publication 500-240: The Sixth Text REtrieval Conference (TREC 6), pages 215-226. NIST, 1997.
Available at http://trec.nist.gov/pubs/trec6/papers/att.ps.

10
Ian H. Witten, Alistair Moffat, and Timothy C. Bell.
Managing Gigabytes: Compressing and Indexing Documents and Images.
Morgan Kaufmann Publishers, Inc., San Francisco, CA, 2nd edition, 1999.

About this document ...

AntWorld: A Collaborative Web Search Tool

This document was generated using the LaTeX2HTML translator Version 98.1p1 release (March 2nd, 1998)

Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.

The command line arguments were:
latex2html -short_extn -split +0 -local_icons -address vmenkov@cs.indiana.edu dcw.tex.

The translation was initiated by Vladimir Menkov on 2000-04-04


next up previous
vmenkov@cs.indiana.edu