Endre Boros, Paul B. Kantor, Dave J. Neu
Rutgers University
4 Huntington St, New Brunswick, NJ 08903
kantor@scils.rutgers.edu
http://aplab.rutgers.edu/ant/
In a novel approach to information finding in networked environments, each user's specific purpose or ``quest" can be represented in numerous ways. The most familiar is a list of keywords, or a natural language sentence or paragraph. More effective is an extended text that has been judged as to relevance. This forms the basis of relevance feedback, as it is used in information retrieval. In the ``Ant World" project [Ant World, 1999,Kantor et al., 1999b,Kantor et al., 1999a], the items to be retrieved are not documents, but rather quests, represented by entire collections of judged documents. In order to save space and time we have developed methods for representing these complex entities in a short string of about 1,000 bytes, which we call a ``Digital Information Pheromone" (DIP). The principles for determining the DIP for a given quest, and for matching DIPs to each other are presented. The effectiveness of this scheme is explored with some applications to the large judged collections of TREC documents.
The vector space model represents a document as a bag of words, in which the order of occurrence is ignored, but the frequency with which a word appears in the text is considered [Salton, 1971]. We use this approach as the starting point for our analysis.
Let us denote by ft(d) the relative frequency of term
t in document d. We shall also use ft as a generic operator,
denoting always the the relative frequency of term t in the document
to which it will be applied.
Documents are represented by the vectors
,
where
denotes a collection of terms, considered in the
particular dataset. Usually
will contain all non-stop words
that occur in the training documents.
We shall consider logical rules to describe relevance with respect
to a specified topic or quest.
For instance, if someone is looking for information on pets, the simple
rule
One possible way of finding good logical rules is essentially mimicking the popular word-game in which one has to determine a hidden subject by a series of yes-or-no questions. By asking questions, one can at each step divide the large family of possible subjects into two subfamilies, and one can achieve the results the fastest (in the sense of having to ask the smallest number of questions) if this division is as even as possible at each step.
C4.5 is an algorithm developed by Quinlan [Quinlan, 1993], which asks questions of the form is the term t occurring more than ct times? and tries to identify the term t and the threshold frequency ct in such a way that either possible answer yields the most information on the training set. To achieve this, this method proposes various information theoretic measures, and applies them in a sequential fashion, with no backtracking.
As a result, one obtains a so called binary decision tree, the
nodes of which represent subsets of the training documents for which
a particular logical statement is true. An example is included
in Figure 1, for a topic about cats as pets. The tree
in this example represents three rules: one for relevance and two for
irrelevance. The rule
Given a set of training documents with known classification of their
relevance, C4.5 recursively searches for a term t and a threshold
ct for which the branching, i.e. splitting the documents into two
groups according to the truth value of the logical proposition
,
improves the most on the homogeneity, with respect to relevance,
of these groups. At the end, all groups of documents corresponding to
the leaves of the tree are homogeneous, i.e. each
contains either only relevant, or only irrelevant documents. In
practice, this may result in a large number of branches, with very
small groups of documents, i.e. generating rules that apply only to
small subsets of the training documents. To prevent this, C4.5 stops
earlier, even if perfect homogeneity of the leaf groups is not
achieved yet. Rules generated in this way are not perfect, in the
sense that they may make (a small number of) mistakes on the training
set, but each of them is ``active'' on a larger number of training
documents.
For each generated rule R, C4.5 reports a measure of ``correctness''
.
In our experiments we use the sum S(d) of the
c(R) values of those rules which are active for document d, as the
score of d.
LAD is another technique to find logical rules characterizing the difference between relevant and irrelevant documents, based on a set of training documents [Crama et al., 1988,Boros et al., 1999].
The distinguishing feature of LAD is that it searches in an exhaustive manner for all highly relevant terms, and then separately for all reasonable rules formulated with those terms, and applies optimization models to select the best ones.
In a typical problem arising from the TREC collection [NIST, 1999], the training set consists of a few hundred relevant and irrelevant documents, containing about 12,000-15,000 different terms.
In the first phase of LAD a much smaller subset of terms
is
selected based on the fact that the frequency vectors
expressed in this basis are significantly different
for relevant training documents as opposed to irrelevant ones. For
each of these terms a ``best'' threshold value ct,
is
generated, and a document d will be represented in the following
analysis by the binary vector
,
where Xt is
defined by
In a typical easy example (where most of the learning algorithms
considered were able to learn well from the training set) the size
of the resulting support set was 3-5 times the required
minimum Hamming distance, e.g. requiring a minimum Hamming distance of
10 resulted in a smallest support set of size between 30 and 50.
Larger ratios typically occurred for more difficult training sets.
This implies that even when we demand a minimum Hamming distance of
20 or 30, for a clear, ``noiseless'' separation between relevant
and irrelevant training documents, we had to consider only 60-150
different terms for the ``good'' training sets (and a somewhat larger
number for the others.)
In the second phase of LAD, we consider the training documents
represented by the reasonable short binary vectors
,
where
is a support set chosen in the previous phase.
Simple rules in the form of elementary conjunctions are considered,
e.g.
The number of propositions Xt in the definition of the rule R is called the degree of R, while the non-zero number C+(R)for positive patterns and C-(R) for negative patterns is called the coverage of R.
The second phase of LAD has several steps. First, all patterns (with degree lower than some preset limit) are generated, subject to the requirement that the coverage of each pattern be ``large enough''. We remark that, in the worst case, the number of such patterns can be huge (unless we restrict the degree to be quite small!) However, as recent results show [Boros and Yagiura, 1999,Purdom, 1999], the expected number of acceptable patterns in randomly generated input data is limited, regardless of any degree restriction, and moreover such patterns can be generated in polynomial expected time.
Next, from this typically large set of ``good looking'' rules, we
select a smallest subset
of rules, which still can explain the
differences between relevant and irrelevant documents. This step is
formulated again as a set covering type optimization problem.
For a subset
of the possible rules and for every pair d and
d' of relevant and irrelevant documents we associate a separation measure
Given such a theory
,
we associate to any document d a score
S(d) by defining
We have selected the first 20 topics from the TREC7 competition, and carried out a large number of experiments on those. Some of these experiments were also carried out for several other topics, however many of those other topics are ``extreme'' in the sense that the number of relevant training documents is very small (i.e. there is almost nothing to learn from), and the different learning algorithms were not able to handle such extreme cases in consistent ways.
Another limiting factor in our computations is due to the C4.5 procedures themselves. In the current implementation of C4.5, the very high number of potential terms (typically in the range of 12,000 - 15,000) was just impossible to handle. We had to choose specific ways of providing a small, but meaningful subset of the terms, which then were used in C4.5. We explored three different approaches: the terms appearing in the topic description seemed to be a natural choice, and proved to be the best one. We also tried a corpus-driven approach, considering only those terms which appear at least in 3% of the training documents. Even though this choice seems very reasonable, the computational results are very poor. As a third option, we also considered an LAD-C4.5 hybrid, in which we generate a small support set by LAD, and then use C4.5 to develop the best rules using these terms. This option proved to be reasonably good, though slightly worse than the topic term set (perhaps due to the fact that LAD generated, in these examples, a smaller set of terms than those appearing in the topic description.)
The test set in TREC7 consists of about 160,000 documents, and for
each topic there are only a few hundred documents with a known
relevance classification ( judged documents). Using the indicated
scoring functions, with each variant of the algorithms we scored all
160,000 documents, considered the top 1000, and computed the average
score pave as used at TREC: Let N denote the number of judged
relevant documents for a particular topic, let
denote the ranked list of the top 1000
documents produced by the considered algorithm, and let gi denote
the number of judged relevant documents among the first i documents
in this list. Then pave is defined by
In the first two runs, Ant Route 1 and 2, we used LAD, setting the minimum required Hamming distance in the support set generation to 35 and in the theory generation to 15. In Ant Route 1 we used 100 relevant judeged and irrelevant judged training documents, while in Ant Route 2 we used up to 300 relevant judged and up to 500 irrelevant judged training documents (see Figures 2 and 3.)
In all of the other LAD runs we have used a minimum Hamming distance of 15 in the support set generation step. As a preliminary filtering, we considered only terms which appeared in more than 5% of the training documents. Furthermore, in the exhaustive pattern generation step we generated all patterns having a coverage of at least 10% of the training documents. Finally, in the theory generation step we required a minimum Hamming distance of 5.
Since we observed that C4.5 produced quite weak results when used alone (see Figure 4), we also considered a hybrid strategy that we call the two-phase method.
In the first phase, for each topic we used LAD to generate a logical formula explaining the differences between judged and unjudged documents. In the test set and in the training set, the latter are a clear majority. Using this logic, we scored all documents, and choose the top 8100 (for technical reasons). This is an approximation to the set of judged documents.
Then, in the second phase, we used C4.5, or LAD, or a hybrid of these, to develop a logical explanation distinguishing between judged relevant and judged irrelevant documents, and used this logic to score out only the 8100 test documents preselected in the first phase. Then again we chose the top 1000, and computed the paveaverage, as before (see Figures 5, 6 and 7.)
Finally, in Figure 8 we compare the best LAD results (one phase LAD run of Ant Route 2) with the best C4.5 results (hybrid run, using topic terms and tested only on the Phase I results produced by LAD.)
Overall, the result of this preliminary analysis show that the TREC routing task is a very challenging machine learning task. None of the schemes involving C4.5 performed better than a pure LAD scheme (AntRout2) that was our official submission to the TREC7 competition. That specific run, in turn, did not fare well in comparison to the median of TREC participants. Thus there is a great deal to be learned about this specific problem. We are encouraged by the fact that the two-phase method does somewhat better than a one-phase method. This may be an artifact, tracing to the fact that documents judged at TREC form some specific subclass of all documents, defined by an unwritten consensus regarding retrieval methods. However, in a real application, such as learning from the actions and judgments of an individual searching the WWW, the two-phase approach may be quite appropriate, as one theory may be needed to explain whether the user even chooses to judge a document, and another may be needed to capture the relevance feedback information represented by the user's judgments.
This work is supported by the Defense Advanced Research Projects Agency (DARPA) Contract Number N66001-97-C-8537.
![]() |
![]() |
This document was generated using the LaTeX2HTML translator Version 98.1p1 release (March 2nd, 1998)
Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -short_extn -split +0 -local_icons -address kantor@scils.rutgers.edu dip.tex.
The translation was initiated by Vladimir Menkov on 1999-05-30