Paul Kantor, Myung-Ho Kim, Ulukbek Ibraev, Koray Atasoy
Rutgers University
4 Huntington St, New Brunswick, NJ 08903
kantor@scils.rutgers.edu
http://aplab.rutgers.edu/ant/
In assessing information retrieval systems, it is important to know not only the precision of the retrieved set, but also to compare the number of retrieved relevant items to the total number of relevant items. For large collections, such as the TREC test collections, or the World Wide Web, it is not possible to enumerate the entire set of relevant documents. If the retrieved documents are evaluated, a variant of the statistical ``capture-recapture" method can be used to estimate the total number of relevant documents, providing the several retrieval systems used are sufficiently independent. We show that the underlying signal detection model supporting such an analysis can be extended in two ways. First, assuming that there are two distinct performance characteristics (corresponding to the chance of retrieving a relevant, and retrieving a given non-relevant document), we show that if there are three or more independent systems available it is possible to estimate the number of relevant documents without actually having to decide whether each individual document is relevant. We report applications of this 3-system method to the TREC data, leading to the conclusion that the independence assumptions are not satisfied. We then extend the model to a multi-system, multi-problem model, and show that it is possible to include statistical dependencies of all orders in the model, and determine the number of relevant documents for each of the problems in the set. Application to the TREC setting will be presented.
It has long been recognized [Salton and McGill, 1983] that when estimating the effectiveness of an Information Retrieval System, one ought to take into account some indication of how easy the retrieval problem is. The standard measure of retrieval performance today is a curve which plots the precision of a retrieved set as a function of the recall of that retrieved set. This measure applies easily to modern retrieval systems which rank documents for retrieval. For any length of the ranked set, r, we can compute the number of relevant documents that have been retrieved, g(r). The precision is defined as g(r)/r, and the recall is defined as g(r)/G. In this equation G represents the total number of relevant documents for the given problem in the entire collection available for retrieval. It serves as an indication of the ease or difficulty of the problem in the sense that large G corresponds to a problem that is in some sense easier, and low G represents a problem that is in some sense difficult. If a problem is easy achieving high precision, particularly at the early ranks in a retrieval list, is not particularly impressive, as there are a great many relevant documents and presumably some of them are quite easy to find. In the most complete and authoritative evaluation framework existing today, the TREC Conferences (Harman 1997), the problem of estimating G is finessed by using a so-called pooled estimate of the number of relevant documents. In particular each of some 20 to 70 competing schemes presents a list of 1000 documents in decreasing order of their estimated relevance to the particular problem or topic at hand. The top 100 documents from each list are placed in an evaluation pool, and each unique document in that pool (many documents of course appear on multiple lists) is evaluated by a single expert for its relevance to the topic. The total number of documents judged to be relevant in this process is used as the estimate of the true number of relevant documents Gtrue. In what follows, to be more precise, we will refer to it as Gjudged. It is obvious that (barring some miraculous accident) the true number of relevant documents which we represent by Gtrue, is necessarily larger than this number. This paper is a progress report on a variety of approaches that are being explored to see whether we can make some estimate of Gtrue, based on the fact that we have information from a large number of more or less distinct retrieval systems all dealing with the same problem. We can explain the reasons for optimism by considering the so-called ``capture-recapture" method that is used in environmental management. To estimate the number of wolves in a forest, one captures a certain number N1 of wolves (presumably at random) and tags them. Sometime later (but not too long afterward) one captures another random set of N2 wolves and determines how many of them, N12, have tags. The estimate of the total population is then given N1N2/N12. This follows immediately from the observation that the probability that a wolf is captured twice is given by the product d1d2, of the individual probabilities that the wolf will be captured in each of the two trapping sessions. A method something like this was apparently used in an effort to estimate the size of the web [Lawrence and Giles, 1998], although the results of our work indicate that one should be very cautious in believing such an estimate. Before entering on the technical details we consider a number of ways in which the proposed assumption of independence might fail. Consider first the method of trapping used. Suppose for example the trap is a cage with a door of a certain size which closes automatically when an animal enters it. Even if all wolves are equally likely to be attracted by the cage, the cage will only trap those wolves that are small enough to get into it. Thus there may be some fraction of the wolf population which is not seen in any of our sampling runs and which is completely impervious to estimation by the method proposed. A second difficulty may be described as follows. Suppose we are catching not wolves, but raccoons, which are known to be quite clever with their hands. And suppose the trap has a door which can be opened by a raccoon of above average intelligence but not by an ordinary raccoon. In that case when we come round to check the traps we will have found a certain fraction of the population and not have found another. If as is more likely, the ability to escape from the trap somehow increases monotonically with raccoon intelligence, we would have to admit that our method is more heavily biased towards an estimate of the number of dull-witted raccoons, than to an estimate of the total number of raccoons. The analogy in the information retrieval situation is as follows. Documents corresponding to the wolves that don't fit into the trap are documents whose description in some way carries none of the clues that any of the retrieval systems now in use rely upon for the identification of relevant documents. As all current systems in the TREC setting base their retrieval on the texts of the documents, an obvious example (in fact, excluded from the TREC competition) would be a news article whose accompanying photograph contained useful information about a problem. Since the photograph is not itself represented to any of the systems, it cannot possibly form the basis for the retrieval. The meaning of the second notion, the dull-witted raccoons, is a little bit harder to articulate. It corresponds to the notion that in some sense some of the relevant documents are much easier to retrieve than others. We do not have a precise model for this, but it might, for example, occur when a document contains many synonyms for each of its important concepts. Thus, no matter which of those synonyms was used in a query formulated by a particular system, the document would have a chance of being retrieved. In what follows we will represent this notion instead empirically, considering that a document which is relevant is easy to retrieve if it is retrieved by a large number of the systems in a particular TREC setting.
However, in the TREC setting (this is of course not true of a setting such as the World Wide Web) we actually know the number of animals in the forest to begin with. If we assume that every animal either is a wolf or is not, that is that every document either is relevant or is not, then given a minimum of three retrieval systems, all of which trap relevant documents independently from each other, we are able to develop a set of coupled equations that permit determination of the total number of relevant documents Gtrue. What is remarkable is that these equations, shown below, do not require us to determine whether specific retrieved documents are relevant or not, but only require us to count the number of documents retrieved by each system, and to count the overlaps of the retrieved sets. What results is a set of equations with eight unknowns: the number of relevant documents, the number of documents that are not relevant, and two performance parameters characterizing each of the three systems.
The first performance parameter, called the detection power, is interpreted as the probability that a random relevant document will be in the set retrieved by that system. The second parameter characterizing each system, called the false alarm rate, is the probability that a document which is not relevant will be in the retrieved set given by that system. In the particular case of the TREC setting we have sets of size 1000 retrieved by each of the three systems, and we know the total size of the collection, so the only new quantities to be determined are the pair wise and three-fold overlap of the retrieved sets. Together, these provide eight equations for eight unknowns, which are highly non-linear.
Let
gi(t, r) be the number of relevant documents retrieved by
system i, and
gij(t, r) the number of relevant documents
retrieved by both system i and j, up to rank r, for a topic t.
We define
gij(t, r) by
We have found, for almost all topics, that we can approximately fit
by an expression of the form
,
where K(t) and
are certain constants depending
on the topic t. In a variant calculation, we used SAS non-linear
regression to fit several gi(r) simultaneously, using a single
value of K, presumed to be the asymptotic ``true" value of G,
together with topic-dependent values
.
To avoid
excessive computation (and increase the chance that we would approach a
reasonable value) we restricted this calculation to a set of 6
systems, by choosing one from each of the highest 6 pairs of systems
at TREC6.
Unfortunately, both attacks are thwarted, as the average of the
estimates again falls short of even the number of judged relevant
documents, and thus can not possibly measure the even larger number of
documents that are truly relevant. In retrospect it is obvious that
the this must fail. The true number of relevant documents is larger
than the number retrieved by even the best system, and ipso facto
larger than the average among system that are both good and bad.
Figure 1 shows the low values of the asymptotic estimate for G that result from averaging many systems.
Our final attempt to estimate the number of relevant documents, making
use of all the avaliable judgments on documents retrieved by 31 systems
participating in the TREC6 routing task, involves forming a
hypothetical super-system called the union system. The
cumulated relevant document function for this system is defined by the
following procedure. For each rank r, increase
by
the number of relevant documents appearing at rank r in any of the
lists produced by the 31 retrieval systems, and not having already
appeared previously on any of the lists. In contrast to the averaging
methods used before, this at least ensures that the curve computed
will do better than any specific system and, if TREC judging were to
strictly follow the published rules,
.
In
fact, there are slight discrepancies. However, this method of
extrapolation does tend to give estimates that are higher than the
number already known to be relevant through human judging.
In Table 1, the Topic ID refers to the TREC classification of topics.
is the average of gij over all pairs of systems,
evaluated when each system has retrieved 100 documents.
is
defined in the text.
is the total number of documents
judged relevant for each topic, as reported in the TREC qrels
files.
is the asymptote of the non-linear model, applied to
the function
,
which is the average of the estimates produced
by all possible pairs of systems. The fitting is done over the range
.
The ratio of
to
is denoted by
K/G.
is the coefficient of the non-linear regression
model, applied now to the cumulated function
,
which
includes all of the documents judged relevant by the TREC analysts.
The table has been sorted in deceasing order of the ratio of
to
.
If our model were achieving realistic estimates of the
true number of relevant documents, G, this ratio should always be
greater than 1. In fact, we see that it is greater than 1 in only 13
cases, and falls as low as 61%.
We are in possession of a mathematical tool, the coupled equations for estimating the number of relevant documents, which is very tantalizing, but cannot quite fulfill its promise. The assumptions of independence are simply not met in the available data. This of course raises the possibility that other estimates, such as the estimate of the size of the WWW, are also unreasonably low.
We can see several paths to further exploration. One is to develop
alternative models for the dependence of g(r) on r. The
exponential model, which is appropriate for infinite collections, may
not be the right one to use. An attractive alternative is found in
logarithmic models, in which
,
which provides an
extrapolation based on the true size of the collection, which is a
large number (175,000), but not infinite.
Another path for further exploration is based on attempts to model the
lack of stochastic independence among systems. If systems are
``similar'', perhaps that similarity is more or less independent of
the specific topic. This can be expressed, for example, in a
log-linear model, which sets
G
,
permitting di to depend on t, while
is
independent of t.
This work is supported by the Defense Advanced Research Projects Agency (DARPA) Contract Number N66001-97-C-8537.
This document was generated using the LaTeX2HTML translator Version 98.1p1 release (March 2nd, 1998)
Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -short_extn -split +0 -local_icons -address kantor@scils.rutgers.edu findg.tex.
The translation was initiated by Vladimir Menkov on 1999-05-28