Note upon publication on the WWW: This essay is reproduced in unedited form, but beware that it is 'unfinished' in that the fictional extension to the Amalthaea IR system is never actually proposed!
This paper reviews current technology in the fields of information retrieval and cross-language querying, and explores how these techniques might be integrated into a multilingual information discovery system. An extension to the Amalthaea IR system (as outlined by Moukas and Maes, 1998) is proposed, in order that it might operate in a multilingual environment, where the language of user queries may differ from the languages of the source document collection. Finally, the issues surrounding selection of suitable test data and evaluation criteria for such multilingual IR systems are discussed.
Early information access systems were designed for use by professional users such as librarians, in relatively narrow application domains where the corpus to be searched was generally monolingual. It seems reasonable that someone searching a monolingual corpus should also be able to formulate a query in that language, and therefore "...the vast majority of the demand for text retrieval is well satisfied by monolingual systems." (Oard and Dorr 1996). Thus there has historically been little overlap between the fields of machine translation/computational linguistics and information access.
With the recent growth of the internet and world wide web, however, there is an increasing demand by non-expert users for information discovery systems that will allow them to effectively exploit this new multilingual environment. Although initially English was the predominant language of the internet ("...the de facto standard language of both commerce and science...", Oard and Dorr, 1996), there is an increasing body of information being published electronically in other languages. Some have estimated that within a few years less than half the total content of the world wide web will be in English, leaving much potentially valuable commercial or scientific information inaccessible to non-linguists.
In the realm of large multilingual document collections such as the world wide web, information retrieval enjoys something of a synergistic relationship with the field of machine translation. For reasons examined later, machine translation systems are rarely directly employed in the actual process of locating documents relevant to a user's query. However, matching documents are likely to need translation into the query language so that the user can read them. Often this final stage of automatic translation is integrated into the text retrieval system, as illustrated figuratively by Oard and Dorr (1996).
The remainder of the paper is organised as follows: a brief overview of a generic text retrieval system, a discussion of existing work in the field of multilingual information retrieval, details of the proposed extension to Amalthaea, an outline research plan for testing the proposed system, and finally a few concluding comments about the material covered.
The goal of any information retrieval system is to satisfy a user's "information need" (Oard and Dorr, 1996), which is expressed to the system in the form of a query. Queries commonly consist of only one or two words, and from this the system must attempt to find documents relevant to the user's request. The intention is to save the user from manually examining hundreds (or thousands) of documents, but the system is only useful if a high percentage of the documents returned would have been perceived by a human searcher as relevant.
In the IR system, representation functions operate on the query and documents, in order to preprocess them into a form suitable for efficient retrieval and comparison. A typical representation function removes 'stop' words such as "a", "of" and "the" from the source text, since these are very frequent and are not important distinguishing features. Words are conflated using stemming algorithms to remove unimportant suffixes, which would otherwise inhibit term matching. The motive of all this is to make term comparison as fast as possible and reduce storage requirements for the document index. The original documents in the collection are retained separately from the automatically-compiled index, ready for eventual retrieval.
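The representation function described above might be sketched as follows. This is a minimal illustration only: the stop list, the suffix list and the names STOP_WORDS, SUFFIXES, crude_stem and represent are assumptions made for the example, and the suffix stripper is a deliberately crude stand-in for a real stemming algorithm.

```python
import re

# Illustrative stop list (an assumption); real systems use far longer lists.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to", "is"}

# Illustrative suffixes, tried in this order. This is NOT a published stemmer.
SUFFIXES = ("ing", "es", "ed", "s")


def crude_stem(word):
    """Strip at most one suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def represent(text):
    """Lower-case, tokenise, drop stop words, then conflate by crude stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

Applying represent() to both the query and each document means that "documents" in a query and "document" in a text reduce to the same index term, which is the conflation effect the paragraph describes.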
Broadly speaking IR systems can be categorized according to the type of text retrieval system incorporated: In an 'exact match' retrieval system the user's query takes the form of a boolean expression, and the system returns an unranked list of all documents that absolutely satisfy this binary judgement. In a 'ranked retrieval' system the user's query is expressed in natural language or simply keywords, and a ranked list of documents is returned. Ideally those documents at the head of the ranked list are those that correspond best to the user's query, according to some measure of similarity.
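The difference between the two categories can be shown on a toy collection. The sketch below is an assumption-laden illustration: the three documents, the function names exact_match and ranked, and the use of a simple count of shared terms as the similarity measure are all invented for the example.

```python
# Toy collection (hypothetical): each document is reduced to a set of index terms.
docs = {
    "d1": {"machine", "translation", "system"},
    "d2": {"information", "retrieval", "system"},
    "d3": {"multilingual", "information", "retrieval"},
}


def exact_match(required, excluded=frozenset()):
    """Boolean AND / NOT query: return the unranked set of satisfying documents."""
    return {
        name
        for name, terms in docs.items()
        if set(required) <= terms and not (set(excluded) & terms)
    }


def ranked(query_terms):
    """Keyword query: order matching documents by shared-term count, best first."""
    scores = {name: len(set(query_terms) & terms) for name, terms in docs.items()}
    hits = [name for name in scores if scores[name] > 0]
    # Ties are broken alphabetically so that the ordering is deterministic.
    return sorted(hits, key=lambda name: (-scores[name], name))
```

Note that exact_match() either accepts or rejects a document outright, whereas ranked() still returns a document that shares only one term with the query, placed lower in the list.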
"Amalthaea is an evolving, multi-agent ecosystem for personalized filtering, discovery, and monitoring of information sites." (Moukas and Maes, 1998)
The Amalthaea project at MIT Media Laboratory is one of a number of systems in recent years that attempt to protect their users from suffering from "information overload" on the world wide web. Other such WWW agent systems include 'Webwatcher' (Carnegie Mellon University) and 'Letizia' (also developed at MIT).
The overall operation of the system is governed by "...an artificial ecosystem of evolving agents that cooperate and compete..." These agents operate under a penalty/reward regime, with feedback from the user about document relevance being distributed between the agents responsible for fetching and proposing those documents.
There are two general species of agent in Amalthaea:
Filtering agents are responsible for the personalisation of the system, their major feature being that their genotype (that part of their data that 'evolves') consists of a weighted keyword vector. They issue 'requests' (consisting of a list of keywords) to information discovery agents about the type of documents they are interested in finding. Each filtering agent believes it is the perfect model of the user's interests.
Information discovery agents are responsible for finding and fetching information from different internet search engines, according to the parameters encoded in their genotype. They select which information filtering agents they want to work for (based upon a history of previous profitable collaborations), send out queries to search engines, collect the results, and present these to the filtering agents who requested them.
As can be seen, in terms of information retrieval systems, Amalthaea is fairly radical: The ordering of the ranked list of documents returned is not merely determined by a term matching formula, but also by the fitness of the various filtering agents proposing documents. Also, the traditional roles have been reversed - rather than the user requesting information from the IR system, Amalthaea's filtering agents offer the user information that they 'think' would be of interest.
However, the formula used by the information filtering agents to assess the similarity of their genotype to a given webpage is in fact a standard IR algorithm based on the vector space model: Salton and Buckley's (1988) recommended formula for combining within-document term frequencies with the IDF weight. (Divide the dot product of the two keyword vectors by the product of their magnitudes.)
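A minimal sketch of this kind of similarity measure is given below. The essay does not state which term-frequency and IDF variants Amalthaea uses, so the weighting tf * log(N / df) and the names tf_idf_vector and cosine are assumptions for illustration, not a reproduction of the system's actual formula.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, doc_freq, n_docs):
    """Weight each term by its frequency times log(N / document frequency).

    doc_freq maps a term to the number of documents containing it (assumed
    to be supplied by the index); terms with no recorded frequency are dropped.
    """
    counts = Counter(tokens)
    return {
        term: tf * math.log(n_docs / doc_freq[term])
        for term, tf in counts.items()
        if doc_freq.get(term, 0) > 0
    }


def cosine(u, v):
    """Dot product of two sparse keyword vectors over the product of magnitudes."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty or all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)
```

Dividing by the magnitudes means that a long web page is not favoured over a short one merely for containing more words; only the direction of the two keyword vectors is compared.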
There are a number of differing meanings that have become attached to the word 'multilingual' in the context of information retrieval: IR systems that can be configured to work in different languages are sometimes referred to as 'multilingual', although as Oard and Dorr (1996) point out, "...both the query and the documents must be in the same language, so such systems are actually monolingual...". An alternative definition is IR on a multilingual (or explicitly 'parallel') document collection, but where the search space is restricted to the language in which the query is expressed (Hull and Grefenstette).
In the context of this paper, however, 'multilingual text retrieval' may be taken to mean IR where the language of the query differs from that of the documents selected. This covers all of the remaining cases cited by Hull and Grefenstette: The document collection as a whole may be either monolingual or multilingual, and in the extreme cases different languages may be used within a single document.
To clarify the distinction between cross-language querying and other 'multilingual' IR systems, research in this area of information retrieval is sometimes called 'Cross-Language Text Retrieval' - CLIR (Ogden et al) or 'Translingual IR' (Frederking et al, 1997).
At a very basic level, there are only two approaches that can be taken when it comes to the design of a multilingual information retrieval system. The query language and the document language differ, and yet a query representation must be compared with each document representation in order to determine the degree of similarity: Either the query must be translated into the document language, or the document must be translated into the query language.
In reality the latter option is almost never taken, because translating a query once (or even for every language in a multilingual collection) is much more efficient than translating every document in the collection into the query language. Even today performing document translation dynamically is not a realistic option, because current machine translation systems just aren't fast enough. If translated document representations were stored then storage costs would increase linearly with the number of supported languages (Hull and Grefenstette).
Another advantage of query rather than document translation is that a query translation module may be added to an existing IR system relatively easily, when compared with the cost of modifying the entire document base and redesigning the system for multi-lingual retrieval.
However, it has been suggested by Oard and Dorr (1996) that adopting a strategy of translating documents rather than queries might yield improved accuracy. The concern is that a one or two-word query does not give an automatic translation system much context on which to base semantic choices, e.g. deciding which translation to assign to a polysemous word (one with a number of meanings).
There are some issues concerning the use of a conventional machine translation system as a component of a multilingual IR system: In general, much effort in machine translation is expended on choosing the correct word order and on generating grammatical sentences. However, as described earlier, a typical document or query representation function will remove 'stop' words (prepositions and conjunctions) and thereby any grammatical structure of the document, thus wasting any such effort in translation.
Not only this, but Oard and Dorr (1996) suggest that "...some of the work done by a machine translation system could actually reduce some measures of retrieval effectiveness.": Whereas a machine translation system must attempt to deduce the sense in which a polysemous word has been used (often wrongly), a multilingual IR system's custom-written query representation function can usefully preserve any ambiguities of the translation, for exploitation in the process of document/query comparison.
This scheme of preserving polysemous ambiguity is used in 'term vector translation', a popular alternative to machine translation favoured by many practitioners in multilingual IR. As described by Hull and Grefenstette, "...each word in the source language is mapped to all its possible definitions in the target language." In the simplest case, a retrieval system will simply search each document for all possible translations of every query term. This will have the effect of increasing the number of relevant documents found (increased 'recall') at the expense of finding some irrelevant ones (decreased 'precision'). I shall discuss IR performance metrics later.
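The simplest case of term vector translation can be sketched in a few lines. The English-to-French dictionary below is a hypothetical fragment, and the decision to pass untranslatable terms through unchanged is an assumption of this example rather than something prescribed by Hull and Grefenstette.

```python
# Hypothetical bilingual dictionary fragment (English -> French).
BILINGUAL = {
    "bank": ["banque", "rive"],  # polysemous: financial institution / riverside
    "interest": ["intérêt"],
}


def translate_query(terms, dictionary):
    """Map each source-language term to all of its listed translations."""
    translated = []
    for term in terms:
        # Assumption: terms missing from the dictionary (e.g. proper nouns)
        # are kept as they are rather than discarded.
        translated.extend(dictionary.get(term, [term]))
    return translated


def matches(doc_tokens, translated_terms):
    """Simplest case: a document matches if it contains any translation."""
    return bool(set(doc_tokens) & set(translated_terms))
```

With the query "bank interest", a French document about a riverside walk would match on "rive" even though the user presumably meant the financial sense, which is the loss of precision mentioned above.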
There can be problems with this extended representation of the translated query introducing bias into an IR system: Polysemous query terms may have their importance artificially inflated compared to terms that are represented by only one translation, and some translations will occur in documents more frequently than others (Hull and Grefenstette). These issues are generally resolved by use of sophisticated term-weighting strategies, such as the 'expansion translation' described by Frederking et al. "...in which all meanings of all query terms are generated, properly weighed for base-line and co-occurance statistics, so that no meaning is lost." It is claimed that preliminary experimentation with this weighted term vector translation system "...yielded recall results comparable to a carefully hand crafted target-language query..."
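One very simple way of removing the first source of bias is to give every source term a total weight of one, shared equally among its translations. The sketch below illustrates only this uniform split; it is not the 'expansion translation' of Frederking et al., which also draws on base-line and co-occurrence statistics that are not modelled here. The dictionary and the function names are hypothetical.

```python
# Hypothetical bilingual dictionary fragment (English -> French).
BILINGUAL = {
    "bank": ["banque", "rive"],
    "interest": ["intérêt"],
}


def weighted_translation(terms, dictionary):
    """Give each source term total weight 1.0, split evenly over its translations."""
    weights = {}
    for term in terms:
        candidates = dictionary.get(term, [term])
        share = 1.0 / len(candidates)
        for candidate in candidates:
            weights[candidate] = weights.get(candidate, 0.0) + share
    return weights


def score(doc_tokens, weights):
    """Sum the weights of the translated terms that occur in the document."""
    present = set(doc_tokens)
    return sum(w for term, w in weights.items() if term in present)
```

Under this scheme a document containing both "banque" and "rive" scores no more than one containing "intérêt" alone, so the polysemous term "bank" no longer counts double. The second source of bias, unequal frequency of the translations, would need corpus statistics that this sketch does not attempt to supply.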
However, the acquisition of information about the probability that each sense of a polysemous word is correct requires statistical analysis of a large number of documents in the target domain of the IR system.
The standard measures for assessing the performance of information retrieval systems are known as 'recall' and 'precision'. As Oard and Dorr (1996) explain "...it is assumed that documents are either relevant to the query or they're not, and that 'relevance judgement' can be reliably ascertained by a user." There is a lot of demand amongst IR researchers for test document collections with accompanying relevance judgements, since constructing and judging such text collections is a huge task.
'Precision' refers to the proportion of those documents retrieved that the user deems to be relevant to his need (relevant_found / (relevant_found + irrelevant_found)), whilst 'recall' is the proportion of those documents deemed relevant that the search successfully picked up (relevant_found / (relevant_found + relevant_notfound)). Evaluation of ranked retrieval systems is more complex, but one common measure for such systems is 'average precision'. However, comparisons are only meaningful when the same test collection of queries and documents is used.
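These measures are straightforward to compute once relevance judgements are available. In the sketch below the function names are invented for the example, and 'average precision' is taken to be the non-interpolated variant (precision at the rank of each relevant document found, averaged over all relevant documents); the essay does not say which variant is intended, so that choice is an assumption.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for an unranked set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Non-interpolated average precision for a ranked list without duplicates."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this cut-off point
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant) if relevant else 0.0
```

For example, if four documents are retrieved and two of them are among three relevant documents in the collection, precision is 2/4 and recall is 2/3.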
For obvious reasons, these two measures tend to work in opposition to each other - a system tuned to high recall is likely to pick up a lot of unintended documents, whereas one tuned to high precision may not pick up any irrelevant material, but will almost certainly have missed something. It is up to the system designer to make an assessment of the relative importance of the two metrics (e.g. in a patent search recall matters most, since you would want to be absolutely certain that an idea has not previously been patented).
A. Moukas and P. Maes, "Amalthaea: An Evolving Multi-Agent Information Filtering and Discovery System for the WWW", in Autonomous Agents and Multi-Agent Systems, 1, p.59-88, Kluwer Academic Publishers, 1998
D. Hull and G. Grefenstette, "Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval"
D. Oard and B. Dorr, "A Survey of Multilingual Text Retrieval", Technical Report UMIACS-TR-96-19, University of Maryland, 1996
D. Hull, "Using Structured Queries for Disambiguation in Cross-Language Information Retrieval", Rank Xerox Research Centre, France, 1997
R. Frederking et al., "Translingual Information Access", presented at AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, 1997
W. Ogden et al., "Keizai: An Interactive Cross-Language Text Retrieval System", Computing Research Lab, New Mexico State University