Information Mapping Project
The goal of the Information Mapping project is intelligent,
concept-based information retrieval.
Currently, document retrieval from large databases--such as library
card catalogs or newspaper archives--is based on keyword search. A
query is posed as a list of words, and any entries in the database
which contain any or all of those specific words are returned.
However, if we treat those query words not as literal strings of
letters, but as representing concepts, then we can retrieve
relevant documents even if they do not contain the specific words used
in the query.
Our basic approach, developed by Hinrich Schütze, begins by
recording the frequency of co-occurrence between words in the text;
that is, the number of times two words appear in the same database
entry. The distribution of co-occurrences between a word and some set
of content-bearing terms then serves as a profile of the word's usage,
and can accurately associate words with similar meanings. A word can
thus be described by a spectrum of related words. The user can choose
to accept or reject these relations, thus building an increasingly
accurate profile of the meaning they desire. Generalizing this, by
comparing the query words' profiles to profiles generated for each
document, we can return articles which are conceptually related to
the query words, even if the words themselves do not appear in the
text.
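The approach described above can be illustrated with a minimal sketch. This is not the project's implementation: the toy entries, the choice of content-bearing terms, the use of cosine similarity as the comparison measure, and all function names are assumptions made for illustration. Each word is profiled by how often it shares a database entry with each content-bearing term, and a document's profile is taken to be the sum of its words' profiles.

```python
from collections import Counter
from math import sqrt

def cooccurrence_profiles(entries, content_terms):
    """Profile each word by how many entries it shares with each content term."""
    content = set(content_terms)
    profiles = {}
    for entry in entries:
        words = set(entry.lower().split())
        present = words & content
        for word in words:
            profile = profiles.setdefault(word, Counter())
            for term in present:
                if term != word:
                    profile[term] += 1
    return profiles

def cosine(a, b):
    """Cosine similarity between two sparse profiles (an assumed measure)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def text_profile(text, profiles):
    """Sum the profiles of the words in a query or document."""
    total = Counter()
    for word in text.lower().split():
        total.update(profiles.get(word, {}))
    return total

def rank(query, entries, profiles):
    """Order entries by profile similarity to the query, best first."""
    q = text_profile(query, profiles)
    scored = [(cosine(q, text_profile(e, profiles)), e) for e in entries]
    return sorted(scored, reverse=True)

# Hypothetical toy database and content-bearing terms.
entries = [
    "doctor treats patient heart disease",
    "physician treats patient heart surgery",
    "bank lends money interest loan",
    "investor money bank stock market",
]
content_terms = ["patient", "heart", "money", "bank"]

profiles = cooccurrence_profiles(entries, content_terms)
for score, entry in rank("physician", entries, profiles):
    print(f"{score:.2f}  {entry}")
```

In this toy data the first entry never mentions "physician", yet it ranks alongside the second because "doctor" and "physician" have the same co-occurrence profile, while the two finance entries score zero.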
The information mapping technique adapts readily to other languages and
has been successfully applied to German and Japanese with little
modification. We have used the concept space created by a bilingual
corpus to perform cross-language information retrieval.
Ultimately, if we can request information in one language and retrieve
articles written in another, information retrieval will be freed from
the constraints of a particular language and users will be able to draw
from a rich pool of untranslated materials previously unavailable to
them.
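As a rough illustration of how a bilingual corpus can yield a shared concept space, the sketch below profiles each word by the aligned document pairs it occurs in, so that words from either language which appear in the same translated pairs receive similar profiles. This is a deliberately simplified stand-in, not the project's method: the English and German toy pairs, the `en:` and `de:` prefixes, the cosine measure, and the function names are all assumptions.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_space(aligned_pairs):
    """Map each language-tagged word to counts over the aligned pairs it occurs in."""
    space = {}
    for pair_id, (en_text, de_text) in enumerate(aligned_pairs):
        for lang, text in (("en", en_text), ("de", de_text)):
            for word in text.lower().split():
                space.setdefault(f"{lang}:{word}", Counter())[pair_id] += 1
    return space

def nearest(space, word, lang):
    """Rank the words of one language by similarity to the given word."""
    target = space[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in space.items()
        if other.startswith(lang + ":")
    ]
    return sorted(scored, reverse=True)

# Hypothetical aligned corpus: (English text, German translation).
aligned = [
    ("heart disease patient", "herz krankheit patient"),
    ("heart surgery", "herz operation"),
    ("bank money loan", "bank geld kredit"),
    ("money market", "geld markt"),
]

space = build_space(aligned)
for score, word in nearest(space, "en:heart", "de")[:3]:
    print(f"{score:.2f}  {word}")
```

An English query word can then be compared directly with German words, or with German documents built from those words, without translating the query first.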
Current Infomap research includes:
- Understanding and developing the linguistic, logical and algebraic
theory which we use to encode and model the meaning of words.
- Investigating how different training corpora affect the resulting
associations between meanings. You can see some of this for yourself
by comparing search results using the different models available on
our demo. For example, words like "heart" have much more specific
associations in the Medical domain than in general newswire.
- Combining meanings from different models. A natural question to ask
next is "how do you put together meanings from different models?"
Both theoretical and practical goals motivate this question. Does a
specialist in a field have "extra dimensions of meaning" reflecting
their expert knowledge? How does specialist knowledge relate to the
more general knowledge which most adults share? How does the
arrangement of concepts in one language relate to their arrangement in
another? By modelling a "bilingual concept space", can we use a small
model built from a bilingual corpus of directly translated documents
to allow bilingual searches of untranslated documents?
- Detecting and resolving ambiguity. Speakers naturally disambiguate
words for one another, using simple phrases like "Do you mean fire as
in burning or fire as in guns?" We have developed algorithms which
model this process.
- Improving options for the user and methods of interaction. Our users
can add and remove aspects of a word's meaning from a query, create
clusters of word senses, and visualize the meaning of different words
and queries. You can learn more about these options and try them out
on the demo.
- Applying these techniques to build concept-based, cross-language
information systems for medical documents. You can try out a part of
this project on the bilingual Infomap demo.
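One way to picture adding and removing aspects of a word's meaning, as mentioned in the list above, is as arithmetic on word vectors. The sketch below is an assumption about how this could be done, not a description of the project's algorithms: it removes an unwanted aspect by subtracting the projection of the query vector onto the unwanted word's vector. The four dimensions and the toy vectors for "fire", "guns" and "burning" are hypothetical.

```python
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm = sqrt(dot(a, a)) * sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def add_aspect(query, aspect):
    """Add an aspect of meaning by summing the two vectors."""
    return [x + y for x, y in zip(query, aspect)]

def remove_aspect(query, aspect):
    """Remove an aspect by keeping only the part of query orthogonal to it."""
    denom = dot(aspect, aspect)
    if denom == 0:
        return list(query)
    scale = dot(query, aspect) / denom
    return [x - scale * y for x, y in zip(query, aspect)]

# Hypothetical vectors over the dimensions (flame, smoke, gun, shoot).
fire = [2.0, 1.0, 2.0, 1.0]
guns = [0.0, 0.0, 3.0, 2.0]
burning = [3.0, 2.0, 0.0, 0.0]

fire_not_guns = remove_aspect(fire, guns)
print("fire vs burning:          ", round(cosine(fire, burning), 3))
print("fire-not-guns vs burning: ", round(cosine(fire_not_guns, burning), 3))
print("fire-not-guns vs guns:    ", round(cosine(fire_not_guns, guns), 3))
```

After the "guns" aspect is removed, the query vector has no component along "guns" and lies closer to "burning", which mirrors the disambiguating question "fire as in burning or fire as in guns?".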
Web Demo
Use our online search engine to create meaning-profiles and retrieve
articles related to these meanings. You can choose from a variety of
models, including the British National Corpus, North American news
articles and collections of Medical documents.
Partners
Organizations actively collaborating with the Infomap project include
References
- "The Search for Mr. Goodfile Generates New Online Tools." Research
News, Science 276:5318, 6 June 1997.
- Raymond Flournoy, Ryan Ginstrom, Kenichi Imai, Stefan Kaufmann,
Genichiro Kikui, Stanley Peters, Hinrich Schütze, and Yasuhiro Takayama.
Personalization and Users' Semantic Expectations.
ACM SIGIR'98 Workshop on Query Input and User Expectations,
Melbourne, Australia, August 1998.
- Raymond Flournoy, Hiroshi Masuichi, and Stanley Peters.
Cross-Language Information Retrieval: Some Methods and Tools. In
D. Hiemstra, F. de Jong and K. Netter, eds, TWLT 14 Language
Technology in Multimedia Information Retrieval, pp. 79-83,
Universiteit Twente: Enschede, The Netherlands, 1998.
- Genichiro Kikui. Term-list Translation Using Monolingual Word
Co-occurrence Vectors. COLING-ACL '98, Montreal, August 1998.
- Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann, and Stanley
Peters. Query Translation Method for Cross Language Information
Retrieval. In Proceedings of the Workshop on Machine Translation for
Cross Language Information Retrieval, MT Summit VII, pp. 30-34,
Singapore, September 1999.
- Hinrich Schütze. Ambiguity Resolution in Language Learning.
CSLI Lecture Notes number 71. CSLI Publications, 1997.
- Yasuhiro Takayama, Raymond Flournoy, Stefan Kaufmann,
and Stanley Peters.
Information Retrieval Based on Domain-Specific Word Associations.
In Proceedings of PACLING '99, Waterloo, Ontario, Canada, June 1999.