Information Mapping Project

The goal of the Information Mapping project is intelligent, concept-based information retrieval. Currently, document retrieval from large databases--such as library card catalogs or newspaper archives--is based on keyword search. A query is posed as a list of words, and any entries in the database which contain any or all of those specific words are returned. However, if we treat those query words not as literal strings of letters, but as representing concepts, then we can retrieve relevant documents even if they do not contain the specific words used in the query.

Our basic approach, developed by Hinrich Schütze, begins by recording the frequency of co-occurrence between words in the text; that is, the number of times two words appear in the same database entry. The distribution of co-occurrences between a word and some set of content-bearing terms then serves as a profile of the word's usage, and can accurately associate words with similar meanings. A word can thus be described by a spectrum of related words. The user can choose to accept or reject these relations, thus building an increasingly accurate profile of the meaning they desire. Generalizing this, by comparing the query words' profiles to profiles generated for each document, we can return articles which are conceptually related to the query words, even if the words themselves do not appear in the text.
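The approach described above can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the toy entries, the choice of content-bearing terms, and the use of cosine similarity to compare profiles are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

def word_profiles(entries, content_terms):
    """Count, for every word, how many entries it shares with each content term."""
    profiles = {}
    for entry in entries:
        words = set(entry.lower().split())
        present = words & content_terms
        for word in words:
            profile = profiles.setdefault(word, Counter())
            for term in present:
                if term != word:
                    profile[term] += 1
    return profiles

def cosine(a, b):
    """Cosine similarity between two sparse count profiles (assumed measure)."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def text_profile(text, profiles):
    """Profile of a query or document: the sum of its words' profiles."""
    total = Counter()
    for word in text.lower().split():
        total.update(profiles.get(word, Counter()))
    return total

# Hypothetical toy database; real entries would be catalog records or articles.
entries = [
    "doctor treats heart patient",
    "surgeon treats heart patient",
    "doctor visits hospital patient",
    "surgeon visits hospital patient",
    "bank raises interest rate",
    "bank lowers interest rate",
]
content_terms = {"heart", "patient", "hospital", "interest", "rate"}

profiles = word_profiles(entries, content_terms)
query = text_profile("surgeon", profiles)

# Rank entries by similarity to the query profile, most similar first.
ranked = sorted(
    entries,
    key=lambda entry: cosine(query, text_profile(entry, profiles)),
    reverse=True,
)
for entry in ranked:
    print(round(cosine(query, text_profile(entry, profiles)), 3), entry)
```

In this toy data, "doctor" and "surgeon" never appear in the same entry, yet they receive the same profile, so the entry "doctor visits hospital patient" scores well for the query "surgeon" even though it does not contain that word.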

The information mapping technique adapts readily to other languages: it has been successfully applied to German and Japanese with little modification. We have used the concept space created by a bilingual corpus to perform cross-language information retrieval. Ultimately, if we can request information in one language and retrieve articles written in another, information retrieval will be freed from the constraints of a particular language and users will be able to draw from a rich pool of untranslated materials previously unavailable to them.
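One way to picture a concept space built from a bilingual corpus is sketched below. It assumes, purely for illustration, that each pair of translated documents is treated as a single entry, so that words from both languages are profiled against the same content-bearing terms. The function names, the toy English-German pairs, and the cosine measure are hypothetical and are not taken from the project.

```python
from collections import Counter
from math import sqrt

def bilingual_profiles(pairs, content_terms):
    """Profile words of both languages against one shared set of content terms."""
    profiles = {}
    for source_text, target_text in pairs:
        # Assumption: a translated pair counts as one entry.
        words = set(source_text.lower().split()) | set(target_text.lower().split())
        present = words & content_terms
        for word in words:
            profile = profiles.setdefault(word, Counter())
            for term in present:
                if term != word:
                    profile[term] += 1
    return profiles

def cosine(a, b):
    """Cosine similarity between two sparse count profiles (assumed measure)."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def text_profile(text, profiles):
    """Profile of a query or document: the sum of its words' profiles."""
    total = Counter()
    for word in text.lower().split():
        total.update(profiles.get(word, Counter()))
    return total

# Hypothetical toy corpus of directly translated English-German pairs.
pairs = [
    ("heart surgery", "herz operation"),
    ("heart disease", "herz krankheit"),
    ("loan interest", "kredit zins"),
    ("interest payment", "zins zahlung"),
]
content_terms = {"surgery", "disease", "loan", "payment"}
profiles = bilingual_profiles(pairs, content_terms)

# German documents that have no English translation in the corpus.
untranslated = ["operation herz krankheit", "kredit zahlung"]

# An English query scored against the untranslated German documents.
query = text_profile("heart", profiles)
scores = [cosine(query, text_profile(doc, profiles)) for doc in untranslated]
for doc, score in zip(untranslated, scores):
    print(round(score, 3), doc)
```

Because the English query and the German documents are described by the same content terms, they can be compared directly without translating either one.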

Current Infomap research includes:

  • Understanding and developing the linguistic, logical and algebraic theory which we use to encode and model the meaning of words.
  • Investigating how different training corpora affect the resulting associations between meanings. You can see some of this for yourself by comparing search results using the different models available on our demo. For example, words like "heart" have much more specific associations in the Medical domain than in general newswire.
  • Investigating how to put together meanings from different models. Strong theoretical and practical goals motivate this question. Does a specialist in a field have "extra dimensions of meaning" reflecting their expert knowledge? How does specialist knowledge relate to the more general knowledge which most adults share? How does the arrangement of concepts in one language relate to their arrangement in another? By modelling a "bilingual concept space", can we use a small model built from a bilingual corpus of directly translated documents to allow bilingual searches of untranslated documents?
  • Detecting and resolving ambiguity. Speakers naturally disambiguate words for one another, using simple phrases like "Do you mean fire as in burning or fire as in guns?" We have developed algorithms which model this process.
  • Improving options for the user and methods of interaction. Our users can add and remove aspects of a word's meaning from a query, create clusters of word senses, and visualize the meaning of different words and queries. You can learn more about these options and try them out on the demo.
  • Applying these techniques to build concept-based, cross-language information systems for medical documents. You can try out a part of this project on the bilingual Infomap demo.
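As an illustration of removing one aspect of a word's meaning from a query, the sketch below subtracts from the query profile its projection onto the profile of the unwanted sense. This is one possible technique, assumed here for illustration; the source does not state which method the project uses, and the function names and profile values are hypothetical.

```python
def dot(a, b):
    """Dot product of two sparse profiles stored as dicts."""
    return sum(a.get(key, 0.0) * value for key, value in b.items())

def remove_aspect(query, unwanted):
    """Return the query profile minus its projection onto the unwanted profile."""
    norm_sq = dot(unwanted, unwanted)
    if norm_sq == 0:
        return dict(query)
    scale = dot(query, unwanted) / norm_sq
    keys = set(query) | set(unwanted)
    return {key: query.get(key, 0.0) - scale * unwanted.get(key, 0.0) for key in keys}

# Hypothetical profiles: "fire" mixes a burning sense and a guns sense.
fire = {"flame": 3.0, "smoke": 2.0, "gun": 2.0, "shoot": 1.0}
guns = {"gun": 3.0, "shoot": 2.0}

# "fire as in burning": the part of the profile unrelated to guns.
fire_not_guns = remove_aspect(fire, guns)
print(fire_not_guns)
```

The resulting profile has no remaining overlap with the "guns" profile, while the weights on terms that "guns" does not mention, such as "flame" and "smoke", are unchanged.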

Web Demo

Use our online search engine to create meaning-profiles and retrieve articles related to these meanings. You can choose from a variety of models, including the British National Corpus, North American news articles and collections of Medical documents.

Try it out now.

Partners

Organizations actively collaborating with the Infomap project include

References

"The Search for Mr. Goodfile Generates New Online Tools." Research News, Science 276:5318, 6 June 1997.

Raymond Flournoy, Ryan Ginstrom, Kenichi Imai, Stefan Kaufmann, Genichiro Kikui, Stanley Peters, Hinrich Schütze, and Yasuhiro Takayama. Personalization and Users' Semantic Expectations. ACM SIGIR'98 Workshop on Query Input and User Expectations, Melbourne, Australia, August 1998.

Raymond Flournoy, Hiroshi Masuichi, and Stanley Peters. Cross-Language Information Retrieval: Some Methods and Tools. In D. Hiemstra, F. de Jong and K. Netter, eds, TWLT 14 Language Technology in Multimedia Information Retrieval, pp. 79-83, Universiteit Twente: Enschede, The Netherlands, 1998.

Genichiro Kikui. Term-list Translation Using Monolingual Word Co-occurrence Vectors. COLING-ACL '98, Montreal, August 1998.

Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann, and Stanley Peters. Query Translation Method for Cross Language Information Retrieval. In Proceedings of the Workshop on Machine Translation for Cross Language Information Retrieval, MT Summit VII, pp. 30-34, Singapore, September 1999.

Hinrich Schütze. Ambiguity Resolution in Language Learning. CSLI Publications, 1997. CSLI Lecture Notes number 71.

Yasuhiro Takayama, Raymond Flournoy, Stefan Kaufmann, and Stanley Peters. Information Retrieval Based on Domain-Specific Word Associations. In Proceedings of PACLING '99, Waterloo, Ontario, Canada, June 1999.