Word Vectors in the Infomap Model

In the Infomap project we build models based upon patterns of word usage. For example, consider the following quotations (from the New York Times, 1996):
HOT-FROM-THE-OVEN MEALS: Keep hot food HOT; warm isn't good enough. Set the oven temperature at 140 degrees or hotter. Use a meat thermometer. And cover with foil to keep food moist. Eat within two hours.

"Change is always happening," said the ebullient trumpeter, whose words tumble out almost as fast as notes from his trumpet. "That's one of the wonderful things about jazz music." For many jazz fans, Ferguson is one of the wonderful things about jazz music.

The words "music" and "food" are good candidates for content-words: they are normally unambiguous and seem to have clear meanings in terms of which other words can be defined. (Though one can almost always find potentially confusing uses, such as "If music be the food of love, play on".) Having chosen such words from the corpus (and that's a whole story in itself), we examine the words around them. The above examples would begin to give us the following count-data:
         eat   hot   jazz   meat   trumpet
Music    0     0     3      0      1
Food     1     2     0      1      0

We proceed through the whole corpus of documents like this, for each word building up a "number signature" which tells us how often that word appeared in the presence of each content-bearing word. Many words, perhaps even most, appear in many contexts, and some words (particularly pronouns and number words) can appear in almost any context. Other words - like jazz and meat - occur regularly in the same few contexts.
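
As a rough illustration of this counting step, here is a minimal Python sketch. It is not the Infomap implementation: the tokenizer, the function names (`cooccurrence_counts`, `word_vector`) and above all the fixed window of three tokens on either side of a content-word are assumptions made for the example, since the text above does not say how "the words around" a content-word are delimited.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def cooccurrence_counts(documents, content_words, window=3):
    """For every word, count how often it occurs within `window` tokens
    of each content-word. The window size is an assumption of this sketch."""
    content_words = set(content_words)
    counts = defaultdict(Counter)
    for doc in documents:
        tokens = tokenize(doc)
        for i, token in enumerate(tokens):
            if token not in content_words:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]][token] += 1
    return counts


def word_vector(counts, word, content_words):
    """Return the 'number signature' of `word`: one count per content-word."""
    return [counts[word][c] for c in content_words]


# Fragments of the two quotations above, used as a toy corpus.
documents = [
    "Keep hot food HOT; warm isn't good enough.",
    "That's one of the wonderful things about jazz music. "
    "For many jazz fans, Ferguson is one of the wonderful things about jazz music.",
]
content_words = ["music", "food"]
counts = cooccurrence_counts(documents, content_words)

for word in ["jazz", "hot"]:
    print(word, word_vector(counts, word, content_words))
```

A different window (a sentence, a paragraph, a whole document) would give different counts; the point is only that each word ends up with one number per content-word.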
The huge table or matrix built by this process gives us a profile of the way different words are used across the corpus of texts. Each word is given a list of numbers, and this list of numbers is called a word-vector. A good way to think of these numbers is as "meaning-coordinates", just as latitude and longitude associate spatial coordinates with points on the surface of the earth. If you find such abstract ideas interesting, you might want to read an introduction to vectors which I am gradually extending. (And upon which I would be very glad for any criticism - please feel free to send your comments.)

One simple application of this model is word-association. You can type in a word - or a longer query - and Infomap will search for new words with similar meaning-coordinates. This allows you to select the meanings you do intend to use - and reject the ones you don't. This is a powerful tool for resolving ambiguity and narrowing down a search until your query describes exactly the meaning you desire.
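
To make "similar meaning-coordinates" concrete, here is a small Python sketch of a word-association lookup. Cosine similarity is used here as one common way of comparing vectors; the text above does not say which measure Infomap uses, so treat it as an assumption. The function names (`cosine`, `nearest_words`) are hypothetical, and the toy vectors are simply the (music, food) counts from the small count table earlier.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_words(query, vectors, k=3):
    """Return the k words whose vectors are most similar to the query word's.
    Ties are broken alphabetically so the result is deterministic."""
    query_vec = vectors[query]
    scored = [
        (cosine(query_vec, vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:k]]


# Toy word-vectors: (count near "music", count near "food").
vectors = {
    "eat": [0, 1],
    "hot": [0, 2],
    "jazz": [3, 0],
    "meat": [0, 1],
    "trumpet": [1, 0],
}

print(nearest_words("jazz", vectors, k=1))
print(nearest_words("meat", vectors, k=2))
```

With only two coordinates the neighbours are trivial, but the same lookup applies unchanged to vectors with one coordinate per content-word in a real corpus.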

Try it out now.


Last modified: Tue Jul 10 16:21:08 PDT 2001