Seminar on Computational Learning and Adaptation


 
Integrating Multiple Sources of Information in Text Classification

Haym Hirsh
Computer Science Department
Rutgers University

hirsh@cs.rutgers.edu


One popular class of supervised learning problems concerns the task of classifying items that consist primarily of text.  Such problems arise when assessing the relevance of a Web page to a given user, assigning topic labels to technical papers, or assessing the importance of a user's email.  Most commonly, this task is performed by extrapolating, from a corpus of labeled training documents, procedures for classifying further unlabeled documents.  This talk presents an approach to corpus-based text classification built on WHIRL, a database system that augments traditional relational database technology with textual-similarity operations developed in the information retrieval community.  We show that the approach not only performs competitively with state-of-the-art text classification methods, but also enables a range of hitherto unexploitable sources of information to be incorporated into the classification process in a fairly robust and general fashion.


Date: Friday, March 3, 2000

Time: 2:15-3:30PM

Place: Gates 104

