Integrating Multiple Sources of Information in Text Classification
Haym Hirsh
Computer Science Department
Rutgers University
hirsh@cs.rutgers.edu
One popular class of supervised learning problems concerns the
task of classifying items that consist primarily of text. Such
problems arise when assessing the relevance of a Web page to a
given user, assigning topic labels to technical papers, and
assessing the importance of a user's email. Most commonly, the
task is performed by learning, from a corpus of labeled training
documents, procedures for classifying further unlabeled documents.
This talk presents an approach for corpus-based text
classification based on WHIRL, a database system that augments
traditional relational database technology with
textual-similarity operations developed in the information
retrieval community. Not only does the approach perform
competitively with state-of-the-art text classification
methods, but we also show that it enables the incorporation of
a range of previously unexploitable sources of information into
the classification process in a fairly robust and general
fashion.
Date: Friday, March 3, 2000
Time: 2:15-3:30 PM
Place: Gates 104