Seminar on Computational Learning and Adaptation


 
An Information-Theoretic  Measure of Clustering Accuracy

Byron E. Dom
Information Management Principles
IBM Almaden Research Center
San Jose, California
dom@almaden.ibm.com

The most commonly encountered unsupervised learning problem is that of clustering. The question of how one measures the accuracy of clustering results does not have a straightforward answer in most situations however. We propose a measure of clustering quality or accuracy that is appropriate in situations where it is desirable to evaluate a clustering algorithm by somehow comparing the clusters it produces with ``ground truth'' consisting of classes assigned to the patterns by manual means or some other means in whose veracity there is confidence. Such measures are referred to as ``external''. Our measure also has the characteristic of allowing partitions with different numbers of clusters to be compared in a quantitative and principled way. Our evaluation scheme quantitatively measures how useful the cluster labels of the patterns are as predictors of their class labels. In cases where all partitions to be compared have the same number of clusters, the measure is equivalent to the mutual information between the cluster labels and the class labels. In cases where the numbers of clusters are different, however, it computes the reduction in the number of bits that would be required to encode (compress) the class labels if both the encoder and decoder have free access to the cluster labels. To achieve this encoding the estimated conditional probabilities of the class labels given the cluster labels must also be encoded. These estimated probabilities can be seen as a ``model'' for the class labels and their associated code length as a ``model cost''.


Date: Thurs., March 9, 2000

Time: 4:15-5:30PM

Place: Ventura 17


Return to the seminar schedule