Appendix C Categorical variables and clustering algorithms
Huang (1998) states that standard hierarchical clustering methods can handle data with numeric and categorical values. However, he notes that their computational cost makes them unacceptable for clustering large data sets. This is in addition to the other issues with hierarchical methods discussed in Section 4.1. Huang (1998) also notes that, while the K-means clustering method is efficient for processing large data sets, it only works on continuous data because it minimises its distance-based cost function by updating the means of clusters, and a mean is not defined for categorical values. This prevents it from being used directly in applications where categorical data are involved.
Huang (1998) proposes the K-modes approach to deal with categorical data. However, the drawback of this approach is that it does not allow numeric and categorical data to be combined in a single clustering technique in which the numeric data are clustered using the K-harmonic means approach. Therefore, as a middle ground, we follow Ralambondrainy (1995).
Ralambondrainy (1995) presented an approach to using the K-means algorithm to cluster categorical data. Ralambondrainy's approach is to convert each multiple-category attribute into a set of binary attributes (using 0 and 1 to represent whether or not the household displays that attribute) and to treat the binary attributes as numeric in the K-means algorithm. We slightly modify Ralambondrainy's (1995) approach for use with the K-harmonic means algorithm. Huang (1998) states that the drawback of this approach is that the cluster means for categorical variables, given by real values between 0 and 1, do not describe the characteristics of the clusters. However, by taking a simple ex post frequency of the households in each cluster that display a given attribute, we are able to describe the cluster's characteristics. For example, by counting the number of households in the cluster who rent, counting those who have a mortgage, and comparing the two, we are able to describe whether the cluster is predominantly made up of renters or mortgage holders.
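The binary conversion and the ex post frequency count described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the tenure values, the cluster labels and the function names (to_binary, cluster_shares) are hypothetical and chosen only for illustration, and the cluster labels are simply assumed to have been produced by some clustering run rather than computed here.

```python
from collections import Counter, defaultdict


def to_binary(values, categories):
    """Convert one categorical attribute into 0/1 indicator columns,
    one column per category, in the order given by `categories`."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def cluster_shares(values, labels):
    """Ex post share of each category among the households of each cluster."""
    counts = defaultdict(Counter)
    for value, cluster in zip(values, labels):
        counts[cluster][value] += 1
    return {
        cluster: {cat: n / sum(cnt.values()) for cat, n in cnt.items()}
        for cluster, cnt in counts.items()
    }


# Hypothetical housing tenure for seven households (illustrative only).
tenure = ["rent", "rent", "mortgage", "rent", "mortgage", "mortgage", "own"]
categories = ["rent", "mortgage", "own"]

# Hypothetical cluster assignments, assumed to come from a clustering run.
labels = [0, 0, 0, 0, 1, 1, 1]

# Step 1: binary attributes that can be treated as numeric inputs.
binary = to_binary(tenure, categories)

# Step 2: frequency of each tenure type within each cluster.
shares = cluster_shares(tenure, labels)

# Step 3: the most frequent tenure type describes the cluster.
dominant = {cluster: max(s, key=s.get) for cluster, s in shares.items()}

# The mean of the binary "rent" column within cluster 0 equals the share
# of renters in that cluster.
rent_mean_0 = sum(
    row[0] for row, cluster in zip(binary, labels) if cluster == 0
) / labels.count(0)

print(shares)
print(dominant)
```

In this made-up example, three of the four households in cluster 0 rent, so the cluster would be described as predominantly renters, while cluster 1 would be described as predominantly mortgage holders.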
