The Treasury


New Zealand Households and the 2008/09 Recession

Appendix D Assessing the 'goodness of fit' of clusters

Clustering aims to partition observations into clusters that are homogeneous on a set of attributes, while observations in different clusters are heterogeneous on those attributes. In this appendix we examine how different the clusters are from one another and which clusters are relatively more or less homogeneous.

Sharma (1996) proposes a measure of the heterogeneity between clusters, RS:

Equation D.1:

$$RS = \frac{SST - SSW}{SST}$$

where SST = total sum of squares, the distance between all observations as measured by the distance function; SSW = sum of squares within clusters, as measured by the distance function.

The value of RS ranges from 0 to 1, with 0 indicating no difference between clusters and 1 indicating the maximum possible difference.
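The calculation of RS can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used for the HES analysis: it assumes RS = (SST - SSW) / SST with SSW taken as the within-cluster sum of squares, it assumes squared Euclidean distance as the distance function, and the function names and the four toy points are hypothetical.

```python
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of distance function)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def r_squared(points, labels):
    """RS = (SST - SSW) / SST for a given assignment of points to clusters."""
    # SST: squared distances of all observations from the overall centroid.
    grand = centroid(points)
    sst = sum(sq_dist(p, grand) for p in points)
    if sst == 0:
        raise ValueError("all observations are identical, so RS is undefined")

    # SSW: squared distances of observations from their own cluster centroid.
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    ssw = 0.0
    for members in groups.values():
        c = centroid(members)
        ssw += sum(sq_dist(p, c) for p in members)

    return (sst - ssw) / sst


# Toy data (hypothetical): two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = ["A", "A", "B", "B"]
rs = r_squared(points, labels)
```

In this toy case SST is 104 and SSW is 4, so RS is 100/104, or about 0.96. Placing every point in a single cluster gives RS = 0, and placing every point in its own cluster gives RS = 1, matching the two ends of the range described above.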

For the 2006/07 HES, Figure D.1 plots RS against the number of clusters one could potentially form from the sample. For 12 clusters, the number we have chosen, RS is 0.975, indicating that the clusters are very different from one another. The graph also illustrates why we chose 12 clusters: beyond 12, the gain in RS from an additional cluster becomes very close to zero as the curve approaches its asymptote.
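The diminishing-gains reasoning can be made explicit with a small helper. The sketch below is one possible way to formalise it, not the procedure used in this paper: the tolerance is an assumed parameter, the function names are hypothetical, and the RS values are made-up illustrative numbers rather than the HES results.

```python
def marginal_gains(rs_by_k):
    """Map each k to RS(k + 1) - RS(k), wherever both values are available."""
    return {
        k: rs_by_k[k + 1] - rs_by_k[k]
        for k in sorted(rs_by_k)
        if k + 1 in rs_by_k
    }


def smallest_k_with_small_gains(rs_by_k, tol):
    """Smallest k from which every further cluster adds less than tol to RS.

    Returns None if even the last available gain is at least tol.
    """
    gains = marginal_gains(rs_by_k)
    chosen = None
    # Walk down from the largest k; stop at the first gain that is not small.
    for k in sorted(gains, reverse=True):
        if gains[k] < tol:
            chosen = k
        else:
            break
    return chosen


# Made-up RS values for k = 2..6 clusters (illustrative only, not HES data).
toy_rs = {2: 0.50, 3: 0.70, 4: 0.80, 5: 0.81, 6: 0.815}
chosen_k = smallest_k_with_small_gains(toy_rs, tol=0.05)
```

With these toy values the gains from a third and fourth cluster are large, while every cluster after the fourth adds less than 0.05, so the helper returns 4. A visual inspection of the curve, as in Figure D.1, applies the same logic without a fixed tolerance.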

Figure D.1: Cluster heterogeneity (RS) and number of clusters
Figure D.2: Goodness of fit of the clusters

As the K-harmonic means algorithm forces every household into one of the 12 clusters, some clusters will display more within-cluster variation than others; that is, some clusters fit the data better because they contain fewer outliers. Figure D.2 reports the proportion of total within-cluster variation that is attributable to each cluster. The clusters generally account for between 7% and 10% of within-cluster variation each, so no cluster is an extreme outlier on this measure. Between the two periods, cluster K becomes significantly more heterogeneous. This is hardly surprising, as this cluster represents older people still working: as older labour force participation increases and the cluster grows (by 30% between the two periods, as noted in Section 5.2.3), we would expect it to include people who are more diverse in terms of other attributes.
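The share of within-cluster variation attributable to each cluster can be computed as follows. This is a minimal sketch under stated assumptions, not the paper's own code: squared Euclidean distance stands in for the distance function, the function names are hypothetical, and the data are toy values chosen so that one cluster is visibly looser than the other.

```python
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of distance function)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_cluster_shares(points, labels):
    """Each cluster's share of the total within-cluster sum of squares."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)

    # Within-cluster sum of squares, computed separately for each cluster.
    ssw_by_cluster = {}
    for lab, members in groups.items():
        c = centroid(members)
        ssw_by_cluster[lab] = sum(sq_dist(p, c) for p in members)

    total = sum(ssw_by_cluster.values())
    if total == 0:
        raise ValueError("no within-cluster variation to apportion")
    return {lab: ssw / total for lab, ssw in ssw_by_cluster.items()}


# Toy data (hypothetical): cluster B is more spread out than cluster A.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 6.0)]
labels = ["A", "A", "B", "B"]
shares = within_cluster_shares(points, labels)
```

Here cluster A contributes 2 of the 20 units of within-cluster variation (10%) and cluster B contributes 18 (90%), so B is the more heterogeneous cluster. With 12 clusters of equal fit, each share would be about 8.3%, which gives a benchmark for reading the 7% to 10% range reported above.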
