New Zealand Households and the 2008/09 Recession

4 Clusters

4.1 The clustering technique

Clustering techniques are commonly employed in applied data analysis, particularly marketing; an early survey of their use in this field is provided by Punj and Stewart (1983). The popularity of the approach in marketing is closely linked to the idea of market segmentation: the attempt to distinguish homogeneous groups of consumers who can be targeted in the same manner because they have similar characteristics and preferences. Given that we are also trying to establish groups with broadly similar characteristics and preferences, this approach is attractive to us.

4.1.1 Dimensions for determining clusters

Punj and Stewart (1983) stress that the application of clustering techniques is not without its challenges. Reflecting on their meta-analysis of clustering studies, they suggest that attention to the dimensions used in determining the clusters is critical, as even one or two irrelevant dimensions may distort an otherwise useful analysis. They also state that there needs to be a rationale for inclusion, perhaps on the basis of theory or hypothesis. We start with the dimensions used to form our household types in Section 3 and supplement them with additional dimensions that allow us to define our clusters more precisely. The dimensions we use are: age of highest income earner; number of children; qualification;[14] home ownership; household disposable income;[15] proportion of income from government transfers (excluding Working for Families); proportion of income from private and public pensions; proportion of income from investments; and proportion of income from private sources (excluding private pensions and investments).

The first four dimensions seek to ensure that households within a cluster have similar demographics and therefore broadly similar tastes and preferences. Disposable income is included for this reason too, but also as a measure of how well the household can absorb shocks. Finally, we look at the proportion of income that comes from different sources. First, this gives us the ability to create clusters with varying sensitivity to different shocks (for example, a financial/housing market shock will more strongly affect a cluster that derives a higher proportion of its income from investments). Second, the sources of income contain some demographic information; for example, we can distinguish between working and non-working older people by the share of pension income relative to the share of wage and salary income.

4.1.2 The clustering algorithm and the distance function

Punj and Stewart (1983) identify three interrelated issues that need to be addressed when clustering: the clustering algorithm to use; the measure of similarity between observations ("the distance measure"); and how the data should be standardised. Punj and Stewart (1983) suggest that the choice of distance function and standardisation method is not critical; hence we do not spend much time discussing our assumptions around these.

In terms of identifying the algorithm, there are two broad types: hierarchical methods and iterative partitioning methods. Put simply, hierarchical methods adopt either a "bottom up" or a "top down" approach. Under the "bottom up" approach the starting point is each observation in its own cluster, with pairs of clusters then merged (up to a point) based on similarity. Under the "top down" approach, all observations start in one cluster and are split recursively. Iterative partitioning methods adopt a different approach, initially breaking the sample into a set number of clusters and then allocating each observation to the nearest cluster; the centre of each cluster is then moved iteratively so that the final positions of the clusters best fit the data. A critical difference between the methods is that iterative partitioning can reallocate an observation to a different cluster to better fit the data, which is not possible under hierarchical methods. On the basis of their meta-analysis of previous empirical studies, Punj and Stewart (1983) conclude that hierarchical methods are generally inferior to iterative partitioning methods; hence we adopt the latter.

In terms of the specific iterative partitioning algorithm to use, Punj and Stewart (1983) state that K-means (discussed below) is more robust than other methods in the presence of outliers, error perturbations in the distance measure and changes in the choice of distance measure. It is also less affected by irrelevant dimensions in determining the clusters. For these reasons we use an algorithm based on K-means, modified, as we discuss below, to deal with its sensitivity to random starting points.

K-means is a centre-based algorithm. The algorithm seeks to position the centre of each cluster by minimising the average distance from each of the observations in a given cluster to the cluster's centre. Closeness of any observation to the centre of a cluster ($M_i$) is measured by the distance measure. To calculate the distance measure for a given household, for each dimension $d$ described above (for example age, income, etc), we create an index:

Equation 4.1: $x_d = w_d a_d$

where $w_d$ is the weight we assign to the importance of dimension $d$ and (if the values that dimension can take are numeric) $a_d$ is the observed value of that dimension for the given household, standardised to a value between 0 and 1 based on its percentile relative to all observed values of that variable in the dataset. For categorical variables where percentiles are meaningless (qualification and home ownership) we create a variable for each possible outcome, assign either a 0 or a 1 depending on whether or not the household meets that outcome, and multiply it by the dimension's weight. In Appendix D we briefly review the literature around clustering techniques and categorical variables to explain why we have adopted this treatment of categorical variables.
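To make the construction of the index concrete, the sketch below builds the weighted index in Python. The column names, weights and toy data are illustrative assumptions, not taken from the paper; the percentile-rank standardisation and one-indicator-per-outcome treatment follow the description above.

```python
import pandas as pd

def percentile_index(series: pd.Series) -> pd.Series:
    # Standardise a numeric dimension to (0, 1] via its percentile rank.
    return series.rank(pct=True)

def build_index(df, numeric_weights, categorical_weights):
    # Build the weighted index x_d = w_d * a_d for every dimension.
    parts = {}
    for col, w in numeric_weights.items():
        parts[col] = w * percentile_index(df[col])
    for col, w in categorical_weights.items():
        # One 0/1 indicator per possible outcome, each scaled by the
        # dimension's weight, as described for qualification and ownership.
        dummies = pd.get_dummies(df[col], prefix=col).astype(float)
        for dummy_col in dummies.columns:
            parts[dummy_col] = w * dummies[dummy_col]
    return pd.DataFrame(parts)

# Illustrative usage with hypothetical column names and equal weights.
households = pd.DataFrame({
    "age": [34, 67, 45],
    "disposable_income": [42_000, 28_000, 61_000],
    "qualification": [5, 0, 8],               # treated as categorical
    "home_ownership": ["own", "rent", "own"],
})
index = build_index(
    households,
    numeric_weights={"age": 1.0, "disposable_income": 1.0},
    categorical_weights={"qualification": 1.0, "home_ownership": 1.0},
)
```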

For dimensions $1, 2, \ldots, d$ there is a vector:

Equation 4.2: $X_j = (x_1, x_2, \ldots, x_d)$

that describes each household; there is also a vector $M_i$:

Equation 4.3: $M_i = (m_1, m_2, \ldots, m_d)$

of the values of the index for each dimension $d$ at the centre of cluster $i$. The distance measure for a given household $j$ is:

Equation 4.4: $d(X_j, M_i) = \left\| X_j - M_i \right\|^2 = \sum_{k=1}^{d} \left( x_{j,k} - m_{i,k} \right)^2$

This is then summed over all households in cluster $i$, defined as those having $M_i$ as the closest centre.
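A minimal sketch of this calculation, assuming Equation 4.4 is the squared Euclidean distance (the form consistent with the K-harmonic means objective in Section 4.1.3); the function names and array shapes are illustrative.

```python
import numpy as np

def distance(x, m):
    # Equation 4.4: squared Euclidean distance between a household's
    # index vector x and a cluster centre m.
    return float(np.sum((x - m) ** 2))

def cluster_distance(X, centres, i):
    # Sum the distance measure over all households whose closest centre
    # is M_i (X is n households by d dimensions, centres is K by d).
    d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)  # n x K
    members = d.argmin(axis=1) == i
    return float(d[members, i].sum())
```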

4.1.3 The clustering algorithm

Before outlining the clustering algorithm, one issue that needs to be addressed is the selection of the number of clusters. We select the number of clusters by looking at the marginal contribution of an additional cluster to the RS measure of Sharma (1996). The RS measure quantifies between-cluster heterogeneity, which we are looking to maximise. We select 12 clusters because adding more than 12 yields close to zero improvement in between-cluster heterogeneity, at the cost of decreasing the sample size in each cluster and thereby reducing the statistical robustness of the results. Appendix D provides more detail on the selection of the number of clusters and the RS measure.
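The sketch below shows one common way to compute an RS-type measure, assuming the textbook definition RS = (SS_total − SS_within) / SS_total; Sharma (1996) and Appendix D remain the authoritative sources for the measure actually used in the paper.

```python
import numpy as np

def rs_measure(X, labels):
    # RS (Sharma 1996): the share of total variance explained by the
    # clustering, RS = (SS_total - SS_within) / SS_total. Higher values
    # indicate more between-cluster heterogeneity. One would compute this
    # for increasing K and stop where the marginal gain flattens out.
    ss_total = ((X - X.mean(axis=0)) ** 2).sum()
    ss_within = sum(
        ((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
        for k in np.unique(labels)
    )
    return float((ss_total - ss_within) / ss_total)
```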

The algorithm that clusters the data is as follows (a code sketch of the iteration follows the list):

  1. Select K random starting points $M_1, M_2, \ldots, M_K$ from the data (as discussed above, we have set the number of clusters, K, to 12).
  2. For each point $M_i$, find all observations that have $M_i$ as the closest point, using the distance measure above.
  3. Replace $M_i$ with the centroid (mean), across the $d$ dimensions, of all the observations closest to $M_i$; this becomes the new $M_i$.
  4. Repeat steps 2 and 3 until no cluster centre $M_1, M_2, \ldots, M_K$ changes when the centroid is calculated; that is, no improvement can be made by recomputing the mean of the closest observations. This is the same as saying that no household changes its assigned cluster.
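A minimal sketch of steps 1 to 4 in Python. This is plain K-means on hypothetical array inputs, not the K-harmonic means variant the paper ultimately estimates (discussed next).

```python
import numpy as np

def k_means(X, K=12, seed=0):
    # X: n households by d dimensions (the weighted index vectors).
    rng = np.random.default_rng(seed)
    # Step 1: K random starting points drawn from the data.
    centres = X[rng.choice(len(X), size=K, replace=False)]
    while True:
        # Step 2: assign each observation to its closest centre.
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Step 3: replace each centre with the centroid of its members.
        new_centres = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centres[i]
            for i in range(K)
        ])
        # Step 4: stop once no centre changes, ie no household switches cluster.
        if np.allclose(new_centres, centres):
            return centres, labels
        centres = new_centres
```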

The form of the K-means algorithm we use is the K-harmonic means version.[16] Let:

Equation 4.5: $d(\Omega, M) = \sum_{X \in \Omega} d(X, M)$

be the distance measure that describes the distance between each observation $X$ in the whole dataset $\Omega$ and all the centres $M$, summed across all observations. Specifically, the K-harmonic means minimises the following distance measure:

Equation 4.6: $d_{KHM}(\Omega, M) = \sum_{X \in \Omega} \dfrac{K}{\sum_{i=1}^{K} \dfrac{1}{\left\| X - M_i \right\|^2}}$

As can be seen from the inner summation over $i$, this measure considers the distance from every observation $X$ to the centre of every cluster $M_i$, in contrast to the K-means approach, which considers only the distance from $X$ to its nearest centre. The K-harmonic means approach then seeks the $K$ centres that minimise this distance function.
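A sketch of evaluating this objective, assuming squared Euclidean distances as above; the centre-update rule that actually minimises it is derived in Zhang (2000) and is not reproduced here.

```python
import numpy as np

def khm_objective(X, centres):
    # Equation 4.6: for every observation, take the harmonic average of its
    # squared distances to ALL K centres, then sum over observations.
    K = len(centres)
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)  # n x K
    d2 = np.maximum(d2, 1e-12)  # guard against division by zero
    return float((K / (1.0 / d2).sum(axis=1)).sum())
```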

The clusters were created by applying this algorithm to the 2006/07 Household Economic Survey dataset. In order to track how these clusters have fared post-recession, we then applied the centres (ie the final vector of dimensions for each cluster) to the 2009/10 dataset. We discuss how the populations of the clusters changed between the two periods when we discuss the results in Section 5.2.3. Tests of how well the clusters fit the data are reported in Appendix D, including how the clusters would change if the algorithm were instead applied first to the 2009/10 dataset. One important point to note is that the K-means algorithm does not make any statistical assumptions about the distribution of the variables it is clustering on; as a result, every observation is included in one of the clusters, ie no observations are excluded from a cluster altogether. One possible extension of this work would be to make statistical assumptions about variable distributions, making it possible to test the statistical similarity of individual observations to the cluster centres and thus exclude observations that are not statistically similar to any centre. Such an extension is beyond the scope of this paper, but represents an avenue for further research.
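As a sketch, applying the fixed 2006/07 centres to the later dataset amounts to a single nearest-centre assignment with no re-estimation of the centres; the function name and inputs below are illustrative.

```python
import numpy as np

def assign_to_fixed_centres(X_new, centres):
    # Assign each 2009/10 household to its nearest 2006/07 centre,
    # leaving the centres themselves unchanged.
    d = ((X_new[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```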

Notes

  • [14] Based on the ordinal ranking system used in the HES, which ranks qualifications from 0 (no qualification) to 8 (PhD), with 5 a bachelor's degree, rather than the three categories outlined in Section 2.1. More details are available on request.
  • [15] Note that we use disposable income rather than equivalised disposable income, as the number of children enters as a separate dimension.
  • [16] Consistent with Punj and Stewart (1983), Zhang (2000) notes that the K-means method stands out among the many clustering algorithms developed as one of the most popular and widely applied, but also that the clusters it creates are very sensitive to the initial random values. The problem arises because the K-means approach minimises the distance from a data point to its closest centre only. The K-harmonic means approach solves this problem by minimising the harmonic average of the distances from each observation to all centres. Verification that this solves the initialisation problem is beyond the scope of this paper; the interested reader is referred to Zhang (2000) for its exposition.