Appendix B Sensitivity to weight selection
There is relatively little guidance on selecting appropriate weights for each dimension of the distance function outlined in Section 4.1.3, especially when using the K-harmonic means approach in the presence of categorical variables. In forming the weights we started by assuming uniform weights on all the variables. Initial investigation of the results showed that this approach put too much weight on the categorical variables (home ownership and qualification); that is, clusters were being formed primarily on these categorical variables rather than on the numeric data. To counter this we increased the relative weights on the dimensions we thought were most important in describing household characteristics (primarily income and age) until we obtained clusters that were reasonably robust to small changes in relative weights. The additional dimensions were then added with lower weights to help us refine the clusters (ie, to distinguish between clusters of similar age and income).
To look at how sensitive cluster membership is to changing relative weights, we look at what happens when we zero weight income, thereby creating new relative weights (see Table B.1). Zero weighting income also allows us to address a potential criticism of our approach. Owing to social mobility, there are likely to be changes in where people sit in the income distribution, meaning that the demographics (in terms of age, qualification etc) of the income earner at any given income percentile are not likely to be the same in the two time periods studied. This potentially opens us to the criticism that we are not really tracking 'like' people through time in terms of demographics, but are really just tracking people with 'like' incomes.
| Demographic Dimension | Original | New |
|---|---|---|
| Age of Primary Income Earner | 22% | 30% |
| Household Disposable Income | 26% | - |
| Number of Children | 5% | 6% |
| Qualification | 2% | 3% |
| Household Ownership | 6% | 9% |
| Proportion of government transfers (Ex WfF) | 6% | 9% |
| Proportion of Investment income | 6% | 9% |
| Proportion of Pension related income | 21% | 28% |
| Proportion of private income | 5% | 6% |
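
The effect of zero weighting a dimension can be illustrated with a minimal sketch. The exact distance function is the one defined in Section 4.1.3, which is not reproduced here, so the form below is an assumption: a weighted sum of squared differences on numeric dimensions (taken to be pre-scaled to the unit interval) plus a 0/1 mismatch term on categorical dimensions. The dictionary keys and the two example households are hypothetical; only the weights come from Table B.1.

```python
# Weights from Table B.1, written as fractions. The dictionary keys are
# hypothetical short names for the nine dimensions.
ORIGINAL = {"age": 0.22, "income": 0.26, "children": 0.05,
            "qualification": 0.02, "ownership": 0.06, "transfers": 0.06,
            "investment": 0.06, "pension": 0.21, "private": 0.05}
NEW = {"age": 0.30, "income": 0.0, "children": 0.06,
       "qualification": 0.03, "ownership": 0.09, "transfers": 0.09,
       "investment": 0.09, "pension": 0.28, "private": 0.06}

# Assumption: these two dimensions are treated as categorical.
CATEGORICAL = {"qualification", "ownership"}


def weighted_distance(a, b, weights):
    """Weighted sum of per-dimension distances between two households.

    Assumed form (not taken from the source): squared difference on
    numeric dimensions, 0/1 mismatch on categorical dimensions.
    """
    total = 0.0
    for dim, w in weights.items():
        if dim in CATEGORICAL:
            d = 0.0 if a[dim] == b[dim] else 1.0
        else:
            d = (a[dim] - b[dim]) ** 2
        total += w * d
    return total


# Two hypothetical households that differ only in (scaled) income.
household = {"age": 0.3, "income": 0.2, "children": 0.5,
             "qualification": "degree", "ownership": "rent",
             "transfers": 0.1, "investment": 0.0, "pension": 0.0,
             "private": 0.9}
richer = dict(household, income=0.9)

d_original = weighted_distance(household, richer, ORIGINAL)  # income counts
d_new = weighted_distance(household, richer, NEW)            # income ignored
```

Under the original weights the two households are separated only by their income gap; under the new weights their distance is zero, so any difference in cluster assignment must come from the remaining demographic dimensions.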
A useful device for assessing the effect of zero weighting income is the transition matrix, presented in Table B.2. Each row of Table B.2 shows the percentage of an original cluster's membership that ended up in each of the new clusters (the columns). The results are encouraging: all clusters maintain between 97% and 100% of their membership. Given that we zero weighted the dimension with the largest weight, this gives us a reasonable degree of confidence in the stability of the clusters to weight selection, and that when we track clusters through time we are tracking people of 'like' demographics. Because so many of our variables are highly correlated (age, percentage of income from pensions and home ownership status, for example), different relative weights at the margin should not radically affect cluster membership. This is because we minimise the distance function over many dimensions. For an older household, for example, the distance to its cluster centre is going to be small on each of age, home ownership status and the proportion of income from pensions, so changing the relative weights on these dimensions is not going to materially change the composition of the cluster.
| Original clusters (rows) / Clusters excluding income dimensions (columns) | I | J | G | K | D | F | C | H | L | A | B | E |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I | 100% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| J | 2% | 97% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| G | 0% | 0% | 99% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| K | 0% | 0% | 0% | 100% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| D | 0% | 0% | 0% | 0% | 98% | 2% | 0% | 0% | 0% | 0% | 0% | 0% |
| F | 0% | 2% | 0% | 0% | 0% | 97% | 0% | 0% | 0% | 0% | 0% | 0% |
| C | 0% | 0% | 0% | 0% | 0% | 2% | 98% | 0% | 0% | 1% | 0% | 0% |
| H | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 100% | 0% | 0% | 0% | 0% |
| L | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 99% | 0% | 0% | 0% |
| A | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 100% | 0% | 0% |
| B | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 100% | 0% |
| E | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 100% |
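
A transition matrix of this kind is a row-normalised cross-tabulation of two sets of cluster labels for the same households. The sketch below shows one way it could be computed. It assumes simple unweighted household counts (the source does not say whether survey weights were applied), and the function name and toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def transition_matrix(original_labels, new_labels):
    """Percentage of each original cluster's members found in each new cluster.

    Both arguments list one cluster label per household, in the same order.
    Returns {original_label: {new_label: percentage}}; each row sums to 100.
    """
    if len(original_labels) != len(new_labels):
        raise ValueError("label lists must have the same length")
    counts = defaultdict(Counter)
    for old, new in zip(original_labels, new_labels):
        counts[old][new] += 1
    return {
        old: {new: 100.0 * n / sum(row.values()) for new, n in row.items()}
        for old, row in counts.items()
    }


# Toy data (hypothetical): six households, one of which moves from J to I.
original = ["I", "I", "J", "J", "J", "J"]
reclustered = ["I", "I", "J", "J", "J", "I"]
matrix = transition_matrix(original, reclustered)
```

In the toy data, cluster I keeps all of its members while cluster J keeps three of four, so the J row reads 75% in J and 25% in I. The diagonal entries are the retention rates discussed in the text.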
Our second robustness check is to cluster the 2009/10 dataset directly (ie, apply the algorithm to the 2009/10 dataset) and then compare these clusters to the 2009/10 clusters created under our original methodology. Table B.3 presents the results. With the exception of cluster L, clusters generally retain between 88% and 100% of their membership, which is encouraging in terms of satisfying us of the clusters' robustness. Cluster L maintains two-thirds of its membership, which is still relatively high, but it does mean that, relative to the other clusters, we need to caution against attaching too much significance to this cluster's results.
| 2009/10 HES clusters (rows) / Original clusters (columns) | K | J | F | I | E | L | C | H | G | B | A | D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K | 98% | 0% | 0% | 1% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| J | 0% | 100% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| F | 0% | 0% | 97% | 0% | 0% | 0% | 1% | 1% | 0% | 0% | 0% | 0% |
| I | 1% | 0% | 2% | 94% | 0% | 1% | 0% | 0% | 0% | 0% | 0% | 1% |
| E | 0% | 0% | 0% | 0% | 93% | 1% | 0% | 0% | 0% | 0% | 0% | 6% |
| L | 0% | 0% | 0% | 0% | 1% | 67% | 23% | 0% | 0% | 3% | 0% | 6% |
| C | 0% | 0% | 0% | 0% | 0% | 1% | 88% | 0% | 0% | 10% | 0% | 0% |
| H | 0% | 0% | 0% | 0% | 0% | 0% | 12% | 87% | 0% | 0% | 0% | 0% |
| G | 0% | 3% | 0% | 0% | 0% | 0% | 0% | 0% | 97% | 0% | 0% | 0% |
| B | 0% | 0% | 0% | 0% | 2% | 5% | 0% | 0% | 0% | 92% | 0% | 0% |
| A | 0% | 0% | 4% | 0% | 0% | 0% | 0% | 7% | 0% | 0% | 88% | 0% |
| D | 0% | 0% | 3% | 0% | 0% | 0% | 3% | 4% | 0% | 0% | 0% | 89% |
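
The reading of Table B.3 given above can be reproduced with a short sketch that scans the diagonal (retention) entries and flags clusters that fall below a chosen cut-off. The retention figures and the cluster L row are copied from Table B.3; the 80% threshold is a hypothetical choice made purely for illustration and is not taken from the source.

```python
# Diagonal of Table B.3: percentage of each cluster's membership retained.
retention = {"K": 98, "J": 100, "F": 97, "I": 94, "E": 93, "L": 67,
             "C": 88, "H": 87, "G": 97, "B": 92, "A": 88, "D": 89}

# Hypothetical cut-off below which a cluster's results are treated with caution.
THRESHOLD = 80

flagged = sorted(c for c, pct in retention.items() if pct < THRESHOLD)

# Non-zero off-diagonal entries of the cluster L row in Table B.3,
# showing where the members that left cluster L ended up.
row_l_leavers = {"E": 1, "C": 23, "B": 3, "D": 6}
main_destination = max(row_l_leavers, key=row_l_leavers.get)
```

With this cut-off only cluster L is flagged, and most of the households leaving it are reassigned to cluster C.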
