5  Income comparisons

Summary

1. The correlation between HES and IRD income is about 0.79.
2. Mean HES income is about 1.5-6.1% higher (depending on the year) than mean IRD income (see Table 12).
3. Differences between HES and IRD income are not purely uncorrelated random noise; rather, they are correlated with many observable factors such as education, age, ethnicity, hours worked and income. Since the differences are correlated with some observable variables, it is reasonable to assume they are also correlated with some unobservable variables (though, naturally, this cannot be tested).

This section focuses on comparing IRD income to HES income for the linked population. There are many different ways of comparing data, and the analysis in this section does not aim to be exhaustive. This section largely focuses on total comparable income in each data source, but we also perform some separate analysis of the wage and salary component. Wages and salaries make up the bulk of comparable income (74% of HES and 77% of IRD - see Table 1 and Table 2). Self-employment income is compared in Appendix B.

Comparing Table 1 and Table 2, we see that mean HES income from all comparable sources is about 4% higher than mean comparable IRD income (\$34,609 compared with \$33,286). Table 12 shows these differences are reasonably constant over time, with mean HES income higher than mean IRD income by 1.5-6.1% depending on the year. Table 13 shows the differences are slightly smaller in percentage terms (0.8-4.1%) once we restrict to a sample of people with positive incomes in both data sets.[17]
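The "about 4%" figure can be reproduced from the two means quoted above. The sketch below assumes the percentage is taken relative to the IRD mean (the text does not state the base explicitly); the function name `pct_higher` is purely illustrative.

```python
def pct_higher(hes_mean: float, ird_mean: float) -> float:
    """Percentage by which hes_mean exceeds ird_mean.

    Assumption: the IRD mean is used as the base of the percentage.
    """
    return 100.0 * (hes_mean - ird_mean) / ird_mean


# Means of total comparable income quoted in the text (Table 1 and Table 2).
gap = pct_higher(34_609, 33_286)
print(f"{gap:.1f}%")  # prints "4.0%"
```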

Figure 1 presents the amount of total income earned by people within a given \$1,000 income band.[18] The HES data show spikes, typically at \$5,000 or \$10,000 intervals, which indicates that HES respondents often round their income to the nearest \$5,000 or \$10,000.[19] Rounding of income in survey data is not surprising and has been demonstrated before (see Schroeder and Sjoquest (1976) as an early example).
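A minimal sketch of the kind of aggregation behind a chart like Figure 1 is shown here: income is summed within each \$1,000 band, so rounding shows up as unusually large totals in bands that start at a round number. The incomes are hypothetical and the function name `total_income_by_band` is an illustrative assumption, not the authors' code.

```python
from collections import defaultdict


def total_income_by_band(incomes, band_width=1_000):
    """Sum income within each band, keyed by the band's lower edge."""
    totals = defaultdict(float)
    for income in incomes:
        lower_edge = int(income // band_width) * band_width
        totals[lower_edge] += income
    return dict(totals)


# Hypothetical survey responses: several respondents report exactly
# $40,000 or $50,000, as they would if rounding their income.
reported = [39_200, 40_000, 40_000, 40_000, 41_750, 48_300, 50_000, 50_000]
bands = total_income_by_band(reported)

for lower_edge in sorted(bands):
    print(f"${lower_edge:,}-${lower_edge + 999:,}: ${bands[lower_edge]:,.0f}")
```

In this toy data the \$40,000 and \$50,000 bands hold far more total income than their neighbours, which is the spike pattern described for the HES data.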

Naturally, the exact amount of IRD income earned will change depending on the number of pay cycles during the year, the amount of unpaid leave taken, whether a person has received a pay rise, when income gets reported to IRD and how IRD allocates income to individuals or businesses. Given these complications, a natural heuristic for survey respondents is to approximate their income when reporting it.

Figure 2 breaks down these distributions at the individual level.[20]

People who lie on the 45-degree line have the same income in both HES and IRD; people above the line have higher HES income, and those below it have higher IRD income. Thus, the large number of people on or near the 45-degree line shows that many people report similar income in both data sets. Given this strong diagonal, it is unsurprising that we have a strong correlation of 0.79 between the two measures of income, which increases to 0.81 for those who report positive income from both sources.
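The two correlations quoted above differ only in the sample used. The sketch below assumes the statistic is the ordinary Pearson coefficient (the text does not name it) and uses a handful of hypothetical (HES, IRD) pairs to show the calculation on the full sample and on the subset with positive income in both sources.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (HES income, IRD income) pairs; not real data.
pairs = [(0, 12_000), (30_000, 31_500), (45_000, 44_000),
         (8_000, 0), (60_000, 58_000)]

# Correlation over everyone in the sample.
r_all = pearson([h for h, _ in pairs], [i for _, i in pairs])

# Correlation restricted to people with positive income in both sources.
both_positive = [(h, i) for h, i in pairs if h > 0 and i > 0]
r_positive = pearson([h for h, _ in both_positive],
                     [i for _, i in both_positive])

print(f"all: {r_all:.2f}, both positive: {r_positive:.2f}")
```

Dropping the people who sit on an axis removes the observations furthest from the diagonal, which is why the restricted correlation is higher in this toy example, as it is in the text.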

The people on the y-axis are those who have positive HES income but zero IRD income, and vice versa for those on the x-axis. Table 7 in Appendix B breaks the scatterplot down into a numerical matrix. This shows that there are about 2,526 people (4.8% of linked people) on the x-axis (excluding the origin) and 1,806 (3.3% of linked people) on the y-axis (excluding the origin). The 4,332 people on the axes are far too many to be mostly explained by a 1.4% false-positive linkage error, which would only mismatch about 740 people, and most of these mismatches would likely not end up on an axis (most would likely end up in the interior of Figure 2 or at the origin). However, most of those on the x-axis and y-axis are not far from the origin - 79% (49%) of people on the x-axis (y-axis) have income between \$1 and \$20,000 in IRD (HES) - see Table 7. This suggests the main reason for reporting income in one data set but not the other is that the amount of income to report is small.
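The linkage-error argument above is a simple bound, sketched here with the counts quoted in the text. The `position` helper is a hypothetical illustration of how a person is assigned to the origin, an axis or the interior of Figure 2; it assumes non-negative incomes.

```python
def position(hes_income, ird_income):
    """Place a person on the Figure 2 scatterplot (assumes incomes >= 0)."""
    if hes_income == 0 and ird_income == 0:
        return "origin"
    if hes_income == 0:
        return "x-axis"   # IRD income only
    if ird_income == 0:
        return "y-axis"   # HES income only
    return "interior"


# Counts quoted in the text (Table 7, excluding the origin).
on_x_axis, on_y_axis = 2_526, 1_806
on_axes = on_x_axis + on_y_axis

# Approximate number of false-positive links quoted in the text.
expected_mismatches = 740

# Even if every mismatch landed on an axis, this is the largest share
# of axis cases that linkage error could account for.
upper_bound_share = expected_mismatches / on_axes
print(f"{on_axes} people on the axes; at most {upper_bound_share:.0%} from mismatches")
```

Because the text notes that most mismatches would fall in the interior or at the origin, the true share is likely well below this upper bound.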

As shown in Table 2, 83% (24,760 divided by 29,780) of total comparable income comes from wages and salaries. Figure 3 compares HES wage and salary income with IRD wage and salary income. Given that most of total comparable income comes from wages and salaries, it is unsurprising that Figure 3 tells a similar story to Figure 2, namely a high correlation coefficient and a strong diagonal along the 45-degree line. One key difference is that, in Figure 3, we now see more people on the y-axis (ie, with HES wages but no IRD wages) than on the x-axis (ie, with IRD wages but no HES wages). Table 8 in Appendix B shows 4,209 (7.9%) people on the y-axis and 2,556 (4.8%) on the x-axis. The fact that this asymmetry shows up in the wages and salaries data but is reversed and less pronounced in the overall income data suggests that part of the explanation is likely category hopping by HES respondents (people reporting other categories of income as wage and salary income), especially from self-employment income. Figure 5 (Appendix B) shows that self-employment income has the opposite (and much stronger) pattern to wages and salaries.

Further analysis of income subcomponents is presented in Appendix B, but as wages and salaries is such a large component of total income, the conclusions are not substantively different to those presented in this section. Income information from benefits is analysed separately in section 6.

Table 15 in Appendix E shows the association between the difference in HES income and IRD income and a number of covariates. The purpose of this is to show that the differences between the two sources are not classical random error and that, instead, they are correlated with a number of other variables. This is done by regressing the difference between HES and IRD earnings on IRD earnings, demographic controls (gender, age, ethnicity) and the number of hours worked (reported in HES).[21] Four separate regressions specify the difference as a levels difference, an absolute levels difference, a log difference and an absolute log difference. The covariates in each regression are the same except that, in the regressions where the difference is logged, IRD comparable income is also logged. Most of the covariates are statistically significant in each regression, with the effect sizes generally larger than those found in Hyslop and Townsend (2016a) for SoFIE data. Often the relationship between the covariates and the differences depends on whether the difference is specified in levels or in logs. Men consistently have larger absolute differences between survey and administrative income than women, as do those over 54 relative to younger cohorts. The coefficients on hours worked are consistently low and usually insignificant.
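The four dependent variables described above can be sketched as follows. This shows only how the outcome is constructed for one person, not the regressions themselves; it assumes the difference is HES minus IRD and that both incomes are positive (required for the log specifications). The function name `difference_measures` and the example incomes are hypothetical.

```python
import math


def difference_measures(hes, ird):
    """Four ways of specifying the HES-IRD difference for one person.

    Assumptions: the difference is HES minus IRD, and both incomes
    are strictly positive so that the logs are defined.
    """
    level = hes - ird
    log_diff = math.log(hes) - math.log(ird)
    return {
        "level": level,
        "abs_level": abs(level),
        "log": log_diff,
        "abs_log": abs(log_diff),
    }


# Two hypothetical people with the same $2,000 gap in levels.
low_earner = difference_measures(12_000, 10_000)
high_earner = difference_measures(52_000, 50_000)

print(low_earner["level"], round(low_earner["log"], 3))
print(high_earner["level"], round(high_earner["log"], 3))
```

The same dollar gap is a much larger log difference at a low income than at a high one, which is one reason a covariate's estimated relationship with the difference can change between the levels and log specifications.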

Table 16 and Table 17 calculate reliability ratios for HES and IRD data. Further explanation is in Appendix F. The reliability ratio is a measure of the degree of agreement between the administrative and survey data. We generally find lower reliability ratios than were found between SoFIE and IRD by Hyslop and Townsend (2016a). We typically observe higher reliability ratios for the administrative data than the HES survey data, which shows that the administrative data have a lower variance than the survey data.
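The link between reliability ratios and variance can be illustrated with a small sketch. It assumes a definition that is common in the measurement-error literature, where the reliability ratio of one measure is its covariance with the other measure divided by its own variance; Appendix F gives the definition actually used in this paper, which may differ. The data and function names are hypothetical.

```python
def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def reliability_ratio(own, other):
    """Assumed definition: cov(own, other) / var(own)."""
    return covariance(own, other) / covariance(own, own)


# Hypothetical incomes in $000s: the survey values equal the
# administrative values plus noise unrelated to the administrative values.
ird = [10, 20, 30, 40]
hes = [15, 15, 25, 45]

rr_ird = reliability_ratio(ird, hes)
rr_hes = reliability_ratio(hes, ird)
print(f"IRD: {rr_ird:.2f}, HES: {rr_hes:.2f}")
```

Under this assumed definition both ratios share the same numerator, so the measure with the higher ratio is necessarily the one with the lower variance.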

Notes

• [17] In addition to these tables, in order to compare to papers that take the log of income for their analysis, Table 14 shows the log of HES and IRD income over time. The average log difference between HES and IRD income is 0.016 log points, and the average of the absolute log difference is 0.3 log points.
• [18] A graph using the same technique but for a different purpose is shown in Tax Working Group (2010).
• [19] The early spikes to the left of \$20,000 are from government transfers, largely pensions.
• [20] To further protect people’s privacy, these data have been jittered, only a 5% random sample is shown and everybody with income higher than \$120,000 (or less than \$0) on either measure has been excluded. The raw scatter plots do not qualitatively differ.
• [21] These controls are similar to those used in Hyslop and Townsend (2016a), including the same binning of age. While the controls are similar, they are not the same (perhaps most notably we also include a control for IRD income). This means some differences in coefficients could be due to differences in the covariates included in addition to the underlying difference between HES and SoFIE.