Comparing the Household Economic Survey to administrative records: An analysis of income and benefit receipt (AP 17/01)

Abstract

We investigate the difference between people's survey responses to the Household Economic Survey (HES) and these same people's administrative income and benefit receipt records using the Integrated Data Infrastructure (IDI). 83% of people can be linked to the IRD data within the IDI. Those that are linked have higher incomes, are more likely to be male, are more likely to be of European descent and have lower reported benefit receipt than those who are not linked. HES reported total incomes (excluding benefit income and some other categories) are typically 1.5-6% higher than administrative measures of these same people's income. Despite this difference, there is still a strong correlation between HES and administrative income (about 0.79). On the other hand, reported benefit receipt in HES correlates poorly with the administrative measure. Benefits are under-reported on average, but many people also report receiving benefits that the administrative record says that they did not receive.

Acknowledgements

The authors are grateful to members of Treasury's Analytics and Insights team, especially Sarah Crichton and Sylvia Dixon for their many helpful comments. We would also like to thank Sarah Dovey from Statistics New Zealand for providing comments on an earlier draft. Finally, we are grateful for conversations about this paper with Dean Hyslop. Naturally, any remaining errors are the sole responsibility of the authors.

Code Availability

The code used to produce the statistics used in this report can be viewed (but not run) at the following GitHub address: https://github.com/Treasury-Analytics-and-Insights/HES_IDI_Comparison

Disclaimer

The views, opinions, findings, and conclusions or recommendations expressed in this report are strictly those of the authors. They do not necessarily reflect the views of the New Zealand Treasury, Statistics New Zealand or the New Zealand Government. The New Zealand Treasury and the New Zealand Government take no responsibility for any errors or omissions in, or for the correctness of, the information contained in this Analytical Paper.

The results in this report are not official statistics; they have been created for research purposes from the Integrated Data Infrastructure (IDI) managed by Statistics New Zealand.

Access to the anonymised data used in this study was provided by Statistics NZ in accordance with security and confidentiality provisions of the Statistics Act 1975. Only people authorised by the Statistics Act 1975 are allowed to see data about a particular person, household, business or organisation and the results in this paper have been confidentialised to protect these groups from identification.

Careful consideration has been given to the privacy, security and confidentiality issues associated with using administrative and survey data in the IDI. Further detail can be found in the Privacy impact assessment for the Integrated Data Infrastructure available from www.stats.govt.nz.

The results are based in part on tax data supplied by Inland Revenue to Statistics NZ under the Tax Administration Act 1994. This tax data must be used only for statistical purposes, and no individual information may be published or disclosed in any other form, or provided to Inland Revenue for administrative or regulatory purposes.

Any person who has had access to the unit-record data has certified that they have been shown, have read, and have understood section 81 of the Tax Administration Act 1994, which relates to secrecy. Any discussion of data limitations or weaknesses is in the context of using the IDI for statistical purposes, and is not related to the data's ability to support Inland Revenue's core operational requirements.

Executive summary

Researchers who use survey data have an interest in understanding how well it compares with administrative data. We explore the difference between people's survey responses to the Household Economic Survey (HES) and administratively recorded information on income and benefit receipt using the Integrated Data Infrastructure (IDI).

The motivation for this analysis is the potential to incorporate administrative data into Treasury's tax and welfare model, which is currently based on the Household Economic Survey (HES) data. It has long been recognised that there are some significant limitations with this approach, in particular, a relatively small sample size and the under-reporting of welfare payments. For some research questions, it is debated which data - survey or administrative - are more appropriate. However, in the case of modelling tax and welfare changes, it can be argued, a priori, that administrative data are more appropriate, given that administrative data better reflect the tax revenue the government actually receives and the benefit expenses the government actually incurs.

83% of adults (aged 15+) in HES can be linked to the IRD data within the IDI. Those that are linked - compared to those unlinked - have higher incomes, are more likely to be male, are more likely to report European ethnicity and have a lower level of reported benefit receipt. Because the linked sample differs quite substantially from the unlinked sample, researchers using only the linked subset should consider reweighting the data to better reflect the characteristics of the New Zealand population.

HES reported total comparable incomes (excluding benefit income) are highly correlated (ρ=0.79) with IRD incomes, although HES incomes are 1.5-6% higher (depending on the year) than the administrative measures.

Benefit receipt in HES compares poorly with the administrative measure. Many people either report in HES a benefit for which there is no administrative record of receipt or do not report a benefit that the administrative record shows they received. On average, people are much more likely to under-report a benefit than over-report a benefit, and this is especially true of the Accommodation Supplement.

Our results indicate that incorporating administrative data on benefit receipt is where the largest gains are likely to be made to Treasury's Taxwell model. Following that, incorporating data from IRD on income is the next logical step. As part of making these changes, the survey weights and population definition will likely need to be changed.

1 Introduction

Survey data, such as Statistics New Zealand's Household Economic Survey (HES), are frequently used for research and modelling work, but there are a number of reasons why survey responses may be inaccurate. First, respondents may struggle to remember what they received. Second, respondents may struggle to understand the questions, particularly with social transfers with similar target groups like Supported Living Payment and Disability Allowance. Even when respondents understand the questions, they may make approximations, such as rounding earnings to the nearest $10,000 or rounding time periods to the nearest month. In other cases, respondents may feel stigmatised or feel as if their privacy is invaded if they answer truthfully - for example, when reporting unemployment assistance or very large income items. Lastly, but not exhaustively, some respondents may report deliberately misleading information. These issues are not unique to survey data, and some of these and other issues may affect administrative data. Statistical agencies and researchers are aware of such issues and have ways of managing them both ex ante, during the design of the survey and ex post, in a way suitable for the research question. This paper presents comparisons between survey measures and administrative measures, which allows researchers using the HES survey data to better understand their quality.

"The three main objectives of HES are: to contribute to the reweighting of the consumers price index; to supply expenditure statistics for use in estimating gross domestic product, and to provide an indication of the overall living standards of New Zealanders".[1] Similarly, IRD data and MSD data are collected for administration of the tax and welfare system. Any differences between HES data and administrative sources do not imply that either data source is not fit for purpose. Nevertheless, it is helpful to understand the nature of such differences.

In 2016, HES data was linked to Statistics New Zealand's Integrated Data Infrastructure (IDI). This has made it possible to compare the reported survey data with the administrative records at the individual level. Previously, HES could only be compared to other data sources at an aggregate level, where the measures were calculated using different people. 83.3% of adults in the HES sample aged 15 years or over were linked to the Inland Revenue Department (IRD) register.[2]

Using these data, we perform three strands of analysis. First, we compare the linked HES respondents to the unlinked respondents to understand whether the linked sample is representative. Second, we compare income reported in HES to income recorded by IRD. Third, we compare benefit data reported in HES to that recorded by the Ministry of Social Development (MSD). These strands were chosen because the analysis helps inform issues relevant to incorporating administrative data into Taxwell, Treasury's tax and welfare microsimulation model.

The rest of this paper is outlined as follows. Section 2 briefly compares this paper to prior literature. Section 3 describes the data including how the data are linked and the concordance between HES incomes and IDI measures of incomes. Section 4 compares the linked population to the unlinked population. Section 5 presents the comparisons of HES incomes to IRD incomes. Section 6 presents comparisons of HES reported benefits to MSD records of benefit receipt. Section 7 concludes.

Notes

[1] Statistics New Zealand website: http://www.stats.govt.nz/browse_for_stats/people_and_communities/Households/household-economic-survey.aspx accessed 1 September 2016.
[2] An additional 1.5% of the adult sample could be linked to the IDI spine but not the IRD data. The IDI spine consists of data from IRD, the Department of Internal Affairs and Immigration.

2 Previous literature

There is a sizeable international literature comparing survey measures of income to tax-based administrative measures of income.[3] Indeed, because of the recent linking of several surveys to administrative data in New Zealand, there is now a modest number of New Zealand papers on the differences between survey and administrative data. Of these, the most closely related to ours are Suei (2016), who compares the New Zealand 2013 Census to the IRD data we use within the IDI, and Hyslop and Townsend (2016a, 2016b) who compare the IRD data to New Zealand's Survey of Family Income and Employment (SoFIE).

The techniques used in Suei (2016) are different to those used in our paper because Census income is recorded in bands rather than in precise amounts as in HES, IRD and SoFIE. On the whole, Suei finds similar results for Census-IRD comparisons as we find for HES-IRD comparisons, that is, a strong correlation between survey and administrative measures of income. However, whereas we find HES income is on average slightly higher than IRD income, Suei finds that Census income is on average slightly lower than IRD reported income.[4] Part of this difference is due to the way we treat missing records in IRD. In our paper, we treat people with an IRD number but no recorded income in a year as earning zero income in that year, whereas Suei treats their IRD income as unknown. However, when we side-step this issue by subsetting to people with positive reported incomes in both sources, we still find higher average incomes in the HES data compared with the IRD data.

Hyslop and Townsend (2016a) find results comparing SoFIE and IRD incomes that are similar to ours comparing HES and IRD incomes. The key difference between their results and ours is that they find a slightly closer relationship between their data sets than we do, and whereas we find HES incomes tend to be slightly higher than IRD incomes, they find SoFIE measured incomes tend to be slightly lower. For example, they find that mean SoFIE earnings are between 1% lower and 3.5% higher than mean IRD earnings (depending on the sample), whereas we find mean incomes in HES are 1.5-6% higher than mean IRD incomes (depending on the year).[5] In addition, Hyslop and Townsend find log SoFIE earnings are 0.02-0.04 log points lower than log IRD earnings, whereas log HES earnings are about 0.02 log points higher than log IRD earnings. Hyslop and Townsend find reliability ratios for SoFIE-IRD that are somewhat higher than what we find, with SoFIE reliability ratios of 0.83-0.85 and IRD reliability ratios of 0.87-0.91, whereas we find reliability ratios of 0.78 for HES and 0.82 for IRD.[6]

Somewhat relevant to these comparisons is Ball (2016). Though not the main purpose of his paper, Ball compares the samples (but not the different sources of earnings) in the IDI-linked HES, SoFIE and Census 2013 data. Ball shows that the linked HES sample of people has higher IRD reported income than the linked SoFIE and Census sample’s IRD income.

Other papers in New Zealand comparing survey and administrative data include Samoilenko and Law (2014), who found very large discrepancies between reported KiwiSaver enrolment in SoFIE and administrative data from IRD (with KiwiSaver enrolment under-reported by 50%). Chapple and Crichton (2012) (in part) compare Household Labour Force Survey (HLFS) reports of benefit receipt to MSD records for the same individuals and find that significant numbers of beneficiaries (based on MSD records) do not report Unemployment Benefit receipt in the HLFS. As we will see, these results are consistent with what we find when comparing HES to MSD benefit records.

Notes

[3] Studies comparing survey and administrative reports of income for the same people go back at least to Miller and Paley (1958) who match data for about 3,900 respondents to the 1950 US Census and compare this to IRS data. They found that median wage and salary income for matched families was about 4.6% ($3,570 vs. $3,412) higher in IRS data than Census data. Prominent early papers include Pischke (1995), Bound and Krueger (1991) and Bound, Brown, Duncan and Rodgers (1994). Recent contributions include Abowd and Stinson (2013), Kapteyn and Ypma (2007) and Britton, Shephard and Vignoles (2015). Bound, Brown and Mathiowetz (2001) provide an overview of measurement error in survey data including the effects on estimates, methods for correcting for measurement error and the empirical evidence from validation studies on measurement error's nature and extent.
[4] Census income is recorded in bands, so mean differences are not computed. However, Suei finds that 33% of people are in a lower income band for Census-measured total income compared with the tax data, while 25% are in a higher band in Census. This is also despite missing investment income information in the IRD data. Suei argues that "substantial conceptual differences between the two sources [of income] may be a key contributing factor" to these differences (Suei, 2016, p. 23).
[5] See Table 12.
[6] The reliability ratio is a measure of agreement between two measures of the same phenomenon. For each measure, it is calculated as the ratio of the covariance between the two measures to the variance of that measure, with reliability ratios closer to 1 indicating higher agreement. More details and discussion on reliability ratios can be found in Appendix F.
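
As a minimal sketch of the calculation described in note [6], the Python below computes both reliability ratios and the correlation from paired incomes. It assumes that each measure's ratio divides the covariance by that measure's own variance (Appendix F remains the authoritative definition). The function names and the six income pairs are invented for illustration and are not drawn from HES or IRD.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance with an n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def reliability_ratios(survey, admin):
    """Return (survey ratio, admin ratio, correlation) for paired values.

    Assumption: each ratio is cov(survey, admin) divided by the variance
    of the measure the ratio is named after.
    """
    cov = covariance(survey, admin)
    var_survey = covariance(survey, survey)
    var_admin = covariance(admin, admin)
    return cov / var_survey, cov / var_admin, cov / sqrt(var_survey * var_admin)


# Invented incomes for six hypothetical people (not real HES or IRD data).
hes = [30000.0, 45000.0, 0.0, 62000.0, 51000.0, 20000.0]
ird = [28500.0, 46200.0, 1200.0, 58000.0, 49500.0, 23000.0]

lam_hes, lam_ird, rho = reliability_ratios(hes, ird)
print(f"HES ratio {lam_hes:.3f}, IRD ratio {lam_ird:.3f}, correlation {rho:.3f}")

# Correlation implied by the ratios reported in the text (0.78 and 0.82).
implied_rho = sqrt(0.78 * 0.82)
print(f"implied correlation {implied_rho:.2f}")
```

Under this definition the product of the two ratios equals the squared correlation, so the reported ratios of 0.78 for HES and 0.82 for IRD imply a correlation of about 0.80, in line with the reported 0.79.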

3 Data

Each year, Statistics New Zealand conducts the Household Economic Survey (HES), which collects, among other things, information on household expenditure, individual income and individual benefit receipt. The interview period runs from 1 July through to 30 June each year, with people asked about income and benefit receipt during the last 12 months.[7] The linked data cover HES surveys from 2006/07 to 2014/15. Throughout our paper, we exclude children aged 14 years or less.[8]

Typically, there are about 5,500-7,000 adults per HES year, except for HES 2014/15 where there are 11,000.[9] The HES data were linked to the IDI data in early 2016, with this paper using data from the September IDI release. Of adults aged 15 years or more in the HES surveys from 2006/07 to 2014/15, 83.3% were linked to the IRD data, with a further 1.5% of people linked to the IDI spine but not the IRD data. Table 11 in Appendix D shows that this link rate is stable over time. Because we cannot compare incomes for people not linked to the IRD data and because the comparison of incomes is central to our analysis, our study focuses only on those who can be linked to the IRD data. References in this paper to the linked population refer to those linked to the IRD data.

For most of our analysis (except when analysing benefit receipt), we pool the data across all the HES years from 2006/07 to 2014/15, and we use the income data without inflation adjustment.[10]

Notes

[7] Often, Statistics New Zealand and others refer to 'HES' as the triennial expenditure survey and 'HES (Income)' as the shortened income survey in the intervening years. Since we do not (and cannot) compare expenditure data in this paper, we use 'HES' to refer to both 'HES' and 'HES (Income)' surveys.
[8] While some children do earn income (which can be seen in the IRD data), HES does not collect income data from children.
[9] Details about the exact number of people in each year can be found in Appendix D.
[10] As a robustness check, we performed much of the comparative analysis in this section using only HES 2014/15, and the conclusions did not substantively differ.

3.1 Interpreting the differences between the two data sources

There are a number of reasons why the survey data might differ from the administrative data.

Conceptual differences: The income measured in HES and IRD is not immediately comparable as some categories of income in HES (such as overseas income) are not collected in IRD.[11] In order to compare the two sources, we developed a concordance between the two (detailed in the next section). It is possible to match HES income to IRD income in the following categories: wages and salaries, self-employment, pensions, benefits, paid parental leave, student allowances, sole-trader income (including rental income) and partnership income. Our measure of total income consists of the above matched categories, excluding benefit income. The benefit data are analysed separately in section 6.

Table 1 and Table 2 show that 82% of HES income and 96% of IRD income is included in our defined total comparable income.[12]

Linkage error: Some of the people in HES will have been linked to the wrong person in the IDI. Statistics New Zealand estimates about 1.4% of people linked in HES are linked to the wrong person.[13]

Errors in the survey data: These occur when people incorrectly report their income in HES. For example, they may not remember how much they earned and guess, report income in the wrong category, round income up, or deliberately misreport benefit receipt or income (for example, when they feel stigmatised).

Errors in the administrative data: These could occur when the data are processed (including being put into the IDI) and could relate to the amount or the timing of these earnings.

Administrative data also do not include 'under the table' earnings that are not reported to IRD but may or may not be reported in HES.[14]

When we observe differences between the survey and administrative data, it is not possible to determine for certain which of the above factors is responsible. In general, the data set best suited to a particular research question will depend on the question. For example, the argument for preferring IRD income to HES income for tax and welfare modelling can be made independently of the comparative analysis, as IRD income, not survey income, is the basis for estimating tax liability and so IRD income is more likely than survey measured income to reflect the government's tax revenue.

Despite the difficulty of assigning a particular reason for individual discrepancies, it is still useful to analyse these differences in order to assess their extent and nature. For example, the more similar the two data sets, the less likely any analysis is to depend on the data set.

Notes

[11] Conceptual differences are not likely to be a problem for the benefit data.
[12] Of the 18% of HES income that is not compared, a third (6.2 percentage points) is income from benefits (which are compared elsewhere in this document), 3.6 percentage points is in the category of investment income, 2.6 percentage points is income earned overseas, 2.5 percentage points is income classified as other regular income and 3 percentage points is from irregular income - see Table 1. The 4% of income from the data.income_tax_yr_summary table that we decided not to include in total comparable income consists almost exclusively of benefit payments (3.3%), with the remainder being ACC payments (0.6% - see Table 2).
[13] This figure comes from Statistics New Zealand documentation available in the IDI. For details on how links and the link quality are determined, see Statistics New Zealand (2013) and especially Statistics New Zealand (2014).
[14] Since tax is not paid on this illegal income, excluding it when modelling tax takings is appropriate.

3.2 Income concordance

HES asks people about their income and benefit receipt in the past 12 months rather than over the last tax year. In order to compare the administrative measures of income to HES, we take the relevant time period from IRD income data as being the prior 12 months including the full interview month. This partially addresses a limitation with the IRD data, which uses the month as the most granular unit. For these comparisons, IRD data available only as annual amounts have been equally distributed across the 12 months in the year.
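
The windowing rule above can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the paper: the function names are hypothetical, annual amounts are assumed to cover a tax year running from 1 April to 31 March, and the dollar figures are invented.

```python
from datetime import date


def window_months(interview):
    """Return the 12 (year, month) pairs ending with the interview month."""
    last = interview.year * 12 + interview.month - 1
    months = []
    for index in range(last - 11, last + 1):
        year, month0 = divmod(index, 12)
        months.append((year, month0 + 1))
    return months


def spread_annual(amount, tax_year_end):
    """Spread an annual amount equally over April-March (assumed tax year)."""
    months = [(tax_year_end - 1, m) for m in range(4, 13)]
    months += [(tax_year_end, m) for m in range(1, 4)]
    return {month: amount / 12 for month in months}


def window_income(monthly, interview):
    """Sum monthly income over the 12 months ending with the interview month."""
    return sum(monthly.get(month, 0.0) for month in window_months(interview))


# Invented example: $24,000 in the year to March 2014, $36,000 to March 2015.
monthly = {}
monthly.update(spread_annual(24_000, 2014))
monthly.update(spread_annual(36_000, 2015))

interview = date(2014, 8, 15)
months = window_months(interview)
total = window_income(monthly, interview)
print(months[0], months[-1], total)
```

For an interview in August 2014 the window runs from September 2013 to August 2014, so seven months fall in the first tax year and five in the second.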

It is not straightforward to develop a concordance between HES and IDI income. There are many categories, such as overseas income and investment income, that are included in HES but not in the IDI. Similarly, there are two categories (benefits and ACC claims) of IDI income that we do not compare to HES income. We have excluded from HES the following categories that are not easily mapped to IRD categories: most forms of investment income (interest from bonds, stocks and managed funds, bank deposits, dividends etc.), benefits (which are analysed separately in section 6), overseas incomes and some types of New Zealand government pensions (we capture New Zealand Superannuation, Veteran's Pension, War Disablement Pension and the Surviving Spouse Pension).[15]

For the exact income categories we used, see Table 5 in Appendix A, which gives the fine-grained income codes included in the total and in each subcategory of income (wage and salary income, self-employment, partnership, etc.).

Notes

[15] The pension category excluded is HES category 3.1.0.05 Other type of New Zealand Government pension.

4 Comparing the linked people to the unlinked people and summary statistics

Summary

The linked and unlinked population are different on socioeconomic and demographic characteristics. Relative to the unlinked population, people in the linked population have higher incomes, are more likely to be male, more likely to report European ethnicity and are less likely to be on a benefit.
Restricting analysis to the linked HES-IDI population means the survey weights would need to be adjusted to make the linked population representative of the New Zealand population.
About 82% of HES income and 96% of IRD income is comparable. Of the HES comparable income, 84% of this income comes from wages and salaries, while for IRD income, this figure is 77%.

This section provides summary statistics for the HES and IRD income data and also compares the linked sample to the unlinked sample across selected socioeconomic and demographic variables. We compare the linked to the unlinked sample on HES variables, as administrative data are, by definition, not available for unlinked people. Caution is required interpreting comparisons based on HES reported measures, as there may be a correlation between being linked and reporting error.[16] All comparisons are on characteristics that can be assigned to the individual. All results are unweighted.

Table 1 compares the linked and unlinked populations across demographic and socioeconomic variables available from HES, while Table 2 shows summary statistics of IRD income (which is naturally only available for the linked population). Most variables in Table 1 show statistically significant differences across the linked and unlinked population, though absolute differences (statistically significant or not) are important too. In general, the linked HES population has higher average income, has a higher average age, is more likely to be of European ethnicity, is more likely to be male and is less likely to be on a benefit.

We have not looked at whether these distributional changes make the linked subsample more or less representative of the New Zealand population. However, because the linked population is different to the unlinked population, it follows that different survey weights would need to be calculated. This calculation is beyond the scope of this paper, and the relevant variables to weight on would depend on a study's research question.

There are plausible reasons why we would expect the link rate differentials observed. Females are more likely to change their name due to marriage (and divorce), and some ethnicities may have more variety in name transliteration or shortening - all of which would present difficulties in the linking process. These groups have lower incomes on average, which may explain why the unlinked populations have lower average incomes.

Table 2 shows summary statistics for IRD data. Naturally this is only for the linked population. The next section compares Table 1 and Table 2.

Table 1: Linked vs unlinked population on selected variables within HES
	Unlinked (1)	Linked (2)	Difference (3)	Percent difference (4)
Number of people	10,626	53,136
Income († = not part of compared income)
Total HES income - all categories including those not in IRD (mean)	36,460	41,425	-4,966***	-12%
HES compared income (mean)	29,884	34,609	-4,725***	-14%
Wages and salaries (mean)	25,079	29,052	-3,973***	-14%
Self-employment income (mean)	1,754	2,098	-343	-16%
Benefit income (mean)†	2,262	2,019	244***	12%
Investment income (mean)†	1,326	1,603	-277**	-17%
Overseas income (mean)†	930	613	317***	52%
Other regular income (mean)†	907	1,182	-275***	-23%
Irregular income (mean)†	1,074	1,261	-187	-15%
Demographics
Age (mean)	43.7	45.9	-2.2***	-5%
Female (proportion)	0.543	0.523	0.02***	4%
European (proportion)	0.678	0.779	-0.101***	-13%
Māori (proportion)	0.143	0.113	0.03***	27%
Pacific (proportion)	0.083	0.056	0.027***	48%
Asian (proportion)	0.132	0.082	0.05***	61%
Middle Eastern/Latin American/African (proportion)	0.014	0.009	0.005***	56%
Other ethnicity (proportion)	0.03	0.031	-0.001	-3%
New benefits (HES 2014/15 only)
Proportion with JSS	0.05	0.041	0.01*	24%
Proportion with SPS	0.029	0.021	0.008**	38%
Proportion with SLP	0.032	0.026	0.006	23%
Old benefits (HES 2006/07-2012/13)
Proportion with UB	0.027	0.02	0.007***	35%
Proportion with SB	0.023	0.019	0.004**	21%
Proportion with DPB	0.033	0.028	0.005**	18%
Proportion with IB	0.024	0.023	0.001	4%

Notes: This table reports comparisons between linked and unlinked HES members across a range of variables. Dollar values are rounded to the nearest dollar, mean age to one decimal place and proportions to three decimal places. Due to rounding, the difference in column (3) may not be the same as the difference of column (1) and (2). Column (4) uses the linked value in column (2) as the denominator in the percentage difference and is based on rounded data and so is quite approximate for the smaller proportions. Stars denote: * p
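
As a quick check on how column (4) of Table 1 relates to columns (1) and (2), the sketch below recomputes the percent difference for four rows, taking the linked value as the denominator. The row values are copied from Table 1. The function name is ours, and the choice of denominator is inferred from the published figures rather than stated by the authors.

```python
# (row label, unlinked value, linked value, published percent difference)
rows = [
    ("Total HES income (mean)", 36_460, 41_425, -12),
    ("Wages and salaries (mean)", 25_079, 29_052, -14),
    ("Benefit income (mean)", 2_262, 2_019, 12),
    ("Overseas income (mean)", 930, 613, 52),
]


def percent_difference(unlinked, linked):
    """Unlinked minus linked, as a percentage of the linked value."""
    return 100 * (unlinked - linked) / linked


for label, unlinked, linked, published in rows:
    computed = round(percent_difference(unlinked, linked))
    print(f"{label}: computed {computed}%, published {published}%")
```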

Table 2: IRD income composition (linked sample)
	Mean income	Percentage with non-zero income	Mean conditional on non-zero income
Wages and salaries	25,497	60.5%	42,115
Benefits†	1,113	11.7%	9,541
IR20 director Income	1,659	5.1%	32,523
PAYE director Income	851	2.0%	42,326
Withholding tax director income	9	0.0%	20,870
ACC payments†	201	2.0%	10,030
IR20 partner income	652	3.7%	17,852
PAYE partner income	29	0.1%	31,396
NZ Superannuation	2,879	18.2%	15,814
Paid parental leave	52	1.1%	4,713
IR3 income (sole-trader)	956	5.7%	16,834
Sole-trader receiving PAYE deducted income	5	0.0%	18,655
Sole-trader withholding tax income	556	3.1%	17,727
IR3 rental income	34	1.8%	1,815
Student allowance	107	2.2%	4,854
IRD income - all sources	34,600	90.0%	38,432
Total income - comparable sources	33,286	84.5%	39,393

† non-compared category

Notes

[16] For example, if individuals misreport their age, it is unlikely they will be correctly linked.

5 Income comparisons

Summary

The correlation between HES and IRD income is about 0.79.
Mean HES income is about 1.5-6.1% higher (depending on the year) than mean IRD income (see Table 12).
Differences between HES and IRD income are not purely uncorrelated random noise but rather are correlated with many observable factors like education, age, ethnicity, hours worked and income. Since they are correlated on some observable variables, it is reasonable to assume they are correlated on some unobservable variables (though, naturally, this is not possible to test).

This section focuses on comparing IRD income to HES income for the linked population. There are many different ways of comparing data, and the analysis in this section does not aim to be exhaustive. This section largely focuses on total comparable income in each data source, but we also perform some separate analysis of the wage and salary component. Wages and salaries make up the bulk of comparable income (84% of HES and 77% of IRD - see Table 1 and Table 2). Self-employment income is compared in Appendix B.

Comparing Table 1 and Table 2, we see that mean HES income from all comparable sources is about 4% higher than mean comparable IRD income ($34,609 compared with $33,286). Table 12 shows these differences are reasonably constant over time, with mean HES income higher than mean IRD income by 1.5-6.1% depending on the year. Table 13 shows the differences are slightly smaller in percentage terms (0.8-4.1%) once we restrict to a sample of people with positive incomes in both data sets.[17]
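
The pooled comparison can be reproduced from the published means. The sketch below uses the linked-sample means from Table 1 and Table 2. The variable and function names are ours, and because the inputs are rounded the results are approximate.

```python
# Means for the linked sample, copied from Table 1 (HES) and Table 2 (IRD).
hes_comparable_mean = 34_609
ird_comparable_mean = 33_286
hes_wages_mean = 29_052
ird_wages_mean = 25_497


def pct(numerator, denominator):
    """Express numerator as a percentage of denominator."""
    return 100 * numerator / denominator


# How much higher mean HES comparable income is than mean IRD comparable income.
gap_pct = pct(hes_comparable_mean - ird_comparable_mean, ird_comparable_mean)

# Share of comparable income that comes from wages and salaries in each source.
hes_wage_share = pct(hes_wages_mean, hes_comparable_mean)
ird_wage_share = pct(ird_wages_mean, ird_comparable_mean)

print(f"HES mean is {gap_pct:.1f}% higher than the IRD mean")
print(f"Wage share: HES {hes_wage_share:.0f}%, IRD {ird_wage_share:.0f}%")
```

The pooled gap of roughly 4% sits inside the 1.5-6.1% range of the yearly figures in Table 12.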

Figure 1 presents the amount of total income earned by people within a given $1,000 income band.[18] The HES data shows spikes, typically at $5,000 or $10,000 intervals, showing HES respondents often round income to the nearest $5,000 or $10,000.[19] Rounding of income in survey data is not surprising and has been demonstrated before (see Schroeder and Sjoquest (1976) as an early example).

Naturally, the exact amount of IRD income earned will change depending on the number of pay cycles during the year, the amount of unpaid leave taken, whether a person has received a pay rise, when income gets reported to IRD and how IRD allocates income to individuals or businesses. With these complications, a natural reporting heuristic is to approximate your income.
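
One simple way to quantify this kind of rounding is the share of positive incomes that fall exactly on a round-number multiple. The sketch below is illustrative only: the function name and both income lists are invented, and this is not necessarily how Figure 1 was produced.

```python
def heaping_share(incomes, step=5000):
    """Share of positive whole-dollar incomes that are exact multiples of step."""
    positive = [income for income in incomes if income > 0]
    if not positive:
        return 0.0
    heaped = sum(1 for income in positive if income % step == 0)
    return heaped / len(positive)


# Invented whole-dollar incomes for six hypothetical people.
survey_reported = [40000, 45000, 52000, 30000, 38500, 60000]
admin_recorded = [39812, 45230, 51774, 30941, 38467, 59120]

survey_share = heaping_share(survey_reported)
admin_share = heaping_share(admin_recorded)
print(f"survey {survey_share:.2f}, admin {admin_share:.2f}")
```

A much larger share in the survey series than in the administrative series would be consistent with respondents approximating their income.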

Figure 1: Income distribution in HES and IRD

Figure 2 breaks down these distributions at the individual level.[20]

Figure 2: HES overall income vs IRD overall income (with 45 degree line)

People who lie on the 45 degree line have the same income in both HES and IRD, people who are above the 45 degree line have higher HES income and those below have higher IRD income. Thus, the large number of people on or near the 45 degree line shows that many people report similar income in both data sets. Given this strong diagonal, it is unsurprising that we have a strong correlation of 0.79 between the two measures of income, which increases to 0.81 for those who report positive income from both sources.

The people on the y-axis are those that have positive HES income but zero IRD income and vice versa for those on the x-axis. Table 7 in Appendix B breaks the scatterplot down into a numerical matrix. This shows that there are about 2,526 people (4.8% of linked people) on the x-axis (excluding the origin) and 1,806 people (3.3% of linked people) on the y-axis (excluding the origin). The 4,332 people on the axes are far too many to be mostly explained by a 1.4% false-positive linkage error, which would only mismatch about 740 people, and most of these mismatches would likely not end up on an axis (most would likely end up in the interior of Figure 2 or at the origin). However, many of those on the x-axis and y-axis are not far from the origin - 79% (49%) of people on the x-axis (y-axis) have income between $1 and $20,000 in IRD (HES) - see Table 7. This suggests the main reason for reporting income in one data set but not the other is that the amount of income to report is small.

As shown in Table 2, 83% (24,760 divided by 29,780) of total comparable income comes from wages and salaries. Figure 3 compares HES wage and salary income with IRD wage and salary income. Given that most of total comparable income comes from wages and salaries, it is unsurprising that Figure 3 tells a similar story to Figure 2, namely a high correlation coefficient and a strong diagonal along the 45 degree line. One key difference is that, in Figure 3, we now see more people on the y-axis (ie, with HES wages but no IRD wages) than on the x-axis (with IRD wages but no HES wages). Table 8 in Appendix B shows 4,209 (7.9%) people on the y-axis and 2,556 (4.8%) on the x-axis. The fact that this asymmetry shows up in the wages and salaries data but is reversed and less pronounced in the overall income data suggests that part of the explanation is likely category hopping by HES respondents (people reporting other categories of income as wage and salary income), especially from self-employment income. Figure 5 (Appendix B) shows that self-employment has the opposite (and much stronger) pattern to the wage and salaries.

Figure 3: HES wages and salaries vs IRD wages and salaries

Further analysis of income subcomponents is presented in Appendix B, but as wages and salaries is such a large component of total income, the conclusions are not substantively different to those presented in this section. Income information from benefits is analysed separately in section 6.

Table 15 in Appendix E shows the association between the difference in HES income and IRD income and a number of covariates. The purpose of this is to show that the differences between the two sources are not classical random error and that, instead, the differences between the data are correlated with a number of other variables. This is done by regressing the difference between HES and IRD earnings on IRD earnings, demographic controls (gender, age, ethnicity) and the number of hours worked (reported in HES).[21] Four separate regressions specify the difference as a levels difference, an absolute levels difference, a log difference and an absolute log difference. The covariates in each regression are the same except that, in the regressions where the difference is logged, IRD comparable income is also logged. Most of the covariates are statistically significant in each regression, with the effect sizes generally larger than those found in Hyslop and Townsend (2016a) for SoFIE data. Often the relationship between the covariates and the differences depends on whether the difference is specified in levels or in logs. Men consistently have larger absolute differences between survey and administrative income than women, as do those over 54 relative to younger cohorts. The coefficients on hours worked are consistently low and usually insignificant.
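
The four specifications described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for Table 15: the data are synthetic, only two stand-in covariates (`female`, `hours`) are included, the helper `ols` is a hypothetical name, and robust standard errors are not computed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins (NOT HES or IRD data): positive incomes plus two covariates.
ird = rng.lognormal(mean=10.2, sigma=0.6, size=n)
hes = ird * np.exp(rng.normal(0.02, 0.3, size=n))
female = rng.integers(0, 2, size=n).astype(float)
hours = rng.uniform(0, 60, size=n)

def ols(y, regressors):
    """OLS coefficients for y on a constant plus the given regressors."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

log_diff = np.log(hes) - np.log(ird)

# Each entry pairs a dependent variable with the matching income control:
# IRD income in levels for the levels regressions, logged for the log regressions.
specs = {
    "levels": (hes - ird, ird),
    "abs levels": (np.abs(hes - ird), ird),
    "log": (log_diff, np.log(ird)),
    "abs log": (np.abs(log_diff), np.log(ird)),
}
results = {name: ols(y, [income, female, hours]) for name, (y, income) in specs.items()}

for name, beta in results.items():
    print(name, np.round(beta, 3))  # order: constant, income control, female, hours
```

Under classical random error, the coefficients on the covariates would all be close to zero; Table 15 shows that this is not the case in the linked data.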

Table 16 and Table 17 calculate reliability ratios for HES and IRD data. Further explanation is in Appendix F. The reliability ratio is a measure of the degree of agreement between the administrative and survey data. We generally find lower reliability ratios than were found between SoFIE and IRD by Hyslop and Townsend (2016a). We typically observe higher reliability ratios for the administrative data than the HES survey data, which shows that the administrative data have a lower variance than the survey data.

Notes

[17] In addition to these tables, in order to compare to papers that take the log of income for their analysis, Table 14 shows the log of HES and IRD income over time. The average log difference between HES and IRD income is 0.016 log points, and the average of the absolute log difference is 0.3 log points.
[18] A graph using the same technique but for a different purpose is shown in Tax Working Group (2010).
[19] The early spikes to the left of $20,000 are from government transfers, largely pensions.
[20] To further protect people’s privacy, these data have been jittered, only a 5% random sample is shown and everybody with income higher than $120,000 (or less than $0) on either measure has been excluded. The raw scatter plots do not qualitatively differ.
[21] These controls are similar to those used in Hyslop and Townsend (2016a), including the same binning of age. While the controls are similar, they are not the same (perhaps most notably we also include a control for IRD income). This means some differences in coefficients could be due to differences in the covariates included in addition to the underlying difference between HES and SoFIE.

6 HES benefits vs IRD benefits#

Summary

Main benefits are measured poorly in HES. Many people fail to report benefits they received, and many other people report benefits they did not receive. Of people the Ministry of Social Development have paid main benefits to, only 75% report any type of benefit in HES, and of those who report receiving a main benefit in HES, only 88% have a record of receiving one in the MSD data.
On average, under-reporting of benefit receipt is more common than over-reporting. This is true even for people who report benefit receipt in both data sources.
63% of people who receive Accommodation Supplement from the Ministry of Social Development do not report receiving Accommodation Supplement in HES.

This section compares self-reported HES data on benefit receipt to administrative benefit records provided by the Ministry of Social Development (MSD) to Inland Revenue.

The HES coverage of the benefit system is, in principle, the same as MSD. MSD benefits are attributed exclusively to individuals, so there is no possibility of accidentally attributing benefits to a company. However, HES respondents may not remember what benefits they have received over the last year, they may make approximations such as rounding time periods or they may confuse different types of benefits, for example, Supported Living Payment and Disability Allowance. Respondents may also feel stigmatised or feel like their privacy is being invaded and decide not to report benefit receipt.

No data set is perfect, and it is possible that there are errors in the administrative data too, including from mismatches or other reasons.[22]

We focus on two comparisons across all main benefits - comparing benefit indicators and comparing the number of days on benefit. Accommodation Supplement, a non-taxable housing subsidy available to both beneficiaries and non-beneficiaries, is also compared in this section. Comparisons by main benefit type are presented in Appendix C.

The benefit system was changed on 15 July 2013. For this reason, we compare HES and MSD benefits under the old system (we performed these comparisons for the new benefit system and found similar patterns). Comparisons of the old benefits use HES surveys 2006/07-2012/13. We exclude the HES 2013/14 year as this year included both benefit systems, and the HES data has known issues with benefit reporting in this year.[23] Accommodation Supplement (AS) wasn't changed as part of the 2013 reform, so the comparisons presented for AS are for the full time period.

Table 3 shows that, of the people MSD have paid main benefits to, only 75% report any type of benefit in HES. It also shows that only 88% of those who report receiving a main benefit in HES have a corresponding MSD record. The level of misreporting is too high to be explained by a 1.4% false positive link rate. For example, a 1.4% false positive rate would generate about 540 mismatches, and as a rough back-of-the-envelope calculation, we might expect about 50 in each of the lower left and upper right cells or about 10% and 5% of what we observe.[24] Even if the false positive rate was as high as 10%, this would only account for about 340 in the lower-left cell and 400 in the upper-right cell. Thus, most of the people in lower-left cell and upper-right cell of Table 3 are likely due to survey respondent error. Respondents are both reporting benefits that they do not receive and failing to report benefits that they do receive, though on average, respondents are under-reporting benefit receipt.

Table 3: HES vs MSD any benefit indicator (2006/07-2012/13)
Any HES benefit indicator	Any MSD benefit indicator: No	Any MSD benefit indicator: Yes
No	33,573 (87.2%)	1,110 (2.9%)
Yes	465 (1.2%)	3,339 (8.7%)
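
The back-of-the-envelope calculation described above (and spelled out in footnote 24) can be reproduced from the counts in Table 3. The sketch below is a rough check rather than the authors' code; the function name `expected_mislinks` is hypothetical, and it relies on the same independence assumptions as the footnote.

```python
# Counts from Table 3 (rows: HES indicator, columns: MSD indicator).
hes_no_msd_no, hes_no_msd_yes = 33_573, 1_110
hes_yes_msd_no, hes_yes_msd_yes = 465, 3_339

total = hes_no_msd_no + hes_no_msd_yes + hes_yes_msd_no + hes_yes_msd_yes
hes_yes = hes_yes_msd_no + hes_yes_msd_yes
msd_yes = hes_no_msd_yes + hes_yes_msd_yes
share_msd_no = (hes_no_msd_no + hes_yes_msd_no) / total
share_hes_no = (hes_no_msd_no + hes_no_msd_yes) / total

def expected_mislinks(false_positive_rate):
    """Expected counts in the two off-diagonal cells due to mislinking alone."""
    lower_left = false_positive_rate * hes_yes * share_msd_no   # HES yes, MSD no
    upper_right = false_positive_rate * msd_yes * share_hes_no  # HES no, MSD yes
    return lower_left, upper_right

for rate in (0.014, 0.10):
    ll, ur = expected_mislinks(rate)
    print(f"rate={rate:.1%}: total mislinks={rate * total:.0f}, "
          f"lower-left={ll:.0f}, upper-right={ur:.0f}")
```

With these inputs the estimates come out close to the figures quoted in the text (roughly 50 per cell at a 1.4% rate, and roughly 340 and 400 at a 10% rate), well below the 465 and 1,110 people observed in those cells. Small differences from footnote 24 may reflect rounding.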

Table 4 shows a similar tabulation for Accommodation Supplement (AS), which shows that only 37% of people MSD has paid AS to report it in HES (ie, 63% do not report it). For people who report receiving AS in HES, only 79% have a corresponding MSD record.

Table 4: Accommodation Supplement HES vs administrative indicator (2006/07-2014/15)
HES AS indicator	MSD AS indicator: No	MSD AS indicator: Yes
No	46,662 (87.8%)	3,681 (6.9%)
Yes	594 (1.1%)	2,199 (4.1%)
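
The headline percentages quoted for Table 3 and Table 4 follow directly from the cell counts. The short sketch below recomputes them; the function name `reporting_rates` and its argument names are illustrative choices, not part of the original analysis.

```python
def reporting_rates(no_no, no_yes, yes_no, yes_yes):
    """Agreement rates from a 2x2 table of survey (rows) vs admin (columns) counts.

    Arguments are named <survey>_<admin>, e.g. no_yes = survey No, admin Yes.
    """
    admin_recipients = no_yes + yes_yes
    survey_reporters = yes_no + yes_yes
    return {
        "admin recipients who report in survey": yes_yes / admin_recipients,
        "survey reporters with an admin record": yes_yes / survey_reporters,
    }

main_benefits = reporting_rates(33_573, 1_110, 465, 3_339)             # Table 3
accommodation_supplement = reporting_rates(46_662, 3_681, 594, 2_199)  # Table 4

for label, rates in [("Main benefits", main_benefits),
                     ("Accommodation Supplement", accommodation_supplement)]:
    for name, value in rates.items():
        print(f"{label}: {name} = {value:.1%}")
```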

Figure 4 shows the difference in total days on benefit between MSD and HES across all benefit types. Figure 4 shows the old benefit system, though the conclusions also apply to the much smaller sample under the new benefit system. The skew towards the positive indicates that MSD records more days on benefit than is reported in HES. Simply put, benefits are under-reported on average in HES. This may be due to, among other things, stigma associated with receiving a benefit, confusion between benefits and tax credits or linkage error. The bar on the far left (negative) shows HES respondents who report being on benefit for a full year, yet MSD has no record of paying them a benefit. This shows that, in addition to under-reporting of HES benefits, even people reporting an HES benefit may not actually be receiving one.

Figure 4: Difference in total days on any benefit between MSD and HES

These results are likely to have less impact on Treasury's Taxwell modelling than might be expected for three reasons. First, Treasury's calibration (reweighting) process explicitly uses the total number of main benefit recipients as a benchmark. Which households receive main benefits will change, but the weights will adjust so that the total number of beneficiaries remains constant. Any changes to aggregate costs would need to come through benefit rates, which are modelled directly and hence are unlikely to dramatically change. Second, non-benefit income is used to determine if an individual is eligible to receive a benefit in Taxwell - only people reporting an HES benefit with the appropriate non-benefit income level will actually receive it. This will fix some of the people who report receiving a benefit in HES when no MSD record exists. Third, AS is modelled on an eligibility basis.[25] This likely means that Taxwell is overestimating the number of people receiving AS. Making the necessary changes to Taxwell to determine the impact of moving to administrative data is beyond the scope of this paper.

Because the survey measures of benefit receipt compare so poorly to the administrative measures, there is a strong case for using administrative data on benefit receipt in place of HES data in Taxwell. This is particularly so in the case of the second-tier benefit Accommodation Supplement.

Notes

[22] Benefit data are sourced from the IDI tables named msd_clean.msd_spell and msd_clean.msd_partner
[23] See the HES 2012/13 Commentary which notes that "However, there was a 10.0 [percent] fall in the number of people receiving this [government benefits except New Zealand Superannuation or war pensions] income".
[24] The rough calculation is as follows. Lower-left cell estimate = false positive rate * number of people on benefit in HES * proportion in MSD not on benefit = 48. Upper-right cell estimate = false positive rate * number of people on benefit in MSD data * proportion in HES not on benefit = 56. These estimates assume that the false positive rate is independent of benefit receipt and that who people are mislinked to is also independent of benefit receipt. The number of people on benefit/not on benefit in MSD figures is based on those linked to the HES data. These assumptions are not likely to hold completely, but they provide a useful approximation.
[25] It is possible in Taxwell to use reported AS or even no AS. These are only used for testing purposes.

7 Conclusion#

This analysis has investigated how survey responses from HES compare with the administrative data available in the IDI on a limited range of variables. Three types of comparative analysis were presented. The first compared the linked population to the unlinked population, which concluded that there are significant differences in the two populations in terms of income and ethnicity, which would need to be addressed through adjustments to the calibration processes for Taxwell. The second type of comparative analysis compared HES income to IRD income. We found an overall strong correlation, although on average, HES income is 1.5-6% higher than IRD income. This suggests that a tax model based on IRD income data could be somewhat more accurate. The third type of comparative analysis compared HES benefit measures to MSD benefit measures and concluded that, because the survey measures of benefit receipt compare poorly to the administrative measures, there is a strong case for incorporating administrative data on benefit receipt into Taxwell.

For researchers considering supplementing HES with administrative data, changing the benefit data is where the largest gains are likely to be made. Following that, replacing the HES income data with IRD data is the next logical step, though because some categories of income are collected in HES but not recorded in the linked IRD data (such as investment income and irregular income), which income source a researcher prefers will depend on their research question. As part of making any of these changes, the survey weights and population definition would need careful consideration.

Appendices#

Appendix A: Income concordance#

Table 5: Detailed concordance between HES and IDI (comparable sources - excluding benefits - only)[26]
Income category	Detailed HES codes included	HES data dictionary text	IRD code
Wages	1.1.1.01	Wages and salaries from 1st current job	W&S
	1.1.1.02	Wages and salaries from 2nd current job
	1.1.1.03	Wages and salaries from other current jobs
	1.1.1.04	Bonuses from all current jobs
	1.1.1.05	Commission from all current jobs
	1.1.1.07	Other taxable income from all current jobs
	1.1.2.01	Wages and salaries from 1st previous job
	1.1.2.02	Wages and salaries from 2nd previous job
	1.1.2.03	Wages and salaries from other previous jobs
	1.1.2.04	Redundancy from 1st previous job
	1.1.2.05	Redundancy from 2nd previous job
	1.1.2.06	Redundancy from other previous jobs
	1.1.2.07	Bonuses from all previous jobs
	1.1.2.08	Commission from all previous jobs
	1.1.2.10	Other taxable income from all previous jobs
Self-employment income	1.2.1.01	Self-employment income from 1st current job	S00, S01, S02, C00, C01, C02, P00, P01, P02
	1.2.1.02	Self-employment income from 2nd current job
	1.2.1.03	Self-employment income from other current jobs
	1.1.1.06	Director fees, honoraria, remuneration for school-board of trustees from all current jobs
	1.2.2.01	Self-employment income from 1st previous job
	1.2.2.02	Self-employment income from 2nd previous job
	1.2.2.03	Self-employment income from other previous jobs
	1.1.2.09	Director fees, honoraria, remuneration for school-board of trustees from all previous jobs
	2.5.0.03	Income from partnership as a non-working shareholder or proprietor
Pensions	3.1.0.01	New Zealand Superannuation	PEN
	3.1.0.02	Veteran’s Pension
	3.1.0.03	War Disablement Pension
	3.1.0.04	Surviving Spouse Pension
Paid parental leave	3.2.0.05	Paid parental leave paid by Inland Revenue	PPL
Student allowance	3.2.0.27	Student allowance	STU
Rental income	2.3.0.01	Income from rent	S03
Total comparable income	Sum of all of the above categories		Sum of all of the above categories

Note that total income excludes some categories in each data set.
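
For readers working with the IRD side of the concordance, the mapping in Table 5 can be expressed as a simple lookup. The sketch below is a hypothetical illustration: the dictionary name `IRD_CODES`, the function `total_comparable_income` and the example record are assumptions, and only the codes themselves come from Table 5 and Table 6.

```python
# IRD income source codes per comparable category, as listed in Table 5.
IRD_CODES = {
    "Wages": ["W&S"],
    "Self-employment income": ["S00", "S01", "S02", "C00", "C01", "C02",
                               "P00", "P01", "P02"],
    "Pensions": ["PEN"],
    "Paid parental leave": ["PPL"],
    "Student allowance": ["STU"],
    "Rental income": ["S03"],
}

def total_comparable_income(amount_by_code):
    """Sum amounts over the IRD codes that Table 5 treats as comparable."""
    comparable = {code for codes in IRD_CODES.values() for code in codes}
    return sum(v for code, v in amount_by_code.items() if code in comparable)

# Made-up record: BEN and CLM appear in Table 6 but not in Table 5,
# so they are left out of total comparable income.
example = {"W&S": 42_000, "S03": 3_000, "BEN": 5_000, "CLM": 1_200}
print(total_comparable_income(example))
```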

Table 6: Definitions of IRD income source codes
IRD income source code	Statistics New Zealand description text
W&S	Wages and salaries
BEN	Benefit payments from the Ministry of Social Development
CLM	Accident Compensation Corporation (ACC) payments
PEN	Pension payments from MSD
PPL	Paid parental leave payments from MSD
STU	Student allowance payments from MSD
C00	Director/shareholder income from the IR20
C01	Company director/shareholder receiving PAYE deducted income
C02	Company director/shareholder receiving WHT deducted income
P00	Partnership income from the IR20
P01	Partner receiving PAYE tax deducted income
P02	Partner receiving withholding tax deducted income
S00	Sole trader income from the IR3
S01	Sole trader receiving PAYE deducted income
S02	Sole trader receiving withholding tax deducted income
S03	Rental income from the IR3

Notes

[26] Note all IDI data on incomes came from the "income_tax_yr_summary" table under the data schema. This table is derived by Statistics New Zealand from MSD and IRD data.

Appendix B: Additional income comparisons#

Tables 7-9 present cross-tabulations of HES income vs IRD income for overall income, wages and salary income and self-employment income. These cross-tabulations support the scatter plots presented in the main text and highlight the density of people who have recorded income of zero in both HES and IRD data.

These people are most pertinent in the cross-tabulation of self-employment income in Table 9, where we see that most people have their self-employment income classified correctly as zero. However, for those who do have self-employment income reported in either data source, the correlation between the HES and IRD measures is 0.25. This can be seen in Table 9 and Figure 5.

Table 7: Cross-tabulation of HES and IRD total comparable income (2006/07-2014/15)

Note that all counts have been randomly rounded to base 3, and counts less than 6 have been suppressed. This means that columns and rows may not sum to totals.

Table 8: Cross-tabulation of HES and IRD wage and salary income (2006/07-2014/15)

Note that all counts have been randomly rounded to base 3, and counts less than 6 have been suppressed. This means that columns and rows may not sum to totals.

Table 9: Cross-tabulation of HES and IRD self-employment income (2006/07-2014/15)

Note that all counts have been randomly rounded to base 3, and counts less than 6 have been suppressed. This means that columns and rows may not sum to totals.

We have repeated the analysis of total income for each subcategory of income. Outside of wages and salaries most people are correctly classified as earning zero in both data sources. However, those who report positive income in at least one of the data sources typically have a very different value (often zero) in the other source (see the self-employment income scatterplot in Figure 5). The discrepancies in these income categories have little effect on aggregate income because few people receive income from these sources and those who do receive it often earn only small amounts. Results for self-employment income are shown in Figure 5. As already mentioned in section 5, most of the conclusions about total income apply to wages and salaries. A scatterplot of wages and salaries is presented in Figure 3, which is practically identical to Figure 2 on overall income. Figure 6 presents a scatterplot of combined self-employment and wages and salaries, which, unsurprisingly, is near identical to Figures 2 and 3.

Figure 5: HES self-employment vs IRD self-employment

Figure 6: Combined wages and self-employment

Appendix C: Subcategory benefit comparison#

Table 10 compares HES measures of benefit receipt to MSD records of benefit receipt. To read this table, start by reading each yellow square, each of which shows a cross-tabulation of the number of people with HES or MSD indicators of benefit receipt at any point during the year. Ideally, the counts on the off-diagonal elements of the yellow squares would be zero, which would show that HES classified everyone in the same way as MSD’s records.

Table 10: Comparison of HES benefits to administrative benefits for 'old' benefits (2006/07-2012/13)

The results in Table 10 are similar to those seen in the new benefit system. Unemployment Benefit has the worst HES reporting, with 52% of MSD recipients not reporting this benefit in HES. On the other end of the scale for working-age benefits, only 25% of Invalid’s Benefit (IB) recipients fail to report this in HES. There is also a surprising number of people who either misreport the benefit type or incorrectly indicate they received a benefit during the interview window. These people are covered in detail in section 6, although it is interesting that HES DPB recipients are relatively well reported. The final benefit comparison is the difference in number of days reported by benefit type. Three examples covering the spectrum are presented. Figure 7 (UB) and Figure 9 (AS) show severe under-reporting (tilted towards the positive direction), whereas Figure 8 (DPB) shows only mild evidence of under-reporting.[27]

These figures show that, even though there is heterogeneity by subcategory, problems with HES benefit reporting are not dominated by one benefit type.

Figure 7: Unemployment Benefit

Figure 8: Domestic Purposes Benefit

Figure 9: Accommodation Supplement

Notes

[27] The band width in these histograms is 14 days except for the far left and right spikes, which are the bands [-∞,-359] and [355, ∞]. The central spike in each figure is for the interval [-9,4].

Appendix D: Comparisons over time#

Table 11: Link rates over time
	Unlinked	Linked	Percent linked
06/07	1,083	4,704	81.3%
07/08	1,326	5,631	80.9%
08/09	1,143	5,580	83.0%
09/10	918	5,412	85.5%
10/11	954	6,078	86.4%
11/12	1,089	6,081	84.8%
12/13	849	4,998	85.5%
13/14	1,191	5,652	82.6%
14/15	2,070	8,997	81.3%
Total	10,626	53,136	83.3%

Note that because all counts have been randomly rounded to base 3, the counts for each year may not sum to the total across all years. Linked means linked to the IRD data.

Table 12: Mean HES and IRD total comparable income over time
HES year	Mean IRD overall income (1)	Mean HES overall income (2)	Percentage difference (3)
06/07	28,451	29,351	3.2%
07/08	30,913	32,383	4.8%
08/09	31,576	33,493	6.1%
09/10	32,035	32,873	2.6%
10/11	32,638	34,280	5.0%
11/12	33,742	34,942	3.6%
12/13	35,595	37,020	4.0%
13/14	36,545	37,095	1.5%
14/15	35,914	37,581	4.6%
Overall average	33,286	34,609	4.0%

Notes: Values in columns (1) and (2) have been rounded to the nearest dollar. Means are calculated on the entire linked sample. IRD income is used as the denominator in the percentage difference calculation. Overall average weights each year by the number of people in that year.
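
As a rough check on how the last row of Table 12 is constructed, the sketch below weights the yearly means by the linked counts from Table 11 and uses IRD income as the denominator for the percentage difference. It is an approximation only, because the published counts are randomly rounded to base 3; the function name `weighted_mean` is an illustrative choice.

```python
# Linked counts from Table 11 and yearly means from Table 12 (2006/07 to 2014/15).
linked = [4704, 5631, 5580, 5412, 6078, 6081, 4998, 5652, 8997]
mean_ird = [28451, 30913, 31576, 32035, 32638, 33742, 35595, 36545, 35914]
mean_hes = [29351, 32383, 33493, 32873, 34280, 34942, 37020, 37095, 37581]

def weighted_mean(values, weights):
    """Mean of `values` weighted by `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

overall_ird = weighted_mean(mean_ird, linked)
overall_hes = weighted_mean(mean_hes, linked)

# Percentage difference with IRD income as the denominator.
pct_diff = 100 * (overall_hes - overall_ird) / overall_ird
yearly_pct = [100 * (h - i) / i for h, i in zip(mean_hes, mean_ird)]

print(round(overall_ird), round(overall_hes), round(pct_diff, 1))
print([round(p, 1) for p in yearly_pct])
```

The weighted means computed this way should land within a dollar or so of the overall averages reported in the table.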

Table 13: Conditional mean IRD and HES total comparable income over time
HES Year	Conditional mean IRD income (1)	Conditional mean HES income (2)	Percentage difference (3)
06/07	34,850	35,576	2.0%
07/08	37,371	38,418	2.7%
08/09	39,039	40,698	4.1%
09/10	39,449	40,110	1.6%
10/11	40,542	41,879	3.2%
11/12	42,424	42,832	1.0%
12/13	44,008	45,220	2.7%
13/14	45,066	45,423	0.8%
14/15	44,177	45,496	2.9%
Overall average	41,071	42,063	2.4%

Notes: Values in columns (1) and (2) have been rounded to the nearest dollar. Conditional means are calculated for people in the linked sample who have reported earnings in both data sources. IRD income is used as the denominator in the percentage difference calculation.

Table 14: Log comparable income over time
Year	(1) ln(HES comparable income)	(2) ln(IRD comparable income)	(3) ln(HES) - ln(IRD comparable income)	(4) abs[ln(HES) - ln(IRD)]
06/07	10.041	10.074	-0.033	0.348
	(1.126)	(1.018)	(0.77)	(0.688)
07/08	10.165	10.129	0.036	0.305
	(1.006)	(1.024)	(0.62)	(0.54)
08/09	10.226	10.199	0.027	0.293
	(0.998)	(1.004)	(0.616)	(0.542)
09/10	10.201	10.182	0.019	0.307
	(1.059)	(1.03)	(0.665)	(0.59)
10/11	10.237	10.216	0.021	0.303
	(1.04)	(1.021)	(0.656)	(0.582)
11/12	10.293	10.271	0.022	0.288
	(1.028)	(1.019)	(0.649)	(0.582)
12/13	10.326	10.313	0.012	0.287
	(0.989)	(0.964)	(0.614)	(0.543)
13/14	10.333	10.323	0.01	0.297
	(1.032)	(1.014)	(0.635)	(0.561)
14/15	10.335	10.316	0.02	0.305
	(1.009)	(1)	(0.63)	(0.551)
Average	10.248	10.232	0.016	0.303
	(1.033)	(1.014)	(0.649)	(0.574)
N	42,006	42,006	42,006	42,006

Notes: Columns (1) and (2) present the mean of log HES and IRD comparable income in each year. Column (3) reports the difference of Columns (1) and (2), ie, MEAN(ln_HES - ln_IRD). Column (4) reports the average of the absolute value of the log difference, ie, MEAN(ABS(ln_HES - ln_IRD)). All calculations are on the linked subsample of people who have positive (greater than $1) reported incomes. The total reported in the last row has been randomly rounded to base 3.

Appendix E: Correlates of differences#

Table 15: Correlates of differences
Variable	(1) HES earnings- IRD earnings	(2) abs(HES - IRD)	(3) ln(HES) - ln(IRD)	(4) abs[ln(HES) - ln(IRD)]
IRD comparable income	-0.2***	0.2***
	(0.024)	(0.021)
ln(IRD comparable income)			-0.29***	-0.28***
			(0.009)	(0.008)
Female	-4884***	-1662***	-0.12***	-0.12***
	(442)	(399)	(0.008)	(0.007)
Aged 15-24	-5927***	-1636***	-0.3***	-0.19***
	(699)	(620)	(0.015)	(0.013)
Aged 25-54	-1077**	-2233***	-0.03***	-0.03***
	(446)	(404)	(0.008)	(0.007)
Doctorate degree	1586	-7418***	0.03	0.01
	(1613)	(1475)	(0.036)	(0.034)
Level 1 certificate - level 3 certificate	-5853***	26	-0.14***	-0.1***
	(736)	(658)	(0.012)	(0.01)
Level 4 certificate	-4945***	84	-0.09***	-0.09***
	(822)	(751)	(0.013)	(0.011)
Level 5 diploma, level 6 diploma	-4106***	-11	-0.07***	-0.04***
	(690)	(630)	(0.013)	(0.012)
Master's degree	1102	-1202	0.02	0.04**
	(1223)	(1074)	(0.02)	(0.018)
No qualification	-8728***	136	-0.22***	-0.13***
	(879)	(792)	(0.015)	(0.013)
Education not specified	-1818	10125	-0.04	-0.1*
	(3280)	(2544)	(0.088)	(0.059)
Other NZ secondary school qualification	-5928***	1556	-0.16***	-0.08*
	(1995)	(1848)	(0.056)	(0.046)
Other post-school qualification	-5261***	269	-0.1***	-0.07***
	(1029)	(941)	(0.017)	(0.014)
Overseas secondary school qualification	-6529***	-574	-0.16***	-0.09***
	(907)	(789)	(0.022)	(0.019)
Postgraduate and honours degrees	1864*	-823	0.04***	0.01
	(1007)	(907)	(0.015)	(0.013)
Māori	-682*	-282	0.01	0.01
	(379)	(342)	(0.012)	(0.01)
Pacific	-3455***	118	-0.07***	0.03**
	(435)	(357)	(0.016)	(0.013)
Asian	-4240***	1542***	-0.1***	-0.01
	(614)	(562)	(0.014)	(0.012)
Middle Eastern/Latin American/African	-2622	3810**	-0.03	0.05
	(1759)	(1520)	(0.047)	(0.038)
Other ethnicity	-1418*	-684	-0.04**	-0.02
	(806)	(711)	(0.017)	(0.013)
Hours worked (from HES)	22***	5	0.00	0.00
	(4)	(3)	(0.000)	(0.000)
N	31,821	31,821	31,821	31,821
R-squared	0.079	0.115	0.153	0.192

Notes: Each column in this table shows the output of a separate regression. Columns (1) and (2) use the difference and absolute difference of HES and IRD incomes as the dependent variable. Columns (3) and (4) use the difference and absolute difference in log income as the dependent variable. The same observations are used in each regression. The observations are restricted to those with positive income (> $1) for both HES and IRD income and no missing covariates. The observation count has been randomly rounded to base 3. The leave out category for education is a bachelor's degree. The leave out ethnicity is European (though people can be in multiple ethnicity categories). The leave out age category is those 55 and over. Robust standard errors in parentheses. Stars denote: * p

Appendix F: Reliability ratios#

We believe most readers will find the correlation coefficients and scatter plot and mean comparisons of income sufficient comparisons of the HES-IRD income measures. However, many articles in the survey and administrative data literature use the reliability ratio to compare the two measures (for example, Abowd and Stinson (2013) and Hyslop and Townsend (2016a)), so for comparison with these papers, we calculate reliability ratios here.

The reliability ratio provides a measure of agreement between the two sources (with higher reliability ratios preferred) and is defined, for a measure of income from source a with respect to income source b, as

$$RR_a = \frac{Cov(Y_a, Y_b)}{Var(Y_a)}$$

where Y_a and Y_b are measures of income from sources a and b. This is the same as the regression coefficient β from the OLS regression Y_b = α + βY_a. Similarly, the reliability ratio for Y_b is calculated as the β from the OLS regression Y_a = α + βY_b.

If one assumes that Y_b = Y_truth, the reliability ratio for Y_a,

$$\frac{Cov(Y_a, Y_{truth})}{Var(Y_a)}$$

represents a measure of the truth-to-noise ratio for Y_a. Of course, if, as is usually the case in empirical research, one doesn't know which measure of income, Y_a or Y_b, is correct (if either), then the reliability ratio does not have this interpretation.[28] It is sometimes hoped that the measure with the lower variance and hence the higher reliability ratio will be closer to the truth, though in general this need not be true.

The reason for this hope is that we can write Var(Y_a) = Var(Y_truth) + Var(error_a) + 2Cov(Y_truth,error_a), and similarly Var(Y_b) = Var(Y_truth) + Var(error_b) + 2Cov(Y_truth,error_b), where error_a and error_b are defined as residuals from the equations Y_a = Y_truth + error_a and Y_b = Y_truth + error_b. Hence one hopes that, if Var(Y_a) > Var(Y_b) (ie, Y_b has the higher reliability ratio), then maybe Var(error_a) > Var(error_b), but of course this may not be the case, since Var(Y_a) > Var(Y_b) can hold even when Var(error_a) < Var(error_b), provided that 2[Cov(Y_truth,error_a) - Cov(Y_truth,error_b)] > Var(error_b) - Var(error_a).

Table 16 shows reliability ratios for HES and IRD total comparable income, using a range of different transformations and samples. Table 17 shows the reliability ratios for wage and salaries and self-employment income. In most cases, the reliability ratio is slightly higher for IRD income than HES income, and both HES and IRD reliability ratios are higher when we condition on positive income in each data set. We observe higher reliability ratios for wage and salary income than for overall income, at least once we condition on positive income in each data source.
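
The relationship between the reliability ratio, the OLS slope and the correlation coefficient can be illustrated with a short sketch. The data below are synthetic (a common signal plus source-specific noise), not HES or IRD data, and the function name `reliability_ratio` is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Synthetic stand-ins (NOT HES or IRD data): a common signal plus source-specific noise.
signal = rng.normal(40_000, 15_000, size=n)
y_a = signal + rng.normal(0, 10_000, size=n)  # noisier source
y_b = signal + rng.normal(0, 5_000, size=n)   # less noisy source

def reliability_ratio(y_own, y_other):
    """Cov(y_own, y_other) / Var(y_own): the OLS slope of y_other on y_own."""
    cov = np.cov(y_own, y_other)  # 2x2 sample covariance matrix
    return cov[0, 1] / cov[0, 0]

rr_a = reliability_ratio(y_a, y_b)
rr_b = reliability_ratio(y_b, y_a)
corr = np.corrcoef(y_a, y_b)[0, 1]
slope_b_on_a = np.polyfit(y_a, y_b, 1)[0]  # OLS slope from regressing y_b on y_a

print(f"RR_a={rr_a:.3f}, RR_b={rr_b:.3f}, corr^2={corr**2:.3f}")
```

Because both ratios share the same covariance in the numerator, the source with the lower variance has the higher reliability ratio, and the product of the two ratios equals the squared correlation. The same identity can be seen approximately in Table 16, where 0.741 × 0.848 is about 0.63, in line with the reported R-squared of 0.629.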

As well as analysing the reliability ratios based on the levels of income, we also analyse them using the inverse hyperbolic sine of income and the log of income. The inverse hyperbolic sine function is defined as

$$\frac{\sinh^{-1}(\theta y)}{\theta} = \frac{\ln\left(\theta y + \sqrt{\theta^2 y^2 + 1}\right)}{\theta}$$

In all transformations, we set θ=1 (different values of θ were tried and did not materially affect the results). The inverse hyperbolic sine is a log-like function and so unsurprisingly gives very similar results on the same samples as the log (the same reliability ratio in the last two columns of Table 16). However, unlike the log function, the inverse hyperbolic sine function is defined for zero and negative values so allows for comparisons that include those values. In any event, the different transformations all give a similar impression of the reliability ratio.
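
To illustrate why the inverse hyperbolic sine behaves like a log for large incomes while remaining defined at zero and below, the sketch below assumes the common form asinh(θy)/θ with θ = 1. The function name `ihs` and the example incomes are illustrative assumptions.

```python
import math

def ihs(y, theta=1.0):
    """Inverse hyperbolic sine transform: asinh(theta * y) / theta."""
    return math.asinh(theta * y) / theta

for income in (-5_000, 0, 1_000, 40_000):
    line = f"income={income:>7}: ihs={ihs(income):.3f}"
    if income > 0:
        # For large positive y and theta = 1, asinh(y) is close to ln(2y) = ln(y) + ln(2).
        line += f", ln(y)+ln(2)={math.log(income) + math.log(2):.3f}"
    print(line)
```

For positive incomes of realistic size, the transform differs from the log by an almost constant shift of ln(2). A constant shift does not change the slope of a regression, which is consistent with the identical reliability ratios and different constants in the last two columns of Table 16.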

Table 16: Reliability ratios for total comparable income
Dependent variable	IRD comparable income	IRD comparable income	asin(IRD comparable income)	asin(IRD comparable income)	ln(IRD comparable income)
Constant	7,650	8,710	2.61	2.35	2.198
Standard error	(670)	(890)	(0.05)	(0.08)	(0.079)
HES RR	0.741	0.771	0.715	0.784	0.784
Standard error	(0.021)	(0.023)	(0.005)	(0.008)	(0.008)
R-squared	0.629	0.659	0.534	0.639	0.639
N	53,136	42,006	53,136	42,006	42,006
Correlation coefficient	0.793	0.812	0.731	0.799	0.799
Conditional on positive income	NO	YES	NO	YES	YES

Table 16: Reliability ratios for total comparable income
Dependent variable	HES comparable income	HES comparable income	asin(HES comparable income)	asin(HES comparable income)	ln(HES comparable income)
Constant	6,370	7,000	2.23	2.04	1.912
Standard error	(500)	(730)	(0.06)	(0.07)	(0.068)
IRD RR	0.848	0.855	0.748	0.815	0.815
Standard error	(0.016)	(0.019)	(0.005)	(0.006)	(0.006)
R-squared	0.629	0.659	0.534	0.639	0.639
N	53,136	42,006	53,136	42,006	42,006
Correlation coefficient	0.793	0.812	0.731	0.799	0.799
Conditional on positive income	NO	YES	NO	YES	YES

Notes: The main purpose of this table is to report the HES reliability ratio (HES RR) and IRD reliability ratio (IRD RR). The HES RR (IRD RR) is computed from a regression of IRD (HES) income on HES (IRD) income and a constant. The coefficient on HES (IRD) from this regression is the HES (IRD) reliability ratio. Each column refers to a separate regression, with the constant, standard errors, R-squared and sample size reported. The regressions differ in the transformation of comparable income (levels, inverse hyperbolic sine (asin) and log) and in whether the sample is restricted to people with positive incomes in each data source. Sample sizes have been randomly rounded to base 3. The correlation coefficient for the same sample and income transform as the regression is also reported.
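
The calculation described in these notes can be sketched as follows. This is an illustration only, not the authors' code: the six paired incomes are invented, the function names are hypothetical, and standard errors are not computed. Each reliability ratio is the slope from a simple regression with a constant, which equals the covariance of the two measures divided by the variance of the regressor.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def ols(y, x):
    """Regress y on x and a constant; return (constant, slope)."""
    slope = cov(x, y) / cov(x, x)
    return mean(y) - slope * mean(x), slope


# Hypothetical incomes for the same six people (not real HES or IRD records).
hes = [0.0, 21000.0, 35000.0, 48000.0, 60000.0, 95000.0]
ird = [1500.0, 20000.0, 33000.0, 50000.0, 57000.0, 88000.0]

# HES RR: coefficient on HES income when IRD income is the dependent variable.
hes_constant, hes_rr = ols(ird, hes)
# IRD RR: coefficient on IRD income when HES income is the dependent variable.
ird_constant, ird_rr = ols(hes, ird)

corr = cov(hes, ird) / math.sqrt(cov(hes, hes) * cov(ird, ird))
print("HES RR:", hes_rr, "IRD RR:", ird_rr, "correlation:", corr)

# The "conditional on positive income" columns keep only people with
# positive income in both sources before running the same regressions.
positive = [(h, i) for h, i in zip(hes, ird) if h > 0 and i > 0]
pos_hes = [h for h, _ in positive]
pos_ird = [i for _, i in positive]
_, hes_rr_positive = ols(pos_ird, pos_hes)
print("HES RR, positive incomes only:", hes_rr_positive)
```

Both regressions share the same covariance in the numerator, so the source with the larger variance has the lower reliability ratio, and the two regressions have the same R-squared.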

Table 17: Reliability ratios for wage and self-employment income
Dependent variable	IRD wages	IRD wages	asin(IRD wages)	asin(IRD wages)	ln(IRD wages)	IRD self	IRD self	ln(IRD self)
Constant	4,460	4,990	1.13	1.92	1.8	3,900	19,820	4.846
Standard error	(120)	(160)	(0.02)	(0.09)	(0.087)	(120)	(2,390)	(0.332)
HES RR	0.724	0.881	0.773	0.823	0.823	0.39	0.43	0.507
Standard error	(0.002)	(0.003)	(0.003)	(0.008)	(0.008)	(0.046)	(0.058)	(0.032)
R-squared	0.622	0.79	0.594	0.67	0.67	0.116	0.382	0.307
N	53,136	29,613	53,136	29,613	29,610	53,136	2,013	2,013
Correlation coefficient	0.788	0.889	0.771	0.818	0.818	0.341	0.618	0.554
Conditional on positive income	NO	YES	NO	YES	YES	NO	YES	YES

Table 17: Reliability ratios for wage and self-employment income
Dependent variable	HES wages	HES wages	asin(HES wages)	asin(HES wages)	ln(HES wages)	HES self	HES self	ln(HES self)
Constant	7,170	5,020	1.97	2.07	1.941	690	13,020	4.013
Standard error	(130)	(160)	(0.03)	(0.09)	(0.082)	(180)	(3,110)	(0.264)
IRD RR	0.858	0.897	0.768	0.813	0.813	0.297	0.888	0.604
Standard error	(0.003)	(0.003)	(0.003)	(0.008)	(0.008)	(0.045)	(0.086)	(0.026)
R-squared	0.622	0.79	0.594	0.67	0.67	0.116	0.382	0.307
N	53,136	29,613	53,136	29,613	29,610	53,136	2,013	2,013
Correlation coefficient	0.788	0.889	0.771	0.818	0.818	0.341	0.618	0.554
Conditional on positive income	NO	YES	NO	YES	YES	NO	YES	YES

Notes: The main purpose of this table is to report the HES reliability ratio (HES RR) and IRD reliability ratio (IRD RR). The HES RR (IRD RR) is computed from a regression of IRD (HES) income on HES (IRD) income and a constant. The coefficient on HES (IRD) from this regression is the HES (IRD) reliability ratio. Each column refers to a separate regression, with the constant, standard errors, R-squared and sample size reported. The regressions differ in the definition of income (wage income and self-employment income), the transformation of income (levels, inverse hyperbolic sine (asin) and log) and in whether the sample is restricted to people with positive incomes in each data source. Sample sizes have been randomly rounded to base 3. The correlation coefficient for the same sample and income transform as the regression is also reported.
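
In a regression of one variable on another and a constant, the product of the two slopes (here, HES RR times IRD RR) equals the squared correlation coefficient, which is also the R-squared. The sketch below uses that identity as a rough consistency check on the figures reported in Tables 16 and 17. The column labels and the rounding tolerance are assumptions made for this illustration.

```python
# (table, column label, HES RR, IRD RR, R-squared) as reported in Tables 16 and 17.
REPORTED = [
    ("Table 16", "levels, all", 0.741, 0.848, 0.629),
    ("Table 16", "levels, positive", 0.771, 0.855, 0.659),
    ("Table 16", "asin, all", 0.715, 0.748, 0.534),
    ("Table 16", "asin, positive", 0.784, 0.815, 0.639),
    ("Table 16", "ln, positive", 0.784, 0.815, 0.639),
    ("Table 17", "wages, all", 0.724, 0.858, 0.622),
    ("Table 17", "wages, positive", 0.881, 0.897, 0.79),
    ("Table 17", "asin wages, all", 0.773, 0.768, 0.594),
    ("Table 17", "asin wages, positive", 0.823, 0.813, 0.67),
    ("Table 17", "ln wages, positive", 0.823, 0.813, 0.67),
    ("Table 17", "self, all", 0.39, 0.297, 0.116),
    ("Table 17", "self, positive", 0.43, 0.888, 0.382),
    ("Table 17", "ln self, positive", 0.507, 0.604, 0.307),
]

# Assumed allowance for the tables reporting two or three decimal places.
TOLERANCE = 0.002


def gaps(rows):
    """Absolute difference between HES RR x IRD RR and the reported R-squared."""
    return [(table, col, abs(hes_rr * ird_rr - r2))
            for table, col, hes_rr, ird_rr, r2 in rows]


for table, col, gap in gaps(REPORTED):
    status = "ok" if gap < TOLERANCE else "check"
    print(f"{table:9s} {col:22s} gap={gap:.4f} {status}")

largest_gap = max(gap for _, _, gap in gaps(REPORTED))
```

Every reported column satisfies the identity to within the assumed rounding tolerance.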

Notes

[28] Some papers, such as Abowd and Stinson (2013) and Kapteyn and Ypma (2007), calculate more sophisticated reliability ratios that include priors other than certainty-of-truth on each data source.

References#

Abowd, J. & Stinson, M. (2013). Estimating Measurement Error in Annual Job Earnings: A Comparison of Survey and Administrative Data. The Review of Economics and Statistics, 95(5).

Ball, C. (2016). Estimating Income Dynamics from Cross-Sectional Data Using Matching Techniques. Working Papers in Public Finance 06/2016.

Bound, J., Brown, C., Duncan, G. J. & Rodgers, W. L. (1994). Evidence on the validity of cross-sectional and longitudinal labor market data. Journal of Labor Economics, 12(3), 345-368.

Bound, J., Brown, C. & Mathiowetz, N. (2001). Measurement error in survey data. Handbook of Econometrics, 5, 3705-3843.

Bound, J. & Krueger, A. B. (1991). The extent of measurement error in longitudinal earnings data: Do two wrongs make a right? Journal of Labor Economics, 9(1), 1-24.

Britton, J., Shephard, N. & Vignoles, A. (2015). Comparing sample survey measures of English earnings of graduates with administrative data. Unpublished paper: Department of Economics, Harvard University.

Chapple, C. & Crichton, S. (2012). The Labour Market Activity of Work-tested Beneficiaries. Labour, Employment and Work in New Zealand Conference Paper.

Hyslop, D. & Townsend, W. (2016a). Earnings Dynamics and Measurement Error in Matched Survey and Administrative Data. Motu Working Paper 16/18.

Hyslop, D. & Townsend, W. (2016b). Employment misclassification in survey and administrative reports. Motu Working Paper 16/19.

Kapteyn, A. & Ypma, J. (2007). Measurement Error and Misclassification: A Comparison of Survey and Administrative Data. Journal of Labor Economics, 25(3), 513-551.

Miller, H. P. & Paley, L. R. (1958). Income reported in the 1950 Census and on income tax returns. In An Appraisal of the 1950 Census Income Data (pp. 177-204). Princeton, New Jersey: Princeton University Press.

Pischke, J. S. (1995). Measurement error and earnings dynamics: Some estimates from the PSID validation study. Journal of Business & Economic Statistics, 13(3), 305-314.

Samoilenko, A. & Law, D. (2014). KiwiSaver: Comparing Survey and Administrative Data. New Zealand Treasury Working Paper 14/06. http://purl.oclc.org/nzt/p-1635

Schroeder, L. & Sjoquist, D. (1976). Survey Reporting Errors and Class Income Interval Definitions. Social Science Quarterly, 56(4), 715-720.

Statistics New Zealand. (2013). Data integration manual: 2nd edition. Available from www.stats.govt.nz.

Statistics New Zealand. (2014). Linking methodology used by Statistics New Zealand in the Integrated Data Infrastructure project. Available from www.stats.govt.nz.

Suei, S. (2016). Comparing income information from census and administrative sources. Retrieved from www.stats.govt.nz.

Tax Working Group. (2010). A tax system for New Zealand's future: report of the Victoria University of Wellington Tax Working Group. Wellington, New Zealand: Victoria University of Wellington.
