Appendix F: Reliability ratios

We believe most readers will find the correlation coefficients, scatter plot and mean comparisons of income sufficient for comparing the HES and IRD income measures. However, many articles in the survey and administrative data literature use the reliability ratio to compare two measures (for example, Abowd and Stinson (2013) and Hyslop and Townsend (2016a)), so for comparability with these papers we calculate reliability ratios here.

The reliability ratio provides a measure of agreement between the two sources (with higher reliability ratios preferred). For a measure of income from source a with respect to income source b, it is defined as

$$RR_a = \frac{\mathrm{Cov}(Y_a, Y_b)}{\mathrm{Var}(Y_a)}$$

where $Y_a$ and $Y_b$ are the measures of income from sources a and b. This is the same as the regression coefficient $\beta$ from the OLS regression $Y_b = \alpha + \beta Y_a$. Similarly, the reliability ratio for $Y_b$ is the $\beta$ from the OLS regression $Y_a = \alpha + \beta Y_b$.
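The equivalence between the ratio Cov(Y_a, Y_b)/Var(Y_a) and the OLS slope can be checked numerically. The sketch below is illustrative only: the data are simulated (they are not HES or IRD figures), and the function name `reliability_ratio` and all parameter values are assumptions made for the example.

```python
import numpy as np


def reliability_ratio(y_a, y_b):
    """Reliability ratio of y_a with respect to y_b: Cov(y_a, y_b) / Var(y_a)."""
    y_a = np.asarray(y_a, dtype=float)
    y_b = np.asarray(y_b, dtype=float)
    cov_ab = np.cov(y_a, y_b, ddof=1)[0, 1]
    return cov_ab / np.var(y_a, ddof=1)


# Simulated incomes (hypothetical): a common "true" income plus independent noise.
rng = np.random.default_rng(0)
truth = rng.normal(50_000, 20_000, size=5_000)
y_a = truth + rng.normal(0, 10_000, size=5_000)  # noisier source
y_b = truth + rng.normal(0, 5_000, size=5_000)   # less noisy source

rr_a = reliability_ratio(y_a, y_b)
rr_b = reliability_ratio(y_b, y_a)

# OLS of y_b on y_a and a constant; polyfit returns [slope, intercept].
slope_b_on_a, intercept = np.polyfit(y_a, y_b, deg=1)

print(f"RR_a = {rr_a:.3f}, OLS slope of y_b on y_a = {slope_b_on_a:.3f}")
print(f"RR_b = {rr_b:.3f}")
```

In this simulated setting the less noisy source has the higher reliability ratio, but that ordering relies on the errors being independent of true income, which is an assumption of the example rather than a property of the HES or IRD data.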

If one assumes that $Y_b = Y_{\mathrm{truth}}$, the reliability ratio for $Y_a$ represents a measure of the truth-to-noise ratio for $Y_a$. Of course, as is usually the case in empirical research, one does not know which measure of income, $Y_a$ or $Y_b$, is correct (if either), and so the reliability ratio does not have this interpretation.[28] It is sometimes hoped that the measure with the lower variance, and hence the higher reliability ratio, will be closer to the truth, though in general this need not be true.
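To make the interpretation concrete, the following short derivation assumes that source b is the truth and writes source a as truth plus an error term. The final line additionally assumes that the error is uncorrelated with true income (classical measurement error); that assumption is introduced here for illustration only.

```latex
\begin{align*}
RR_a &= \frac{\mathrm{Cov}(Y_a, Y_{\mathrm{truth}})}{\mathrm{Var}(Y_a)},
        \qquad Y_a = Y_{\mathrm{truth}} + \mathrm{error}_a \\
     &= \frac{\mathrm{Var}(Y_{\mathrm{truth}}) + \mathrm{Cov}(Y_{\mathrm{truth}}, \mathrm{error}_a)}
             {\mathrm{Var}(Y_{\mathrm{truth}}) + \mathrm{Var}(\mathrm{error}_a)
              + 2\,\mathrm{Cov}(Y_{\mathrm{truth}}, \mathrm{error}_a)} \\
     &= \frac{\mathrm{Var}(Y_{\mathrm{truth}})}
             {\mathrm{Var}(Y_{\mathrm{truth}}) + \mathrm{Var}(\mathrm{error}_a)}
        \qquad \text{if } \mathrm{Cov}(Y_{\mathrm{truth}}, \mathrm{error}_a) = 0 .
\end{align*}
```

Under these assumptions the ratio lies between 0 and 1 and falls as the error variance rises, which is the sense in which a higher reliability ratio is preferred.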

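The reason for this hope is that we can write $\mathrm{Var}(Y_a) = \mathrm{Var}(Y_{\mathrm{truth}}) + \mathrm{Var}(\mathrm{error}_a) + 2\,\mathrm{Cov}(Y_{\mathrm{truth}}, \mathrm{error}_a)$ and, similarly, $\mathrm{Var}(Y_b) = \mathrm{Var}(Y_{\mathrm{truth}}) + \mathrm{Var}(\mathrm{error}_b) + 2\,\mathrm{Cov}(Y_{\mathrm{truth}}, \mathrm{error}_b)$, where $\mathrm{error}_a$ and $\mathrm{error}_b$ are defined by the equations $Y_a = Y_{\mathrm{truth}} + \mathrm{error}_a$ and $Y_b = Y_{\mathrm{truth}} + \mathrm{error}_b$. Hence one hopes that, if $\mathrm{Var}(Y_a) > \mathrm{Var}(Y_b)$ (ie, $Y_b$ has the higher reliability ratio), then $\mathrm{Var}(\mathrm{error}_a) > \mathrm{Var}(\mathrm{error}_b)$. Of course, this may not be the case. Subtracting the two decompositions gives $\mathrm{Var}(Y_a) - \mathrm{Var}(Y_b) = [\mathrm{Var}(\mathrm{error}_a) - \mathrm{Var}(\mathrm{error}_b)] + 2[\mathrm{Cov}(Y_{\mathrm{truth}}, \mathrm{error}_a) - \mathrm{Cov}(Y_{\mathrm{truth}}, \mathrm{error}_b)]$, so $\mathrm{Var}(Y_a) > \mathrm{Var}(Y_b)$ can hold even when $\mathrm{Var}(\mathrm{error}_a) \le \mathrm{Var}(\mathrm{error}_b)$, provided that $2[\mathrm{Cov}(Y_{\mathrm{truth}}, \mathrm{error}_a) - \mathrm{Cov}(Y_{\mathrm{truth}}, \mathrm{error}_b)] > \mathrm{Var}(\mathrm{error}_b) - \mathrm{Var}(\mathrm{error}_a)$.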
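A small simulation can illustrate how the covariance terms break the link between total variance and error variance. Everything in this sketch is hypothetical: the data are simulated on an arbitrary scale, and source b is given a mean-reverting error (negatively correlated with true income) purely to construct a counter-example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
truth = rng.normal(0.0, 1.0, size=n)

# Source a: error independent of the truth, variance about 0.10.
error_a = rng.normal(0.0, np.sqrt(0.10), size=n)
# Source b: mean-reverting error (assumed for illustration), variance about 0.29.
error_b = -0.5 * truth + rng.normal(0.0, 0.2, size=n)

y_a = truth + error_a
y_b = truth + error_b


def decomposition(truth, error):
    """Return Var(truth), Var(error) and Cov(truth, error) from the sample."""
    var_t = np.var(truth, ddof=1)
    var_e = np.var(error, ddof=1)
    cov_te = np.cov(truth, error, ddof=1)[0, 1]
    return var_t, var_e, cov_te


for name, y, err in [("a", y_a, error_a), ("b", y_b, error_b)]:
    var_t, var_e, cov_te = decomposition(truth, err)
    rebuilt = var_t + var_e + 2 * cov_te  # should equal Var(y)
    print(f"source {name}: Var(Y)={np.var(y, ddof=1):.3f} "
          f"rebuilt={rebuilt:.3f} Var(error)={var_e:.3f} Cov={cov_te:.3f}")
```

Here source b has the lower total variance, and hence the higher reliability ratio, even though its error variance is larger than that of source a.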

Table 16 shows reliability ratios for HES and IRD total comparable income, using a range of different transformations and samples. Table 17 shows the reliability ratios for wage and salary income and for self-employment income. In most cases, the reliability ratio is slightly higher for IRD income than for HES income, and both the HES and IRD reliability ratios are higher when we condition on positive income in each data set. We observe higher reliability ratios for wage and salary income than for overall income, at least once we condition on positive income in each data source.

As well as analysing the reliability ratios based on the levels of income, we also analyse them using the inverse hyperbolic sine of income and the log of income. The inverse hyperbolic sine function (labelled asin in Tables 16 and 17) is defined as

$$g(y; \theta) = \frac{\ln\left(\theta y + \sqrt{\theta^2 y^2 + 1}\right)}{\theta} = \frac{\sinh^{-1}(\theta y)}{\theta}.$$

In all transformations, we set θ = 1 (different values of θ were tried and did not materially affect the results). The inverse hyperbolic sine is a log-like function and so, unsurprisingly, gives very similar results to the log on the same samples (the reliability ratio is the same in the last two columns of Table 16). However, unlike the log function, the inverse hyperbolic sine function is defined for zero and negative values, so it allows for comparisons that include those values. In any event, the different transformations all give a similar impression of the reliability ratio.
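The log-like behaviour of the inverse hyperbolic sine can be seen in a few lines. This is a minimal sketch, assuming the standard form of the transformation with scale parameter θ; the function name `ihs` and the income values are illustrative and are not taken from the data.

```python
import numpy as np


def ihs(y, theta=1.0):
    """Inverse hyperbolic sine: ln(theta*y + sqrt((theta*y)**2 + 1)) / theta."""
    y = np.asarray(y, dtype=float)
    return np.arcsinh(theta * y) / theta


# Illustrative incomes, including a negative and a zero value.
incomes = np.array([-5_000.0, 0.0, 1_000.0, 50_000.0])
print(ihs(incomes))  # defined for every value, unlike np.log

# For large positive incomes (with theta = 1) the transform is close to ln(2y),
# so it differs from the log by roughly the constant ln(2).
positive = incomes[incomes > 0]
gap = ihs(positive) - np.log(positive)
print(gap, np.log(2))
```

Because the two transforms differ by an almost constant shift on positive incomes, a regression on that sample would be expected to give nearly the same slope under either transform, with only the constant changing. This is consistent with the last two columns of Table 16.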

Table 16: Reliability ratios for total comparable income

| Dependent variable | IRD comparable income | IRD comparable income | asin(IRD comparable income) | asin(IRD comparable income) | ln(IRD comparable income) |
| --- | --- | --- | --- | --- | --- |
| Constant | 7,650 | 8,710 | 2.61 | 2.35 | 2.198 |
| Standard error | (670) | (890) | (0.05) | (0.08) | (0.079) |
| HES RR | 0.741 | 0.771 | 0.715 | 0.784 | 0.784 |
| Standard error | (0.021) | (0.023) | (0.005) | (0.008) | (0.008) |
| R-squared | 0.629 | 0.659 | 0.534 | 0.639 | 0.639 |
| N | 53,136 | 42,006 | 53,136 | 42,006 | 42,006 |
| Correlation coefficient | 0.793 | 0.812 | 0.731 | 0.799 | 0.799 |
| Conditional on positive income | NO | YES | NO | YES | YES |

| Dependent variable | HES comparable income | HES comparable income | asin(HES comparable income) | asin(HES comparable income) | ln(HES comparable income) |
| --- | --- | --- | --- | --- | --- |
| Constant | 6,370 | 7,000 | 2.23 | 2.04 | 1.912 |
| Standard error | (500) | (730) | (0.06) | (0.07) | (0.068) |
| IRD RR | 0.848 | 0.855 | 0.748 | 0.815 | 0.815 |
| Standard error | (0.016) | (0.019) | (0.005) | (0.006) | (0.006) |
| R-squared | 0.629 | 0.659 | 0.534 | 0.639 | 0.639 |
| N | 53,136 | 42,006 | 53,136 | 42,006 | 42,006 |
| Correlation coefficient | 0.793 | 0.812 | 0.731 | 0.799 | 0.799 |
| Conditional on positive income | NO | YES | NO | YES | YES |

Notes: The main purpose of this table is to report the HES reliability ratio (HES RR) and the IRD reliability ratio (IRD RR). The HES RR (IRD RR) is computed from a regression of IRD (HES) income on HES (IRD) income and a constant. The coefficient on HES (IRD) income from this regression is the HES (IRD) reliability ratio. Each column refers to a separate regression, with the constant, standard errors, R-squared and sample size reported. The regressions differ in the transformation of comparable income (levels, inverse hyperbolic sine transform (asin) and log) and in whether the regression is restricted to people with positive incomes in each data source. Sample sizes have been randomly rounded to base 3. The correlation coefficient for the same sample and income transform as the regression is also reported.
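Because the HES RR is Cov(HES, IRD)/Var(HES) and the IRD RR is Cov(HES, IRD)/Var(IRD), their product should equal the squared correlation coefficient, which is also the R-squared of either regression. The short check below applies this identity to the figures reported in Table 16; the variable names are chosen for the example, and small gaps are expected because the published figures are rounded to three decimal places.

```python
# Values transcribed from Table 16, in column order.
hes_rr = [0.741, 0.771, 0.715, 0.784, 0.784]
ird_rr = [0.848, 0.855, 0.748, 0.815, 0.815]
r_squared = [0.629, 0.659, 0.534, 0.639, 0.639]
corr = [0.793, 0.812, 0.731, 0.799, 0.799]

# Product of the two reliability ratios for each column.
products = [h * i for h, i in zip(hes_rr, ird_rr)]

for p, r2, r in zip(products, r_squared, corr):
    print(f"HES RR x IRD RR = {p:.3f}   R-squared = {r2:.3f}   corr^2 = {r * r:.3f}")

# Largest discrepancy between the product and the reported R-squared.
max_gap = max(abs(p - r2) for p, r2 in zip(products, r_squared))
print(f"largest gap: {max_gap:.4f}")
```

The three quantities agree up to rounding in every column, which is a useful sanity check on the transcribed table.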

Table 17: Reliability ratios for wage and self-employment income

| Dependent variable | IRD wages | IRD wages | asin(IRD wages) | asin(IRD wages) | ln(IRD wages) | IRD self-employment | IRD self-employment | ln(IRD self-employment) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Constant | 4,460 | 4,990 | 1.13 | 1.92 | 1.8 | 3,900 | 19,820 | 4.846 |
| Standard error | (120) | (160) | (0.02) | (0.09) | (0.087) | (120) | (2,390) | (0.332) |
| HES RR | 0.724 | 0.881 | 0.773 | 0.823 | 0.823 | 0.39 | 0.43 | 0.507 |
| Standard error | (0.002) | (0.003) | (0.003) | (0.008) | (0.008) | (0.046) | (0.058) | (0.032) |
| R-squared | 0.622 | 0.79 | 0.594 | 0.67 | 0.67 | 0.116 | 0.382 | 0.307 |
| N | 53,136 | 29,613 | 53,136 | 29,613 | 29,610 | 53,136 | 2,013 | 2,013 |
| Correlation coefficient | 0.788 | 0.889 | 0.771 | 0.818 | 0.818 | 0.341 | 0.618 | 0.554 |
| Conditional on positive income | NO | YES | NO | YES | YES | NO | YES | YES |

| Dependent variable | HES wages | HES wages | asin(HES wages) | asin(HES wages) | ln(HES wages) | HES self-employment | HES self-employment | ln(HES self-employment) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Constant | 7,170 | 5,020 | 1.97 | 2.07 | 1.941 | 690 | 13,020 | 4.013 |
| Standard error | (130) | (160) | (0.03) | (0.09) | (0.082) | (180) | (3,110) | (0.264) |
| IRD RR | 0.858 | 0.897 | 0.768 | 0.813 | 0.813 | 0.297 | 0.888 | 0.604 |
| Standard error | (0.003) | (0.003) | (0.003) | (0.008) | (0.008) | (0.045) | (0.086) | (0.026) |
| R-squared | 0.622 | 0.79 | 0.594 | 0.67 | 0.67 | 0.116 | 0.382 | 0.307 |
| N | 53,136 | 29,613 | 53,136 | 29,613 | 29,610 | 53,136 | 2,013 | 2,013 |
| Correlation coefficient | 0.788 | 0.889 | 0.771 | 0.818 | 0.818 | 0.341 | 0.618 | 0.554 |
| Conditional on positive income | NO | YES | NO | YES | YES | NO | YES | YES |

Notes: The main purpose of this table is to report the HES reliability ratio (HES RR) and the IRD reliability ratio (IRD RR). The HES RR (IRD RR) is computed from a regression of IRD (HES) income on HES (IRD) income and a constant. The coefficient on HES (IRD) income from this regression is the HES (IRD) reliability ratio. Each column refers to a separate regression, with the constant, standard errors, R-squared and sample size reported. The regressions differ in the definition of income (wage income and self-employment income), the transformation of income (levels, inverse hyperbolic sine transform (asin) and log) and in whether the regression is restricted to people with positive incomes in each data source. Sample sizes have been randomly rounded to base 3. The correlation coefficient for the same sample and income transform as the regression is also reported.

Notes

• [28] Some papers, such as Abowd and Stinson (2013) and Kapteyn and Ypma (2007), calculate more sophisticated reliability ratios that include priors other than certainty-of-truth on each data source.