The Treasury


Health and Labour Force Participation WP 10/03

Appendix C

Methods

Pooled logistic regressions

Initially, binomial logistic regression models were fitted to the data to quantify the relationship between the presence of different chronic diseases and labour force participation and between self-rated health and labour force participation, while holding all other variables constant. In the standard pooled regression models, responses in each wave were pooled together to form one large sample. Therefore each respondent had up to three responses in the sample. The fact that observations from the same person in different waves were not independent of each other, and therefore the error terms in the model were likely to be correlated, was accounted for by treating people as clusters.
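The pooling and clustering described above can be sketched in a few lines. This is an illustration only, not the code used for the paper: the function names, the single binary covariate `x` (standing in for one chronic disease indicator) and the toy person-wave records are all assumptions, and no finite-sample correction is applied to the clustered variance.

```python
import math

def _prob(b0, b1, x):
    """Logistic probability of participating for linear predictor b0 + b1*x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit_pooled_logit(rows, n_iter=25):
    """Maximum likelihood fit by Newton-Raphson on pooled person-wave rows.

    rows: list of (person_id, y, x), one row per person per wave.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0            # score vector
        a00 = a01 = a11 = 0.0    # information matrix (negative Hessian)
        for _pid, y, x in rows:
            p = _prob(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            a00 += w
            a01 += w * x
            a11 += w * x * x
        det = a00 * a11 - a01 * a01
        # Newton step: add (information matrix)^-1 times the score
        b0 += (a11 * g0 - a01 * g1) / det
        b1 += (a00 * g1 - a01 * g0) / det
    return b0, b1

def cluster_robust_se(rows, b0, b1):
    """Sandwich standard errors with scores summed within each person."""
    a00 = a01 = a11 = 0.0
    scores = {}                  # person_id -> summed score vector
    for pid, y, x in rows:
        p = _prob(b0, b1, x)
        w = p * (1.0 - p)
        a00 += w
        a01 += w * x
        a11 += w * x * x
        s = scores.setdefault(pid, [0.0, 0.0])
        s[0] += y - p
        s[1] += (y - p) * x
    det = a00 * a11 - a01 * a01
    i00, i01, i11 = a11 / det, -a01 / det, a00 / det   # inverse information
    m00 = sum(s[0] * s[0] for s in scores.values())    # "meat" of the sandwich
    m01 = sum(s[0] * s[1] for s in scores.values())
    m11 = sum(s[1] * s[1] for s in scores.values())
    v00 = i00 * (m00 * i00 + m01 * i01) + i01 * (m01 * i00 + m11 * i01)
    v11 = i01 * (m00 * i01 + m01 * i11) + i11 * (m01 * i01 + m11 * i11)
    return math.sqrt(v00), math.sqrt(v11)

# Hypothetical data: four people, up to three waves each.
rows = [
    (1, 1, 0), (1, 1, 0), (1, 0, 0),
    (2, 1, 0), (2, 0, 1),
    (3, 0, 1), (3, 1, 1), (3, 0, 1),
    (4, 0, 0), (4, 1, 1), (4, 0, 1),
]
b0, b1 = fit_pooled_logit(rows)
se0, se1 = cluster_robust_se(rows, b0, b1)
```

Summing the score contributions within each person before forming the outer products is what allows a respondent's errors to be correlated across waves; the coefficient estimates themselves are the same as in an unclustered pooled fit.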

A binomial logistic regression model is suitable as the dependent variable (L) is a binary response variable equal to one for those respondents who are participating and zero for those who are not participating (the latter was the reference category when a binomial logistic regression was carried out). The form of the equation can be seen in Figure C1. The unemployment rate at the time of the interview was included to reflect the possible differences in participation owing to the economic climate at the interview date. Maximum likelihood estimation was used to estimate the regression coefficients.[44]

A multinomial logistic regression was then fitted to the data to quantify the impact of the presence of diseases on the chance of being in one of the four labour market outcomes while holding all other variables constant. This aimed to determine if the impact of the presence of each disease was consistent across each labour market outcome. As there are more than two response categories in the dependent variable, there is now more than one logistic regression model. Each model is the same as that in Figure C1 with the L indicator replaced with indicators for full-time, part-time and unemployed (LFTi, LPTi and LUi respectively), with the reference category being those who are inactive. The formula for the probability of success in each case is similar to that for the binomial logistic regression, but with the denominator being one plus the sum of the odds of success across the three response categories (that is, excluding the reference category, which contributes the one).
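As a small illustration of how the multinomial probabilities are formed, the sketch below converts three linear predictors (the log-odds of each outcome relative to inactive) into probabilities for all four outcomes. The function name and the example values are assumptions made for illustration; the reference category contributes 1 to the denominator because its odds relative to itself are 1.

```python
import math

def multinomial_logit_probs(eta_ft, eta_pt, eta_u):
    """Probabilities of the four labour market outcomes.

    Each eta is a linear predictor: the log-odds of that outcome
    relative to the reference category (inactive).
    """
    odds = {
        "full_time": math.exp(eta_ft),
        "part_time": math.exp(eta_pt),
        "unemployed": math.exp(eta_u),
    }
    denom = 1.0 + sum(odds.values())   # the 1.0 is the reference category
    probs = {outcome: value / denom for outcome, value in odds.items()}
    probs["inactive"] = 1.0 / denom
    return probs

# Hypothetical linear predictors for one respondent.
example = multinomial_logit_probs(1.2, 0.3, -0.8)
```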

The main limitation of standard binomial and multinomial logistic regressions is that they do not allow for endogeneity. In other words, they assume that the explanatory variables are exogenous; that is, their values are not affected by labour force participation or by other unobserved characteristics. However, this assumption may not be strictly true for any generic health measure (Hi), and the failure to account for endogeneity means that any significant relationships that are established are associations and do not imply causality. That is, the fact that the model may establish a relationship between the dependent and predictor variables does not mean that the predictor variables caused the outcome (Tabachnick and Fidell, 2001).

Figure C1 - Form of binomial logistic regression model

where:             

Li = a binary response variable for participation for the ith person equal to one if participating and zero otherwise

1(.) = an indicator function that takes the value one or zero according to whether the value in parentheses is true or false

= a vector of regression coefficients

CDi = a vector of chronic disease indicators

Xi = a vector of explanatory variables

ui = error term associated with person i

= odds of success

Note: The relationship between the responses for each person in the different waves (ie, time = 1, 2 or 3) is accounted for by identifying people as clusters. 
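One way to write out a model consistent with the definitions in Figure C1 is sketched below. The symbols β1 and β2 for the coefficient vectors are assumptions, since the coefficient symbol is not given in the text, and the sketch should be read as a plausible form rather than the exact equation in the figure.

```latex
\[
L_i = \mathbf{1}\left(\beta_1' CD_i + \beta_2' X_i + u_i > 0\right),
\qquad
\frac{P(L_i = 1)}{1 - P(L_i = 1)} = \exp\left(\beta_1' CD_i + \beta_2' X_i\right)
\]
```

The second expression is the odds of success, which takes this exponential form when the error term follows a standard logistic distribution.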

Fixed and random effects panel logistic regression[45]

While there were a number of control variables included in the standard pooled regressions, there may be some important individual characteristics that were not observed. The unobserved variables may significantly influence participation; they may influence (or be correlated with) ill health; or they may influence both of these. When the omitted variables are correlated with health, the estimates of the relationship between health and participation from the pooled regression model will be biased because the error term in the model will be correlated with the health variable (that is, health is endogenous, not exogenous, therefore violating an assumption of the logistic regression analysis).

One advantage of SoFIE is its panel aspect; that is, there are up to three observations per person. This opens up the prospect of fixed or random effects panel models to allow for time-constant unobserved heterogeneity. A fixed effects model exploits the panel nature of the data to determine how health shocks (changes in health) over time relate to changes in labour force participation, allowing for time-invariant omitted variables that may be correlated with the explanatory variables (ie, the endogenous health variable).

The fixed effects model is derived from the starting equation in Figure C2. The error term from the standard pooled regression model, ui, now has a time dimension and is made up of two components. These are αi, the time-constant unobserved variables for the ith person, which may or may not be correlated with Hit, and the error term εit, which includes the true error and any unobserved variables that are time-varying. It is assumed that the time-variant unobserved variables are not correlated with the explanatory variables so that the error term, εit, is not correlated with Lit or Hit.

Conditional logistic analysis differs from regular logistic regression in that data are grouped (with those who exhibit no changes in the outcome variable over the periods considered dropped) and the likelihood is calculated relative to each group; that is, a conditional likelihood is used. The conditional likelihoods do not involve αi, so the αi do not need to be estimated (Stata, 2007). The model compares changes in the covariates with a change in the dependent variable. The coefficients indicate the relationship between a change in that covariate and the chance of participating. One drawback of the fixed effects model is that it removes all explanatory variables from the model which are time-invariant; for example, gender.[46] It also drops all respondents for whom the dependent variable (labour force participation) did not change over time. This significantly reduced the sample available for analysis.
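The way the conditional likelihood removes αi can be checked numerically. The sketch below is illustrative only: the function names and the linear predictors are assumptions, with `xb` holding one person's linear predictor in each wave excluding αi. It conditions on the person's total number of waves spent participating and compares the result with the same quantity computed from the full joint probabilities for several values of αi.

```python
import math
from itertools import product

def joint_probability(y, xb, alpha):
    """Unconditional probability of the outcome sequence y, given alpha_i."""
    prob = 1.0
    for y_t, xb_t in zip(y, xb):
        p = 1.0 / (1.0 + math.exp(-(alpha + xb_t)))
        prob *= p if y_t == 1 else 1.0 - p
    return prob

def same_total_sequences(y):
    """All 0/1 sequences with the same length and number of ones as y."""
    return [seq for seq in product((0, 1), repeat=len(y)) if sum(seq) == sum(y)]

def conditional_likelihood(y, xb):
    """P(y | number of waves participating); alpha_i does not appear."""
    num = math.exp(sum(y_t * v for y_t, v in zip(y, xb)))
    den = sum(
        math.exp(sum(d * v for d, v in zip(seq, xb)))
        for seq in same_total_sequences(y)
    )
    return num / den

def conditional_from_joint(y, xb, alpha):
    """Same conditional probability, built from the joint probabilities."""
    total = sum(joint_probability(seq, xb, alpha) for seq in same_total_sequences(y))
    return joint_probability(y, xb, alpha) / total

# Hypothetical person: inactive in wave 1, participating in waves 2 and 3.
y = (0, 1, 1)
xb = [0.2, -0.5, 1.0]
cl = conditional_likelihood(y, xb)
```

For a respondent whose outcome never changes, such as `(1, 1, 1)`, the only sequence with the same total is the observed one, so the conditional likelihood equals one whatever the coefficients are. Such respondents carry no information, which is why they drop out of the fixed effects sample.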

Figure C2 - Initial form of the fixed and standard random effects logistic panel model

where:             

Lit = a binary response variable for participation for the ith person at time t

1(.) = an indicator function that takes the value one or zero according to whether the value in parentheses is true or false

= a vector of regression coefficients

Hit = a vector of variables to indicate self-rated health

Xit = a vector of explanatory variables

αi = unobserved time-invariant variables

εit = idiosyncratic error representing unobserved factors that change over time and affect Lit (Note: αi + εit = uit)

Fixed effects model:

Random effects model:
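For reference, the textbook transformations for a linear panel model are sketched below. The generic outcome y, regressors x and composite error v are assumed notation, not the paper's, and the fixed effects logistic model itself is estimated by conditional likelihood, so these expressions only illustrate the idea of subtracting all, or a fraction, of each variable's time average.

```latex
\[
\text{Fixed effects:}\quad
y_{it} - \bar{y}_i = \beta'(x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)
\]
\[
\text{Random effects:}\quad
y_{it} - \theta\bar{y}_i = \beta_0(1 - \theta) + \beta'(x_{it} - \theta\bar{x}_i) + (v_{it} - \theta\bar{v}_i),
\qquad
\theta = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T\sigma_\alpha^2}}
\]
```

Here v_it = α_i + ε_it, T is the number of time periods, and σ²_α and σ²_ε are the variances of the time-invariant and idiosyncratic components. Setting θ = 1 recovers the fixed effects transformation and θ = 0 the pooled model.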

An alternative way to control for unobserved time-invariant variables is to use a random effects model. The starting form of this model is the same as that presented in Figure C2; however, this time the assumption is that while the unobserved variables influence the dependent variable (labour force participation), they are not correlated with health. This means that the coefficient estimates from the standard pooled regression will not suffer from omitted variable bias, but that the error terms in the model will be serially correlated. The random effects model subtracts a fraction of the time-averaged value of each variable, where the fraction depends on the variation of the unobserved variables, the variation of the idiosyncratic error and the number of time periods (for more explanation, see Wooldridge, 2006). The advantage of the method is that it retains explanatory variables that are constant over time and respondents whose participation did not change. This means that the sample size available for analysis is not reduced as with the fixed effects model and that estimates of the effect of time-constant variables are provided. However, the assumption that the omitted variables are not correlated with health is a disadvantage, given that the unobserved variables that are correlated with health are of concern.

One way to use the random effects model where some of the unobserved time-constant variables are thought to be correlated with health is to make an assumption about the relationship between health and the unobserved time-invariant variables. This is the correlated random effects model. More specifically, as shown in Figure C3, it can be assumed that the expected value of the unobserved variables is equal to a linear function of the average time spent in each health state over the three waves, together with a random term representing the unobserved time-invariant variables that are not correlated with health. Substituting this expected value into the starting equation for the fixed effects model results in the remaining unobserved time-invariant variables being uncorrelated with health. A random effects model can therefore be used.
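The extra regressors needed for the correlated random effects model are each person's proportion of observed waves spent in each self-rated health state. A minimal sketch of that calculation follows; the record layout `(person_id, wave, state)` and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def health_state_proportions(records, n_states=5):
    """Proportion of observed waves each person spends in each health state.

    records: iterable of (person_id, wave, state), with state coded
    1 (excellent) to 5 (poor). Returns {person_id: [share in state 1, ...]}.
    """
    counts = defaultdict(lambda: [0] * n_states)
    for person_id, _wave, state in records:
        counts[person_id][state - 1] += 1
    return {
        person_id: [c / sum(state_counts) for c in state_counts]
        for person_id, state_counts in counts.items()
    }

# Hypothetical panel: person 1 observed in three waves, person 2 in two.
records = [
    (1, 1, 2), (1, 2, 2), (1, 3, 3),
    (2, 1, 5), (2, 2, 4),
]
proportions = health_state_proportions(records)
```

Each person's proportions are constant across waves, so they enter the model as time-invariant covariates alongside the wave-specific health indicators.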

Figure C3 - Equations used in the correlated random effects logistic regression panel model

From Figure C2 the starting form of the fixed effects equation is:

Where:

i = person = 1, ..., n
t = time = 1, 2, 3
It is assumed that:

where:

j = health state = 1 (excellent), ..., 5 (poor)

Hit = a vector of variables to indicate self-rated health

For each health state = Proportion of time in the health state

For each person

ηi = unobserved time-invariant variables that are not correlated with health

and Cov(Hit, ηi) = 0

Combining equations (1) and (2) gives the standard form of the random effects model:
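The three equations in Figure C3 can be sketched as follows. The coefficient symbols β, γ and δ_j and the notation for the proportion of time in each health state are assumptions chosen to match the definitions listed in the figure, so this is a plausible reconstruction rather than the figure itself. If the model includes an intercept, one health state would need to act as the reference because the proportions sum to one.

```latex
\[
L_{it} = \mathbf{1}\left(\beta' H_{it} + \gamma' X_{it} + \alpha_i + \varepsilon_{it} > 0\right) \tag{1}
\]
\[
\alpha_i = \sum_{j} \delta_j \bar{H}_{ij} + \eta_i,
\qquad
\bar{H}_{ij} = \frac{1}{T_i}\sum_{t=1}^{T_i} \mathbf{1}\left(\text{person } i \text{ is in health state } j \text{ at time } t\right) \tag{2}
\]
\[
L_{it} = \mathbf{1}\left(\beta' H_{it} + \gamma' X_{it} + \sum_{j} \delta_j \bar{H}_{ij} + \eta_i + \varepsilon_{it} > 0\right)
\]
```

Here T_i is the number of waves in which person i is observed (up to three).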

Results for both the fixed and correlated random effects models are presented in this paper. While the fixed and correlated random effects panel models go further than the standard pooled regression, there are drawbacks. Firstly, the models only account for omitted variables that are time-constant, so any time-variant unobserved effects are in the error term. The assumption is that these time-varying omitted variables are uncorrelated with participation or with any of the explanatory variables. Secondly, while using fixed or correlated random effects models to look at how health changes are related to participation changes within respondents does control for the subjective nature of the self-rated health question (in the sense that some people will consistently be more optimistic in their health rating and some consistently more pessimistic), these models do not control for the other health measurement issues with self-rated health outlined in Section 4.2.2. Thirdly, these models do not allow the feedback effect to be estimated. Finally, an issue with the fixed effects model is that it only looks at how changes in health relate to changes in participation. It does not include estimates of the effect of poor health which possibly prevents a person working in the first place. This average health effect for the three waves is picked up in part in the correlated random effects model. However, if the assumption for the correlated random effects model, that the expectation of the correlated unobserved time-invariant variables is a linear function of the average time in a health state, is incorrect, this model will be flawed.

Notes

  • [44] Fitting models separately for each gender was considered. However, for all chronic diseases other than psychiatric conditions the relationship between chronic disease and participation was in the same direction and of the same magnitude irrespective of gender. Further, for each disease the confidence intervals for the coefficients overlapped for males and females. For this reason, and owing to the relatively small numbers with certain diseases such as cancer, it was decided to fit the model for combined genders with interactions included for parameter estimates that appeared to differ by gender. These were psychiatric conditions, social marital status and the presence of children. This approach was continued when considering self-rated health to aid comparability.
  • [45] This section draws heavily on unpublished lecture notes by Dean Hyslop.
  • [46] Further, it is considered best practice to remove from the model specification all variables that may change over time, but are more or less fixed in reality.