The Treasury

4.1 Regression model factor selection and estimation

Predictive factors selected in the models

Appendix 1 Table 1 shows the number of times each characteristic is selected across the different models at each age. This gives a broad idea of the characteristics that are important in predicting risk as someone ages through the late teen years and into their 20s. However, care is needed in interpreting the importance of these selections.

The choice of factors to include in a forward selection modelling procedure is heavily dependent on the factors already selected in the model. Where a factor is highly correlated with another factor already included, it may not add much to the model and hence not be selected for the final model. With very slightly different data, the reverse may be true. For example, duration spent on a benefit is closely associated with the type of benefit, and each may be related to future time on a benefit. In cases where the duration is slightly more predictive of future benefit receipt and hence added to the model first, benefit type may not be included, even though it is also predictive of future receipt. With slightly different data (and possibly depending on what other variables are already added to the model, for example, use of mental health services or early parenting status), benefit type may be included but not duration.
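The effect of correlated factors on forward selection can be illustrated with a minimal sketch. It is not the procedure used in this report: it swaps the logistic regression for a simple linear model so that it runs without external libraries, the stopping threshold min_gain is a hypothetical choice, and the data and factor names (benefit_duration, benefit_type, other_factor) are simulated purely for illustration.

```python
import random


def centre(values):
    """Subtract the mean so that an intercept is implicitly included."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def residualise(vec, basis):
    """Remove from vec its projection onto each (mutually orthogonal) basis vector."""
    for b in basis:
        coef = dot(vec, b) / dot(b, b)
        vec = [x - coef * y for x, y in zip(vec, b)]
    return vec


def forward_select(factors, outcome, min_gain=0.02):
    """Greedy forward selection for a linear model.

    min_gain is a hypothetical stopping threshold on the increase in
    R-squared; it is an assumption, not the rule used in the report.
    """
    resid = centre(outcome)
    total = dot(resid, resid)
    remaining = {name: centre(col) for name, col in factors.items()}
    basis, selected = [], []
    while remaining:
        best = None  # (gain in R-squared, factor name, orthogonalised column)
        for name, col in remaining.items():
            v = residualise(col, basis)
            norm = dot(v, v)
            if norm < 1e-12:
                continue  # fully explained by factors already selected
            gain = dot(resid, v) ** 2 / (norm * total)
            if best is None or gain > best[0]:
                best = (gain, name, v)
        if best is None or best[0] < min_gain:
            break  # no remaining factor adds enough to the model
        gain, name, v = best
        selected.append(name)
        basis.append(v)
        resid = residualise(resid, [v])
        del remaining[name]
    return selected


def make_toy_data(seed, n=2000):
    """Simulate two near-duplicate factors and one independent factor."""
    rng = random.Random(seed)
    factors = {"benefit_duration": [], "benefit_type": [], "other_factor": []}
    outcome = []
    for _ in range(n):
        shared = rng.gauss(0, 1)  # common signal behind the two benefit factors
        other = rng.gauss(0, 1)
        factors["benefit_duration"].append(shared + 0.1 * rng.gauss(0, 1))
        factors["benefit_type"].append(shared + 0.1 * rng.gauss(0, 1))
        factors["other_factor"].append(other)
        outcome.append(shared + 0.5 * other + rng.gauss(0, 1))
    return factors, outcome


for seed in range(5):
    factors, outcome = make_toy_data(seed)
    print(seed, forward_select(factors, outcome))
```

In this toy set-up, both benefit factors are about equally predictive, but once one of them is in the model the other adds very little and falls below the threshold. Which of the two is chosen depends on small differences in the simulated data, so it can change from one seed to the next.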

Nevertheless, there are some interesting patterns in the risk factors selected for the models. Broadly speaking, as people age from 15 to 22, we have more information about them that can be used to predict future outcomes. At age 15, there were 42 potential factors used in the modelling, while by age 18, there were more than 60. With an increase in the number of potential factors, more factors were generally selected for the models. On average, 15.6 factors were used per model at age 15, increasing to around 21 at ages 18 to 20. While fewer factors were used in the models at ages 21 and (especially) 22, these were ages at which only welfare and corrections outcomes were being predicted, with fewer models run as a result.

Some specific patterns are evident in the table and are worth pointing out:

  • Some factors are clearly predictive across all outcomes and most ages. The most prominent of these is ethnicity, which was included in all 58 models. In addition, 'Notified to CYF care and protection as a child' was included in 56 models (and all models up to age 20), 'Maternal caregiver education/benefit status' was included in 53 models (and all models up to age 19), and 'Referred to youth justice' was included in 48 models.
  • The only factors not included in any model were the 'Early parent (before age 19)' indicator and the 'Had own child in placement or with maltreatment finding' indicator. The former may be highly correlated with some benefit types, while the latter is closely linked to other indicators regarding interactions around the young person's child(ren), many of which were included in a few models.
  • As might be expected, characteristics relating to school-level qualifications were mainly important during the mid to late teenage years. The NCEA level 1 achievement indicator was used in all models at age 16 but no models after age 18, whilst levels 2 and 3 were most important at ages 18 to 20. Having been stood down from school was a significant factor for most models at ages 15 to 20, having been suspended from school was important at ages 15 to 16 and being recorded as being truant from school was important to most outcomes at age 17. Having received special education services was predictive in at least half of the models at all ages. School decile was important in most models at ages 15 to 17.
  • A number of factors were constructed relating to the enrolment and completion of tertiary qualifications, and these measures were included in various models from ages 18 to 21.
  • Simple yes or no indicators of employment were included in five of the models at ages 15 and 16, but the level of earnings became more important as a predictor of outcomes by the later teen years. Depending on the model, the factor selected related to the previous year or the previous two years. However, the variables are highly correlated, and the distinction may not be meaningful. Time spent NEET was important from ages 17 to 21 (not being available prior to age 17), with different factors constructed that covered different time periods. Indicators of benefit status, type and duration were included in all models from age 18, the minimum age of eligibility for most types of benefit.
  • Factors related to the young person's caregiver were particularly important at the younger ages. Having a caregiver with a community sentence was included in almost all models through to age 18, while having a caregiver with a custodial sentence was included in half of the models at age 15.
  • Unsurprisingly, having accessed mental health services or having received a community or custodial sentence at any age were important predictors of poor mental health and corrections outcomes respectively. However, they were also broadly predictive of poor outcomes across other domains. In the case of corrections sentences, whether the sentence was custodial or community appears to be of limited importance in predicting outcomes. However, accessing alcohol or drug services appears to predict outcomes quite differently from accessing other mental health services, with the latter being much more broadly predictive across multiple outcome domains.

Model discrimination

The area under the receiver operating characteristic (ROC) curve indicates how well each model is able to differentiate between those young people at each age who go on to have poor outcomes as adults and those who do not. The ROC statistic is a measure of how well a logistic regression model fits the data. Specifically, it measures how well the model discriminates between those with and without the outcome of interest. A value of 0.5 indicates discrimination no better than chance, while a value of 1 indicates perfect discrimination.
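One way to see what the ROC statistic measures is to compute it directly as the proportion of (poor outcome, no poor outcome) pairs in which the person with the poor outcome was given the higher predicted risk, counting ties as half. The sketch below does this for a handful of made-up scores. The function name roc_auc and the example values are illustrative only and are not taken from the report's models.

```python
def roc_auc(scores, outcomes):
    """Area under the ROC curve, computed as a pairwise comparison.

    scores   -- predicted risks, one per person
    outcomes -- 1 if the person had the poor outcome, 0 otherwise
    """
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one person with and one without the outcome")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # correctly ranked pair
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up example: three people with the outcome, three without.
example_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
example_outcomes = [1, 1, 1, 0, 0, 0]
print(roc_auc(example_scores, example_outcomes))  # 8 of 9 pairs ranked correctly
```

The pairwise loop is quadratic in the number of people, which is fine for illustration; for large data sets a rank-based calculation gives the same value more efficiently.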

The areas under the ROC curves for each of the 54 models that were run are given in Table 10 below. The model that fitted least well was that predicting future mental health outcomes for 15-year-old females (ROC statistic of 0.64), while the models that fitted best were generally those predicting a corrections sentence or long-term benefit receipt at ages 20 to 22 (ROC statistics consistently above 0.8). The average across all 54 models was 0.80, indicating that the models were generally good at predicting who would experience a poor future outcome.

Comparing females to males, there was little difference in the ROC statistic, with the models for females having slightly better fit in general but only marginally so. Consistent with both more information becoming available over time (often closely linked to the outcomes of interest) and increasing proximity to the outcome period, predictions generally improved as a person aged. Average ROC statistics increased from around 0.75 at age 15 to almost 0.9 at age 22.

Some future outcomes also appear to be easier to predict at an early age than others. Averaged across ages 15 to 19 (ages 20 to 22 are excluded since not all outcomes were modelled), corrections and welfare outcomes had higher ROC statistics than the other two outcomes on average and at each year of age. Across all ages, the use of mental health services was clearly the most difficult to predict, with ROC scores considerably smaller than for other outcomes. This is perhaps not surprising given the earlier descriptive analysis, which showed less clear differentiation in mental health outcomes across key socio-demographic characteristics such as ethnicity, deprivation decile and school decile.

High ROC statistics at the older ages (especially 19 years and over) reflect the availability of measures that are closely related to the outcomes being modelled (for example, benefit receipt), as well as the close proximity of the age at which outcomes are measured. At age 19, for example, it is relatively easy to predict whether somebody will achieve a level 4 qualification by age 23, as qualifications achieved up to age 19 are known, as is the level of any current study being undertaken at that age.

Table 10: Areas under receiver operating characteristic (ROC) curves for each youth outcome model

                Model by outcome
Age             No Level 2/4 Quals *  Mental Health  Corrections sentence  Long-term benefit  Average

Females
15              0.77                  0.64           0.82                  0.80               0.76
16              0.80                  0.66           0.83                  0.81               0.78
17              0.75                  0.68           0.85                  0.83               0.78
18              0.79                  0.70           0.85                  0.84               0.80
19              0.85                  0.72           0.87                  0.86               0.83
20              n/a                   0.76           0.88                  0.87               0.84
21              n/a                   n/a            0.89                  0.88               0.88
22              n/a                   n/a            0.89                  0.89               0.89
Average 15-19   0.79                  0.68           0.85                  0.83               0.79

Males
15              0.74                  0.66           0.78                  0.77               0.74
16              0.77                  0.68           0.79                  0.79               0.76
17              0.74                  0.69           0.81                  0.81               0.76
18              0.77                  0.70           0.82                  0.83               0.78
19              0.83                  0.72           0.83                  0.86               0.81
20              n/a                   0.75           0.84                  0.88               0.84
21              n/a                   n/a            0.85                  0.89               0.86
22              n/a                   n/a            0.86                  0.89               0.88
Average 15-19   0.77                  0.69           0.81                  0.81               0.77

Average ALL     0.78                  0.70           0.84                  0.84               0.80

* Level 2 qualifications were modelled at ages 15 and 16, and level 4 at older ages.
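The summary figures quoted in the text can be checked against the table with a short calculation. The sketch below copies the published values from Table 10 and recomputes the number of models, the overall average and the average for each outcome. It assumes that the first block of rows relates to females and the second to males, as the text's reference to a ROC statistic of 0.64 for 15-year-old females suggests. The variable names are illustrative. Because the published values are already rounded to two decimal places, recomputed averages may differ slightly from those that would be obtained from unrounded figures.

```python
OUTCOMES = ("quals", "mental_health", "corrections", "benefit")

# Values copied from Table 10; None marks models that were not run (n/a).
# The females/males labels are an assumption based on the surrounding text.
TABLE_10 = {
    "females": {
        15: (0.77, 0.64, 0.82, 0.80),
        16: (0.80, 0.66, 0.83, 0.81),
        17: (0.75, 0.68, 0.85, 0.83),
        18: (0.79, 0.70, 0.85, 0.84),
        19: (0.85, 0.72, 0.87, 0.86),
        20: (None, 0.76, 0.88, 0.87),
        21: (None, None, 0.89, 0.88),
        22: (None, None, 0.89, 0.89),
    },
    "males": {
        15: (0.74, 0.66, 0.78, 0.77),
        16: (0.77, 0.68, 0.79, 0.79),
        17: (0.74, 0.69, 0.81, 0.81),
        18: (0.77, 0.70, 0.82, 0.83),
        19: (0.83, 0.72, 0.83, 0.86),
        20: (None, 0.75, 0.84, 0.88),
        21: (None, None, 0.85, 0.89),
        22: (None, None, 0.86, 0.89),
    },
}


def mean(values):
    """Average of the values that are present, ignoring n/a entries."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


all_values = [
    v
    for block in TABLE_10.values()
    for row in block.values()
    for v in row
    if v is not None
]
n_models = len(all_values)
overall_average = mean(all_values)

average_by_outcome = {
    name: mean(
        [row[i] for block in TABLE_10.values() for row in block.values()]
    )
    for i, name in enumerate(OUTCOMES)
}

print("models:", n_models)
print("overall average:", round(overall_average, 2))
for name, value in average_by_outcome.items():
    print(name, round(value, 2))
```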
