The Treasury

Using Integrated Administrative Data to Identify Youth Who Are at Risk of Poor Outcomes as Adults

2.4 Approach to the identification of target populations

Identifying risk factors and predicting risk

Logistic regression models were run against the four outcome variables described in section 1.2, covering the welfare, health, education and corrections domains. More than 60 potential risk factors, derived from a number of administrative data collections, were included in the modelling exercise. Models were run at each year of age, separately for females and males. Logistic regression with forward selection was used to construct, for each model, a reduced set of the risk factors that were most predictive of the outcome measure. These factors are listed in Appendix 1, along with the number of models in which each factor was included at each year of age.
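The forward-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the report does not state the entry criterion or the software used, so the stopping rule (a log-likelihood gain of at least 1.92, equivalent to a likelihood-ratio test at the 5% level with one degree of freedom), the gradient-ascent fitting routine, the factor names and the synthetic data are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(rows, y, cols, steps=300, lr=2.0):
    # Plain gradient ascent on the mean log-likelihood; w[0] is the intercept.
    w = [0.0] * (len(cols) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, target in zip(rows, y):
            x = [1.0] + [row[c] for c in cols]
            err = target - sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi + lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def log_likelihood(rows, y, cols):
    # Fit a model on the given columns and return its log-likelihood.
    w = fit_logistic(rows, y, cols)
    total = 0.0
    for row, target in zip(rows, y):
        x = [1.0] + [row[c] for c in cols]
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total += math.log(p) if target == 1 else math.log(1.0 - p)
    return total

def forward_select(rows, y, candidates, min_gain=1.92):
    # ASSUMED entry rule: add the factor with the largest log-likelihood gain,
    # and stop when the best gain is below 1.92 (half of 3.84, the 5%
    # chi-square critical value with 1 degree of freedom).
    selected = []
    best = log_likelihood(rows, y, selected)
    while len(selected) < len(candidates):
        trials = [(log_likelihood(rows, y, selected + [c]), c)
                  for c in candidates if c not in selected]
        ll, name = max(trials)
        if ll - best < min_gain:
            break
        selected.append(name)
        best = ll
    return selected

# Synthetic, hypothetical data: 20 people in each of the 8 factor combinations.
# The outcome rate depends on factor_a and factor_b but not on noise_flag.
rows, y = [], []
for a in (0, 1):
    for b in (0, 1):
        for noise in (0, 1):
            positives = 2 + 8 * a + 6 * b  # poor outcomes out of 20 per cell
            for i in range(20):
                rows.append({"factor_a": a, "factor_b": b, "noise_flag": noise})
                y.append(1 if i < positives else 0)

selected = forward_select(rows, y, ["factor_a", "factor_b", "noise_flag"])
print(selected)
```

In this toy data the uninformative factor is never admitted, which mirrors the idea of keeping only the factors that are most predictive of an outcome. In practice a statistical package would be used for fitting rather than hand-written gradient ascent.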

This process allowed us to identify the key risk predictors for each age/gender combination and calculate an estimated risk score for each individual in the target population. The estimated risk score was used to define an 'at-risk population' according to the above criteria, which could then be used to identify target populations with a higher than average probability of being at risk of poor longer-term outcomes.
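As a rough illustration of how an estimated risk score can be turned into an 'at-risk population', the sketch below scores four people with a fitted logistic model and flags those above a cut-off. The coefficients, the factor names, the person identifiers and the 0.3 cut-off are all hypothetical; the report's actual criteria for defining the at-risk population are set out elsewhere and are not reproduced here.

```python
import math

# HYPOTHETICAL coefficients for one age/gender model (not taken from the report).
coefficients = {"intercept": -2.0, "factor_a": 1.5, "factor_b": 1.0}

# ASSUMED cut-off on the predicted probability, for illustration only.
RISK_THRESHOLD = 0.3

def risk_score(person):
    # Predicted probability of the poor outcome under the logistic model.
    z = coefficients["intercept"]
    z += sum(coefficients[name] * person[name] for name in ("factor_a", "factor_b"))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical individuals with binary risk-factor indicators.
people = {
    "p1": {"factor_a": 0, "factor_b": 0},
    "p2": {"factor_a": 1, "factor_b": 0},
    "p3": {"factor_a": 0, "factor_b": 1},
    "p4": {"factor_a": 1, "factor_b": 1},
}

scores = {pid: risk_score(person) for pid, person in people.items()}
at_risk = sorted(pid for pid, score in scores.items() if score >= RISK_THRESHOLD)
print(at_risk)
```

A cut-off on a percentile of the score distribution, rather than on the probability itself, would be an equally plausible choice.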

As discussed in section 2.3, long-term outcomes were estimated using statistical matching. These estimated outcomes were then modelled against characteristics that were directly observed in the data, which may dilute the relationships between the characteristics and outcomes in the models. Because matching was undertaken on a limited set of characteristics, this dilution may not affect all characteristics equally. Some caution is therefore needed when interpreting the relative strength of the modelled relationships.

Defining and describing target populations

For each age group, a cluster analysis was undertaken to identify groups of individuals within the 'at-risk population'. Multiple correspondence analysis was first used to reduce the key categorical predictors from the regression modelling to a smaller number of continuous variables, and these were then used to identify a number of clusters at each year of age for females and males jointly.

The youth population was then split into the late teen population (aged 15 to 19) and the early 20s population (aged 20 to 24). Five fairly distinct groups of people with similar characteristics and at particular risk of poor outcomes were identified within each of these age groups. For the early 20s population, risk was defined primarily using the welfare and corrections outcome measures, as health and education outcomes could have already occurred at these ages, potentially conflating the risk and outcome measures.

The identification of target population groupings was informed by the factors that were most predictive of poor outcomes in the regression analysis (Step 1) and by the clusters identified in the cluster analysis (Step 2). The groupings were constructed using the following guiding criteria:

  • Parsimony – target populations should be able to be identified using only a few criteria.
  • Separation – overlap between target populations should be minimised.
  • High sensitivity – most people identified as being at risk should fall into at least one target population.
  • High specificity – most people identified as not being at risk should fall outside of the target populations.
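The separation, sensitivity and specificity criteria above can be expressed as simple proportions. The sketch below shows one way of computing them, assuming each person has an identifier, a known at-risk flag, and membership of zero or more target populations. The function name, group names and data are hypothetical, and the overlap measure (the share of covered people who fall into more than one group) is only one possible way of quantifying separation. Parsimony concerns how many criteria define each group and is not measured here.

```python
def evaluate_targets(population, at_risk, target_groups):
    # population and at_risk are sets of person ids;
    # target_groups maps a group name to the set of ids it captures.
    covered = set().union(*target_groups.values())
    not_at_risk = population - at_risk

    # Sensitivity: share of at-risk people captured by at least one group.
    sensitivity = len(at_risk & covered) / len(at_risk)

    # Specificity: share of not-at-risk people who fall outside every group.
    specificity = len(not_at_risk - covered) / len(not_at_risk)

    # Separation, measured here as overlap: share of covered people
    # who belong to more than one group (lower is better).
    in_multiple = {
        pid for pid in covered
        if sum(pid in members for members in target_groups.values()) > 1
    }
    overlap = len(in_multiple) / len(covered)

    return {"sensitivity": sensitivity, "specificity": specificity, "overlap": overlap}

# Hypothetical example: ten people, four of whom are at risk.
population = set(range(1, 11))
at_risk = {1, 2, 3, 4}
target_groups = {"group_1": {1, 2, 5}, "group_2": {2, 3}}

metrics = evaluate_targets(population, at_risk, target_groups)
print(metrics)
```

Here three of the four at-risk people are captured (sensitivity 0.75), five of the six not-at-risk people are excluded (specificity of about 0.83), and one of the four covered people sits in both groups (overlap 0.25).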