UKMOD-Essex is a machine-learning-enhanced extension of UKMOD, the United Kingdom's only fully open-access tax-benefit microsimulation model, developed by the Centre for Microsimulation and Policy Analysis (CeMPA) at the University of Essex in collaboration with Essex County Council. The system uses a Gradient Boosted Machine (GBM) algorithm, likely implemented in XGBoost judging by the format of its tree output, to address a fundamental limitation of national survey-based policy modelling: the Family Resources Survey (FRS), which underpins UKMOD, is statistically representative only down to the Government Office Region (GOR) level, which makes policy analysis at the local authority or county level unreliable.
The GBM operates as a propensity score estimator within a three-stage hybrid pipeline. In the first stage, the algorithm is trained on a merged dataset that combines the national FRS (53,577 individuals in 25,045 households for the whole UK) with commercially available household-level data from Experian (1,861,043 individuals in 738,993 households for Greater Essex, covering practically 100% of the population at postcode level). The GBM uses 12 core covariates, including categorised age, tenure type, household size, presence and number of children by age band, labour market activity status, and equivalised income (residualised so that it does not disrupt covariate balance), plus 5 interaction terms capturing relationships such as age-by-retirement-status and children-by-household-size. For each household, the model estimates a propensity score between 0 and 1: the probability that the household belongs to the regional Experian dataset rather than the national FRS.
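To make the first stage concrete, the following is a minimal sketch of propensity estimation by gradient boosting, written in plain Python with one-split trees (stumps) rather than XGBoost. The toy rows, the two covariates, the hyperparameters, and the function names (`fit_stump`, `fit_gbm`, `predict_propensity`) are illustrative assumptions, not taken from the UKMOD-Essex code; the model described above uses 12 covariates and 5 interaction terms.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Return (feature, threshold, left value, right value) for the one-split
    tree that minimises squared error against the residuals."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            ml = sum(left) / len(left)
            mr = sum(right) / len(right)
            sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:
        raise ValueError("no valid split: every covariate is constant")
    return best[1:]

def fit_gbm(X, y, n_rounds=50, learning_rate=0.1):
    """Boost stumps on the log-loss gradient (y - p), starting from the base rate."""
    base = sum(y) / len(y)
    f0 = math.log(base / (1.0 - base))
    scores = [f0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, vl, vr = fit_stump(X, residuals)
        stumps.append((j, t, vl, vr))
        scores = [s + learning_rate * (vl if row[j] <= t else vr)
                  for row, s in zip(X, scores)]
    return {"f0": f0, "lr": learning_rate, "stumps": stumps}

def predict_propensity(model, row):
    """Probability that a row belongs to the regional (label 1) sample."""
    z = model["f0"]
    for j, t, vl, vr in model["stumps"]:
        z += model["lr"] * (vl if row[j] <= t else vr)
    return sigmoid(z)

# Hypothetical pooled sample: [age band, household size];
# label 1 = regional dataset, label 0 = national survey.
X = [[1, 1], [1, 2], [2, 1], [2, 2], [3, 3], [3, 4], [4, 3], [4, 4]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
model = fit_gbm(X, y)
propensity = [predict_propensity(model, row) for row in X]
```

Each entry of `propensity` lies strictly between 0 and 1, and rows that resemble the regional sample receive higher scores. A production library such as XGBoost adds regularisation, deeper trees, and second-order leaf updates, none of which this sketch attempts.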
In the second stage, these propensity scores are converted to Inverse Probability Weights (IPW), which are stabilised and capped at the 99th percentile (trimming the top 1%); nearest-neighbour matching with a caliper restriction is also applied. In the third stage, Iterative Proportional Fitting (IPF, also known as raking) calibrates the weights against official ONS population statistics for Greater Essex (1,841,192 individuals in 771,189 households across 14 districts, including the unitary authorities of Southend-on-Sea and Thurrock) so that the marginal distributions of age group, employment status, and household composition match.
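The weighting steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes odds-form weights p / (1 - p) for national-survey rows, treats "stabilised" as rescaling to mean 1, uses a nearest-rank percentile for the cap, and omits the caliper matching step. The propensity scores, the categories, and the target totals are made up, and the function names are hypothetical.

```python
def stabilised_capped_weights(propensity, cap_percentile=99):
    """Odds-form IPW, rescaled to mean 1, then capped at a percentile."""
    raw = [p / (1.0 - p) for p in propensity]
    mean_raw = sum(raw) / len(raw)
    stabilised = [w / mean_raw for w in raw]
    # Nearest-rank percentile, using integer ceiling division for the rank.
    rank = (cap_percentile * len(stabilised) + 99) // 100
    cap = sorted(stabilised)[rank - 1]
    return [min(w, cap) for w in stabilised], cap

def margin_totals(rows, weights, var):
    """Weighted total for each category of one variable."""
    totals = {}
    for row, w in zip(rows, weights):
        totals[row[var]] = totals.get(row[var], 0.0) + w
    return totals

def rake(rows, weights, targets, n_iter=200, tol=1e-12):
    """Iterative proportional fitting: rescale weights margin by margin
    until every weighted margin matches its target total."""
    w = list(weights)
    for _ in range(n_iter):
        worst = 0.0
        for var, target in targets.items():
            totals = margin_totals(rows, w, var)
            factors = {cat: target[cat] / totals[cat] for cat in target}
            w = [wi * factors[row[var]] for row, wi in zip(rows, w)]
            worst = max(worst, max(abs(f - 1.0) for f in factors.values()))
        if worst < tol:
            break
    return w

# Hypothetical inputs: 200 survey rows with evenly spread propensity scores.
propensity = [0.05 + 0.9 * i / 199 for i in range(200)]
rows = [{"age": "65+" if i % 3 == 0 else "16-64",
         "employed": "yes" if i % 2 else "no"} for i in range(200)]

ipw, cap = stabilised_capped_weights(propensity)

# Made-up population margins; both must sum to the same total.
targets = {"age": {"16-64": 1200.0, "65+": 800.0},
           "employed": {"yes": 1100.0, "no": 900.0}}
final = rake(rows, ipw, targets)
```

After raking, `margin_totals(rows, final, "age")` and `margin_totals(rows, final, "employed")` reproduce the target margins to within numerical tolerance, while the relative pattern of the capped propensity weights is preserved within each cell.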
The GBM was chosen over Random Forest after both approaches were tested: GBM achieved lower Standardised Mean Differences (SMDs) across all covariates, handled 10 or more socioeconomic predictors without degradation, and captured complex multi-way interactions more effectively. With more than 6 covariates, Random Forest's adjusted covariate balance was worse than the unadjusted baseline.
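For readers unfamiliar with the balance metric, here is a minimal sketch of how an SMD can be computed for one covariate before and after weighting. The data and the function names are hypothetical, and the choice to pool unweighted sample variances in the denominator is an assumed (though common) convention; the source does not state which variant the authors used.

```python
import math
import statistics

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def smd(target, source, source_weights=None):
    """Standardised mean difference of one covariate between two samples.

    The denominator pools the unweighted sample variances so the scale
    stays fixed before and after weighting (an assumed convention).
    """
    if source_weights is None:
        source_weights = [1.0] * len(source)
    pooled_sd = math.sqrt(
        (statistics.variance(target) + statistics.variance(source)) / 2
    )
    return (statistics.mean(target) - weighted_mean(source, source_weights)) / pooled_sd

# Hypothetical binary covariate (for example, "household has children").
regional = [1, 1, 0, 1]          # target sample, mean 0.75
national = [1, 0, 0, 0]          # source sample, mean 0.25
weights = [3.0, 1.0, 1.0, 1.0]   # made-up weights for the source rows

smd_unadjusted = smd(regional, national)           # 1.0
smd_adjusted = smd(regional, national, weights)    # 0.5
```

In this toy case the weights halve the imbalance. An absolute SMD below 0.1 is a widely used rule of thumb for acceptable balance, although the source does not say which threshold was applied.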
The resulting reweighted dataset enables UKMOD's standard rules-based tax-benefit simulation engine to produce regional estimates of employment income distributions, tax liabilities, benefit entitlements, and the distributional impacts of policy reforms at the Essex level. Macro-validation against external benchmarks shows close alignment: median monthly employment income is GBP 2,392 in UKMOD-Essex versus GBP 2,535 in ASHE (a gap of about 5.6%), and self-employment income matches the Survey of Personal Incomes benchmark exactly at GBP 3.20 billion when filtered to comparable definitions.
The system is part of the EUROMOD family of models jointly developed with the European Commission. UKMOD is released under a CC BY-NC-ND 4.0 licence (free for non-commercial use), and the EUROMOD software engine is open source under the EUPL-1.2 licence. The lead author, Rejoice Frimpong, is affiliated with both Essex County Council and CeMPA, confirming direct local government involvement in the development. The methodology is described in CeMPA Working Paper 9/25 (August 2025), in which the authors identify future directions including neural networks, XGBoost variants, hybrid ensemble models, and application to dynamic as well as static microsimulation.