GBR-004

UKMOD-Essex — ML-Enhanced Regional Tax-Benefit Microsimulation

United Kingdom | Europe & Central Asia | High income | Pilot / Controlled Trial Phase | Confirmed

Centre for Microsimulation and Policy Analysis (CeMPA), Institute for Social and Economic Research (ISER), University of Essex; in collaboration with Essex County Council (lead author Frimpong is affiliated with both)

At a Glance

What it does: Synthetic dataset generation; Policy analysis, learning and M&E
Who runs it: Centre for Microsimulation and Policy Analysis (CeMPA), Institute for Social and Economic Research (ISER), University of Essex; in collaboration with Essex County Council (lead author Frimpong is affiliated with both)
Programme: UKMOD (UK Tax-Benefit Microsimulation Model)
Confidence: Confirmed
Deployment Status: Pilot / Controlled Trial Phase
Key Risks: Model-related risks
Key Outcomes: Macro-validation shows strong alignment with external benchmarks.
Source Quality: 3 sources (Government website / press release; Working paper / technical note; Academic journal article)

UKMOD-Essex is a machine learning-enhanced extension of UKMOD — the United Kingdom's only fully open-access tax-benefit microsimulation model — developed by the Centre for Microsimulation and Policy Analysis (CeMPA) at the University of Essex in collaboration with Essex County Council. The system uses a Gradient Boosted Machine (GBM) algorithm, likely implemented in XGBoost based on tree output format characteristics, to solve a fundamental limitation in national survey-based policy modelling: the Family Resources Survey (FRS), which underpins UKMOD, is only statistically representative at the Government Office Region (GOR) level, making sub-national policy analysis at the local authority or county level unreliable.

The GBM operates as a propensity score estimator within a three-stage hybrid pipeline. In the first stage, the algorithm is trained on a merged dataset combining the national FRS (53,577 individuals in 25,045 households for the whole UK) with commercially available household-level data from Experian (1,861,043 individuals in 738,993 households for Greater Essex, covering practically 100% of the population at postcode level). The GBM uses 12 core covariates — including categorised age, tenure type, household size, presence and number of children by age band, labour market activity status, and equivalised income (residualised to prevent it disrupting covariate balance) — plus 5 interaction terms capturing complex relationships such as age-by-retirement-status and children-by-household-size. The model estimates a propensity score between 0 and 1 for each household, representing the probability of belonging to the regional Experian dataset versus the national FRS.
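The first stage can be illustrated with a minimal sketch. It is not the authors' implementation: it stacks two synthetic samples, labels each record by origin, and fits a small hand-rolled gradient boosting model (depth-1 trees fitted to log-loss gradients) to estimate the probability of belonging to the regional sample. The two covariates, their distributions, the sample sizes, and all function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Least-squares regression stump: returns (feature, threshold, left value, right value)."""
    best = None
    for j in range(len(X[0])):
        # Candidate thresholds: every observed value except the largest,
        # so both sides of the split are non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]

def fit_gbm(X, y, n_rounds=50, learning_rate=0.3):
    """Simplified gradient boosting for binary log-loss with stumps as base learners."""
    base = sum(y) / len(y)
    f0 = math.log(base / (1 - base))          # initial score: log-odds of the regional share
    scores = [f0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of log-loss with respect to the score is (y - p).
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lm, rm = fit_stump(X, residuals)
        stumps.append((j, t, lm, rm))
        scores = [s + learning_rate * (lm if row[j] <= t else rm)
                  for s, row in zip(scores, X)]
    return f0, stumps, learning_rate

def predict_propensity(model, row):
    f0, stumps, lr = model
    score = f0 + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps)
    return sigmoid(score)

# Hypothetical toy data: [age band 0-3, tenure type 0-2].
random.seed(0)
def draw(weights):
    return random.choices(range(len(weights)), weights=weights)[0]

national = [[draw([1, 1, 1, 1]), draw([1, 1, 1])] for _ in range(300)]   # label 0
regional = [[draw([1, 1, 2, 3]), draw([3, 1, 1])] for _ in range(300)]   # label 1
X = national + regional
y = [0] * 300 + [1] * 300

model = fit_gbm(X, y)
scores = [predict_propensity(model, row) for row in X]   # one propensity per record
```

In practice a library implementation such as XGBoost would replace the hand-rolled booster; the point here is only the stack, label, fit, and score pattern.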

In the second stage, these propensity scores are converted to Inverse Probability Weights (IPW), which are stabilised and capped at the 99th percentile with the top 1% trimmed, and nearest-neighbour matching with caliper restriction is applied. In the third stage, Iterative Proportional Fitting (IPF/raking) calibrates the weights against official ONS population statistics for Greater Essex (1,841,192 individuals in 771,189 households across 14 districts including the unitary authorities of Southend-on-Sea and Thurrock) to ensure marginal distributions match for age groups, employment status, and household composition.
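A rough sketch of the second and third stages follows, under stated assumptions. The source does not give the exact weight formula, so the sketch uses one common formulation: odds weights e/(1-e) for national records, stabilised by the ratio of sample shares, then capped (winsorised) at the 99th percentile. The nearest-neighbour matching step is omitted. The raking function adjusts weights to hypothetical marginal totals; the records, totals, and function names are illustrative only.

```python
import math

def stabilised_odds_weights(propensity, share_regional):
    """Assumed formulation: w = ((1 - p) / p) * e / (1 - e), p = regional share of the stacked data."""
    k = (1 - share_regional) / share_regional
    return [k * e / (1 - e) for e in propensity]

def cap_at_percentile(weights, q=0.99):
    """Cap weights at the nearest-rank q-th percentile (values above it are set to the cap)."""
    ordered = sorted(weights)
    cap = ordered[max(0, math.ceil(q * len(ordered)) - 1)]
    return [min(w, cap) for w in weights]

def rake(records, weights, targets, n_iter=50):
    """Iterative proportional fitting.

    records: list of dicts, one per household
    targets: {variable: {category: population total}}
    """
    w = list(weights)
    for _ in range(n_iter):
        for var, totals in targets.items():
            # Current weighted total per category of this variable.
            current = {}
            for rec, wi in zip(records, w):
                current[rec[var]] = current.get(rec[var], 0.0) + wi
            # Scale each record so the weighted margin hits its target.
            w = [wi * totals[rec[var]] / current[rec[var]] for rec, wi in zip(records, w)]
    return w

# Hypothetical inputs: four households, made-up propensities and totals.
propensity = [0.2, 0.5, 0.8, 0.5]
records = [
    {"age": "under_65", "employed": "yes"},
    {"age": "under_65", "employed": "no"},
    {"age": "65_plus", "employed": "yes"},
    {"age": "65_plus", "employed": "no"},
]
targets = {
    "age": {"under_65": 60.0, "65_plus": 40.0},
    "employed": {"yes": 55.0, "no": 45.0},
}

base = cap_at_percentile(stabilised_odds_weights(propensity, share_regional=0.5))
final = rake(records, base, targets)
```

Raking only guarantees that the listed margins match; joint distributions are inherited from the propensity-based starting weights, which is why the two steps are complementary.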

The GBM was chosen over Random Forest after both approaches were tested: GBM achieved lower Standardised Mean Differences (SMDs) across all covariates, handled 10+ socioeconomic predictors without degradation, and captured complex multi-way interactions more effectively. With more than 6 covariates, Random Forest's adjusted covariate balance was worse than the unadjusted baseline.
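The balance diagnostic used in this comparison can be sketched as follows. This is a minimal illustration, assuming a weighted SMD defined as the difference in weighted means divided by the pooled weighted standard deviation; the source does not state which variant the authors used, and the data and function names are hypothetical.

```python
import math

def weighted_mean(x, w):
    return sum(xi * wi for xi, wi in zip(x, w)) / sum(w)

def weighted_var(x, w):
    m = weighted_mean(x, w)
    return sum(wi * (xi - m) ** 2 for xi, wi in zip(x, w)) / sum(w)

def smd(x_a, w_a, x_b, w_b):
    """Standardised mean difference between two weighted samples for one covariate."""
    pooled_sd = math.sqrt((weighted_var(x_a, w_a) + weighted_var(x_b, w_b)) / 2)
    return (weighted_mean(x_a, w_a) - weighted_mean(x_b, w_b)) / pooled_sd

# Hypothetical binary covariate (1 = owner-occupier).
national = [0, 0, 0, 1]          # 25% owners before reweighting
regional = [0, 1, 1, 1]          # 75% owners in the target region
equal = [1, 1, 1, 1]

smd_before = smd(national, equal, regional, equal)
# Up-weighting the single national owner brings the weighted share to 75%.
smd_after = smd(national, [1, 1, 1, 9], regional, equal)
```

An absolute SMD below 0.1 is a common rule of thumb for acceptable balance; here the unadjusted value is far above it, and the reweighted value is zero by construction.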

The resulting reweighted dataset enables UKMOD's standard rules-based tax-benefit simulation engine to produce regional estimates of employment income distributions, tax liabilities, benefit entitlements, and distributional impacts of policy reforms at the Essex level. Macro-validation against external benchmarks shows strong alignment: median monthly employment income is GBP 2,392 in UKMOD-Essex versus GBP 2,535 in the Annual Survey of Hours and Earnings (ASHE), a gap of GBP 143 or about 5.6%, and self-employment income matches the Survey of Personal Incomes benchmark exactly at GBP 3.20 billion when filtered to comparable definitions.

The system is part of the EUROMOD family of models jointly developed with the European Commission. UKMOD is released under a CC BY-NC-ND 4.0 licence (free, non-commercial), and the EUROMOD software engine is open-source under the EUPL-1.2 licence. The lead author, Rejoice Frimpong, is affiliated with both Essex County Council and CeMPA, confirming direct local government involvement in the development. The methodology is described in CeMPA Working Paper 9/25 (August 2025). The authors note future directions including neural networks, XGBoost variants, hybrid ensemble models, and application to dynamic (not just static) microsimulation.

Classifications follow the DCI AI Hub Taxonomy.

Social Protection Functions

Policy
  • Legal and policy frameworks (primary)
Programme design
  • Benefits and service package
SP Pillar (Primary): Social assistance
  Definition: The social protection branch: social assistance, social insurance, or labour market programmes.
SP Pillar (Secondary): Social insurance
Programme Name: UKMOD (UK Tax-Benefit Microsimulation Model)
Programme Type: Other
  Definition: The type of social protection programme, classified under social assistance, social insurance, or labour market programmes.
System Level: Policy
  Definition: Where in the social protection system the AI is applied: policy level, programme design, or implementation/delivery chain.
Programme Description: UKMOD is the UK's only fully open-access tax-benefit microsimulation model, covering all four nations (England, Scotland, Wales, Northern Ireland). Part of the EUROMOD family jointly developed with the European Commission. Simulates effects of taxes and social benefits on household incomes and work incentives. Both model and underlying data are freely available. Online version (UKMOD Explore) allows non-specialists to design and run policy scenarios.
Implementation Type: Classical ML
  Definition: How the AI output is produced: Classical ML, Deep learning, Foundation model, or Hybrid. Affects validation, compute requirements, and governance profile.
Lifecycle Stage: Integration and Deployment
  Definition: Current stage in the AI lifecycle, from problem identification through to monitoring, maintenance and decommissioning.
Model Provenance: Developed in-house
  Definition: Origin of the AI model: developed in-house, adapted from open-source, commercial/proprietary, or accessed via third-party API.
Compute Environment: On-premise
  Definition: Where the AI system runs: on-premise, government cloud, commercial cloud, or edge/device.
Sovereignty Quadrant: I (Sovereign AI Zone)
  Definition: Classification of data and compute sovereignty: I (Sovereign), II (Federated/Hybrid), III (Cloud with safeguards), or IV (Shared Innovation Zone).
Data Residency: Domestic
  Definition: Where the data used by the AI system is stored: domestic, regional, or international.
Cross-Border Transfer: None
  Definition: Whether data crosses national borders, and if so, whether documented safeguards are in place.
Decision Criticality: Low
  Definition: The rights impact of the decision the AI supports. High criticality requires HITL oversight; moderate requires HOTL; low may operate HOOTL.
Human Oversight Type: HITL
  Definition: Level of human involvement: Human-in-the-Loop (active review), Human-on-the-Loop (monitoring), or Human-out-of-the-Loop (periodic audit).
Development Process: Fully in-house
  Definition: Whether the AI system was developed fully in-house, through a mix of in-house and third-party, or fully by an external provider.
Highest Risk Category: Model-related risks
  Definition: The most significant structural risk source identified: data, model, operational, governance, or market/sovereignty risks.
Risk Assessment Status: Informal assessment
  Definition: Whether a formal risk assessment, informal assessment, or independent audit has been conducted for this system.

Impact Dimensions

Autonomy, human dignity and due process
Equality, non-discrimination, fairness and inclusion
  • Data minimisation controls
  • Independent evaluation
Category: Administrative data from other sectors
Sensitivity: Personal
Cross-System Linkage: Links data across multiple systems
Availability: Currently available and used
Key Constraints: Experian data is commercially available, compiled from administrative records and commercial sources. Not a probability sample; uses modelled estimates. Variable definitions may differ from FRS.

Category: Survey and census data
Sensitivity: Non-personal
Cross-System Linkage: Links data across multiple systems
Availability: Currently available and used
Key Constraints: FRS is representative at Government Office Region (GOR) level only, not at local authority level. This is the core limitation the ML approach addresses.

CeMPA (n.d.) 'UKMOD', Centre for Microsimulation and Policy Analysis, University of Essex. [Government website / press release]

Frimpong, R. & Richiardi, M. (2025) 'Machine learning regionalisation of input data for microsimulation models: An application of a hybrid GBM / IPF method to build a tax-benefit model for the Essex region in the UK', CeMPA Working Paper 9/25, University of Essex. [Working paper / technical note]

Richiardi, M., Collado, D. & Popova, D. (2021) 'UKMOD – A new tax-benefit model for the four nations of the UK', International Journal of Microsimulation, 14(1), pp. 92-108. [Academic journal article]

Deployment Status: Pilot / Controlled Trial Phase
  Definition: How far the system has progressed into real-world operational use, from concept/exploration through to scaled and institutionalised.
Year Initiated: 2025
  Definition: The year the AI system was first initiated or development began.
Scale / Coverage: Greater Essex region: 1,841,192 individuals in 771,189 households (ONS, March 2023). Covers 14 Greater Essex districts including Southend-on-Sea and Thurrock unitary authorities. National FRS input: 53,577 individuals in 25,045 households. Experian regional data: 1,861,043 individuals in 738,993 households.
  Definition: The scale and geographic or population coverage of the deployment.
Funding Source: University of Essex / CeMPA (ESRC-funded centre). EUROMOD engine jointly funded with European Commission.
  Definition: The source(s) of funding for the AI system development and deployment.
Technical Partners: Alliance for Microsimulation and Policy Analysis CIC (co-developer of UKMOD); Experian (commercial regional data provider); EUROMOD software engine (open-source, EUPL-1.2 licence)
  Definition: External technology vendors, academic partners, or development partners involved.
Outcomes / Results: Macro-validation shows strong alignment with external benchmarks. Median monthly employment income: GBP 2,392 (UKMOD-Essex) vs GBP 2,535 (ASHE), within the expected range given different data sources and definitions. Self-employment income matches the SPI benchmark exactly when filtered to comparable definitions (GBP 3.20 billion). Post-weighting Standardised Mean Differences are below 0.1 for most variables. GBM outperformed Random Forest on covariate balance diagnostics.
Challenges: GBM training is computationally expensive. ML introduces complexity in model selection, overfitting prevention, and interpretability. Performance depends on the quality of the Experian data and may not replicate in regions with sparse or inconsistent commercial data. Post-weighting raking is still needed, indicating that GBM alone cannot capture all dimensions of population heterogeneity.

How to Cite

DCI AI Hub (2026). 'UKMOD-Essex — ML-Enhanced Regional Tax-Benefit Microsimulation', AI Hub AI Tracker, case GBR-004. Digital Convergence Initiative. Available at: https://socialprotectionai.org/use-case/GBR-004 [Accessed: 1 April 2026].

Change History

Updated 31 Mar 2026, 06:35
by system (system)
Created 30 Mar 2026, 08:39
by v2-import (import)