Verosynthea: Individual-level Australian population data calibrated to ABS Census


Web validator: dictionary, mapping, buckets

The web validator runs the same counterfactual test with no code. You upload a model and a data dictionary; it holds your other features at realistic Australian combinations and varies only age, sex, and country of birth. There are five steps: upload the model, upload the dictionary, confirm the mappings, assign buckets, and run.

Data dictionary format

A CSV with three columns describing each model feature. It tells the validator each feature's type and valid values, so neither has to be guessed.

feature_name,type,values
age,numeric,"18-75"
sex,categorical,"M,F"
income,numeric,"0-500000"
occupation,categorical,"engineer,teacher,nurse,driver,manager,other"
country_of_birth,categorical,"AU,NZ,GB,CN,IN,Other"
credit_score,numeric,"300-850"

type is one of numeric, categorical, or boolean.
numeric values are a min-max range; categorical values are a comma-separated list.
Errors are actionable: each one names the row and the fix, whether the cause is a categorical declared with a range, a numeric with non-numeric values, an empty value list, or an unknown type.

Download the template CSV.
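The error rules above can be sketched as a small local pre-check to run before uploading. This is a minimal illustration, not the validator's own code: the function name validate_dictionary, the message wording, and the decision to skip value checks for boolean rows (the format of boolean values is not specified here) are all assumptions. Ranges are parsed as low-high, so negative minimums are not handled.

```python
import csv
import io

ALLOWED_TYPES = {"numeric", "categorical", "boolean"}


def _is_number(text):
    """True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def _parse_range(text):
    """Return (low, high) for 'low-high', or None if text is not a numeric range."""
    low, sep, high = text.partition("-")
    if sep and _is_number(low) and _is_number(high):
        return float(low), float(high)
    return None


def validate_dictionary(csv_text):
    """Hypothetical pre-check: return a list of 'row N (feature): problem; fix' strings."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        name = (row.get("feature_name") or "").strip()
        kind = (row.get("type") or "").strip()
        values = (row.get("values") or "").strip()
        where = f"row {row_number} ({name})"

        if kind not in ALLOWED_TYPES:
            errors.append(f"{where}: unknown type '{kind}'; use numeric, categorical, or boolean")
        elif kind == "boolean":
            continue  # assumption: boolean rows need no value check
        elif not values:
            errors.append(f"{where}: empty value list; add a range or a comma-separated list")
        elif kind == "categorical" and _parse_range(values) is not None:
            errors.append(f"{where}: categorical declared with a range; list the categories or use numeric")
        elif kind == "numeric" and _parse_range(values) is None:
            errors.append(f"{where}: numeric with non-numeric values; use a min-max range such as 18-75")
    return errors
```

Calling validate_dictionary on the text of the example dictionary above returns an empty list, since every row there follows the rules.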

Mapping and buckets

The validator auto-detects which of your features are age, sex, and country of birth, and pre-fills the mapping. Mapping is a confirmation step, not a form: suggestions are pre-filled, clearing one is a single click, and skipping them all still runs (the test then reports near-zero direct sensitivity, which is the expected result for a model that takes no protected attribute as input).

Bucket assignment maps each of your own values to a reference category (for example, your M and F to Male and Female), so that slices match real Australian combinations. Bucket assignment is a Pro feature; on the free tier, mapped categorical features fall back to dictionary sampling.
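As a rough illustration of what a bucket assignment amounts to, the sketch below maps raw feature values to reference categories with a plain dictionary. The function name assign_buckets and the handling of unmapped values (returned as None and listed separately) are assumptions for this example, not the validator's behaviour.

```python
def assign_buckets(values, bucket_map, reference_categories):
    """Map each raw value to a reference category.

    Returns (mapped, unmapped): mapped has one entry per input value
    (None where no bucket was assigned); unmapped lists the distinct
    raw values that had no bucket.
    """
    bad_targets = set(bucket_map.values()) - set(reference_categories)
    if bad_targets:
        raise ValueError(f"not reference categories: {sorted(bad_targets)}")

    mapped, unmapped = [], set()
    for value in values:
        if value in bucket_map:
            mapped.append(bucket_map[value])
        else:
            mapped.append(None)
            unmapped.add(value)
    return mapped, sorted(unmapped)


# Example: the M / F mapping mentioned above.
mapped, unmapped = assign_buckets(
    ["M", "F", "X", "M"],
    {"M": "Male", "F": "Female"},
    {"Male", "Female"},
)
```

Surfacing the unmapped values, rather than silently dropping them, keeps the review step honest about which of your categories were left out of a slice.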

Country of birth

Birthplace is tested against the ABS one-digit birthplace regions, not a simple Australia versus overseas split. Your values are assigned to regions with an auto-mapped pre-fill that recognises ISO codes and common country names; you review it. Where one region maps to several of your values, the report states the weighting rule it used: ABS birthplace region shares when the values are recognised, or a uniform average otherwise. Results show region-level gaps plus an Australia versus overseas rollup.
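The weighting rule can be sketched as follows, assuming each of your values in a region has a single score to combine. The function name region_score is hypothetical, and the share figures in the usage example are made-up placeholders, not ABS statistics.

```python
def region_score(value_scores, known_shares):
    """Combine per-value scores for one region and report the rule used.

    value_scores: {your_value: model score for that value}
    known_shares: {recognised_value: population share}, any positive scale
    """
    values = list(value_scores)
    if not values:
        raise ValueError("region has no values")

    total = sum(known_shares.get(v, 0.0) for v in values)
    if all(v in known_shares for v in values) and total > 0:
        # Every value is recognised: weight by (renormalised) population share.
        weights = {v: known_shares[v] / total for v in values}
        rule = "population shares"
    else:
        # At least one value is unrecognised: fall back to a uniform average.
        weights = {v: 1.0 / len(values) for v in values}
        rule = "uniform average"

    score = sum(value_scores[v] * weights[v] for v in values)
    return score, rule


# Placeholder shares for illustration only (NOT real ABS figures).
placeholder_shares = {"GB": 3.0, "IE": 1.0}
weighted = region_score({"GB": 0.6, "IE": 0.2}, placeholder_shares)
fallback = region_score({"GB": 0.6, "XX": 0.2}, placeholder_shares)
```

Returning the rule alongside the score mirrors the report's habit of stating which weighting it applied.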

The validator maps to protected-attribute definitions and helps evidence how a model treats people. It is not a certification, and Verosynthea is not a certification body.

Install

From PyPI:

pip install verosynthea-validator

Requires Python 3.9+. Pulls in pandas>=1.5 and numpy>=1.23. Optional extras: datasets for the Hugging Face sample loader, httpx for the (forthcoming) paid-API client.

Quickstart

Load the free Hugging Face sample, score your model against it, and print a fairness report:

from verosynthea_validator import load_ausynth_sample, FairnessReport

df = load_ausynth_sample()                     # 5,000 rows from the HF dataset
df["prediction"] = my_model.predict(df)        # my_model: your own fitted model

report = FairnessReport(
    df,
    y_true="label",
    y_pred="prediction",
    protected_columns=["SEXP", "BPLP", "profile_name"],
)
print(report.run().summary())

The sample lives at huggingface.co/datasets/vero-synthea/ausynth-sample. It has 27 columns covering age, sex, income, occupation, education, family structure, and the 8 demographic profile assignments.

API reference

FairnessReport

Class that takes a scored DataFrame and a list of protected columns, computes group-wise metrics, and returns a structured report you can serialise or display.

FairnessReport(
    data: pd.DataFrame,
    y_true: str,
    y_pred: str,
    protected_columns: list[str],
)
.run() -> FairnessResults

FairnessResults.summary() renders a one-screen ASCII table. FairnessResults.to_dict() gives you a serialisable structure for logging and dashboards.

assert_fair

CI-gate helper. Raises FairnessAssertionError if any configured threshold is exceeded. Drop it into your test suite or model release pipeline.

assert_fair(
    data: pd.DataFrame,
    y_true: str,
    y_pred: str,
    *,
    max_accuracy_gap: float            = 0.05,
    max_demographic_parity_gap: float  = 0.10,
    max_equalised_odds_gap: float      = 0.10,
    protected_columns: list[str] | None = None,
) -> None
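To make the threshold idea concrete, here is a standalone sketch of a gap check for one protected column with binary labels and predictions. It is not the package's implementation: check_gaps, group_rates, and GateError are hypothetical names, only two of the three thresholds are shown, and the way the library aggregates across several protected columns is not covered. Note that the accuracy gap needs a y_true label column.

```python
from collections import defaultdict


class GateError(AssertionError):
    """Hypothetical stand-in for the error a fairness gate might raise."""


def group_rates(groups, y_true, y_pred):
    """Per-group accuracy and positive-prediction rate for binary 0/1 data."""
    stats = defaultdict(lambda: [0, 0, 0])  # [count, correct, predicted positive]
    for group, truth, pred in zip(groups, y_true, y_pred):
        entry = stats[group]
        entry[0] += 1
        entry[1] += int(truth == pred)
        entry[2] += int(pred == 1)
    return {
        group: {"accuracy": correct / n, "positive_rate": positive / n}
        for group, (n, correct, positive) in stats.items()
    }


def check_gaps(groups, y_true, y_pred,
               max_accuracy_gap=0.05, max_demographic_parity_gap=0.10):
    """Raise GateError if a max-minus-min gap across groups exceeds its limit."""
    rates = group_rates(groups, y_true, y_pred)
    accuracy = [r["accuracy"] for r in rates.values()]
    positive = [r["positive_rate"] for r in rates.values()]
    gaps = {
        "accuracy": max(accuracy) - min(accuracy),
        "demographic_parity": max(positive) - min(positive),
    }
    limits = {
        "accuracy": max_accuracy_gap,
        "demographic_parity": max_demographic_parity_gap,
    }
    failures = [
        f"{name} gap {gaps[name]:.3f} exceeds {limits[name]:.3f}"
        for name in gaps
        if gaps[name] > limits[name]
    ]
    if failures:
        raise GateError("; ".join(failures))
    return gaps
```

Subclassing AssertionError in the sketch means a test runner such as pytest reports a breach as an ordinary test failure.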

load_ausynth_sample

Convenience loader that downloads the free Hugging Face sample.

load_ausynth_sample(
    suburb: str = "paddington_4064",   # only the bundled sample for now
    cache_dir: str | None = None,
) -> pd.DataFrame

CI / CD gate

Block a model from shipping if fairness degrades beyond your thresholds. Drop this into pytest:

# tests/test_fairness.py
import pandas as pd
from verosynthea_validator import assert_fair
from my_app import score_batch

def test_model_is_fair_across_demographics():
    df = pd.read_parquet("fixtures/holdout.parquet")
    df["prediction"] = score_batch(df)

    assert_fair(
        df,
        y_true="label",
        y_pred="prediction",
        max_accuracy_gap=0.05,
        max_demographic_parity_gap=0.08,
        protected_columns=["SEXP", "BPLP", "profile_name"],
    )

Or call it as a GitHub Actions step against the HF sample so PRs get a fairness verdict before merge:

# .github/workflows/fairness.yml
- name: Fairness gate
  run: |
    pip install verosynthea-validator
    python -m my_app.evaluate_fairness  # imports assert_fair

Metrics computed

Every metric is computed from your model’s predictions alone, so none of them needs a labelled test set. They measure the direct effect of a protected attribute: how the prediction changes when nothing changes except age, sex, or country of birth.

Metric	What it measures	Tier
Counterfactual gap	Prediction change when only the protected attribute flips, as a population-weighted distribution (mean, p50, p90, max). The headline, severity-banded.	Free + Pro
Demographic parity gap	Difference in weighted positive rate across groups.	Free + Pro
Flip / exceedance rate	Share of combinations where the decision flips (hard label), or the score moves more than 10 points (probabilities).	Free + Pro
Disparate impact ratio	Lowest group positive rate divided by the highest (the ratio view of parity).	Free + Pro
Rank parity (xAUC)	How often one group scores above another. 0.5 is identical ranking.	Pro
Distribution distance	1-Wasserstein distance between the groups' score distributions, with an overlay plot.	Pro
Variance share	Share of total prediction variance attributable to the protected grid.	Pro
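Two of the rows above, the demographic parity gap and the disparate impact ratio, can be illustrated with a short sketch over weighted binary decisions. The function names are hypothetical and the weights are assumed to be positive population weights; this is one common way to compute these quantities, not necessarily the validator's exact method.

```python
def weighted_positive_rates(groups, decisions, weights):
    """Weighted share of positive (1) decisions per group."""
    totals, positives = {}, {}
    for group, decision, weight in zip(groups, decisions, weights):
        totals[group] = totals.get(group, 0.0) + weight
        positives[group] = positives.get(group, 0.0) + weight * decision
    return {group: positives[group] / totals[group] for group in totals}


def parity_metrics(rates):
    """Gap (highest minus lowest rate) and ratio (lowest over highest rate)."""
    lowest, highest = min(rates.values()), max(rates.values())
    return {
        "demographic_parity_gap": highest - lowest,
        # The ratio is undefined when no group receives a positive decision.
        "disparate_impact_ratio": lowest / highest if highest > 0 else float("nan"),
    }


rates = weighted_positive_rates(
    groups=["F", "F", "M", "M"],
    decisions=[1, 0, 1, 1],
    weights=[1.0, 3.0, 2.0, 2.0],
)
metrics = parity_metrics(rates)
```

The gap and the ratio are computed from the same per-group rates, which is why the table describes the ratio as the ratio view of parity.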

Metrics that compare predictions to real outcomes, such as equalised odds, predictive parity, and calibration, require your labelled data and are out of scope for this test. This validator measures how your model treats people; whether its accuracy is equal across groups is a question only your own labelled data can answer.
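The counterfactual idea behind the headline metric can be sketched without labels: score each combination of other features twice, changing only the protected attribute, and summarise the differences with population weights. The names below are hypothetical, the toy model exists only for illustration, and the 0.10 threshold assumes probability scores where "10 points" means 0.10. Percentiles are omitted because the weighted-quantile convention the validator uses is not stated.

```python
def counterfactual_gaps(predict, rows, attribute, value_a, value_b):
    """Absolute score change per row when only `attribute` changes."""
    gaps = []
    for row in rows:
        score_a = predict({**row, attribute: value_a})
        score_b = predict({**row, attribute: value_b})
        gaps.append(abs(score_a - score_b))
    return gaps


def summarise(gaps, weights, threshold=0.10):
    """Population-weighted mean, max, and share of weight above `threshold`."""
    total = sum(weights)
    mean = sum(g * w for g, w in zip(gaps, weights)) / total
    exceed = sum(w for g, w in zip(gaps, weights) if g > threshold) / total
    return {"mean": mean, "max": max(gaps), "exceedance_rate": exceed}


def toy_model(row):
    """Toy scorer for illustration: sex matters only for people over 50."""
    return 0.3 + (0.2 if row["sex"] == "M" and row["age"] > 50 else 0.0)


rows = [{"age": 30, "sex": "F"}, {"age": 60, "sex": "F"}]
weights = [3.0, 1.0]  # placeholder population weights
gaps = counterfactual_gaps(toy_model, rows, "sex", "M", "F")
summary = summarise(gaps, weights)
```

Because both scores come from the same row with one attribute changed, any non-zero gap is attributable to that attribute alone, which is what makes the test independent of labelled outcomes.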

Comparison to fairlearn / aif360

Both fairlearn and aif360 are broader fairness toolkits, with many metrics, many bias-mitigation algorithms, and lots of configuration. They're excellent if you're doing fairness research or building a custom pipeline.

verosynthea-validator is narrower on purpose: one Australia-calibrated reference population (AUSynth's 8 demographic profiles) and a one-liner CI gate. If you're shipping a model that serves Australian customers and you want a cheap pre-deploy check that it doesn't treat one demographic group noticeably worse than the others, use this. If you want to compose 30 mitigation strategies, use fairlearn.

Next steps

Read the README on GitHub for the full repo, tests, and issue tracker.
Skim the AUSynth dataset card on Hugging Face to see the columns the validator scores against.
For the full Australian population (not just the 5k sample), buy a data bundle: same calibration, same columns, just bigger.
For context on what AUSynth is, see /for-ai-labs or the methodology overview.