For AI / ML teams

Australian training and test data. No PII, no privacy review.

Privacy-preserving synthetic Australian demographic data, calibrated to ABS Census 2021, for ML training, customer profiling, demo data, and fairness testing.

Get the free 5k sample →
Read the methodology →

Free 5k sample · open-source validator

Built for ML / AI workflows

Hugging Face dataset
Pandas / Polars / PyTorch / TensorFlow ready
Parquet / CSV / Arrow
Validated to a median SRMSE of 0.05
8 demographic profiles
Family + dwelling structure preserved
Suburb-level granularity (15,343 suburbs)
Geographically coherent
Commercial license

Products

Self-serve. Pay as you go. Free tiers available.

Profile Scoring API

Send only the distinct, de-identified demographic combinations in any dataset (never raw records), and get the matching Australian profile for each. For segmentation, personas, and sizing.

1 credit per 100 combinations.
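
As a rough illustration of what "distinct combinations" means in practice, the sketch below strips a customer table down to its unique demographic tuples and estimates the credit cost. The field names (`age_band`, `sex`, `income_band`) are hypothetical, not the API's actual schema, and rounding up to a whole credit is an assumption.

```python
import math

# Hypothetical field names; the API's real schema is not shown on this page.
DEMOGRAPHIC_FIELDS = ("age_band", "sex", "income_band")

def distinct_combinations(records, fields=DEMOGRAPHIC_FIELDS):
    """Return the unique demographic tuples, dropping every other column."""
    seen = {tuple(r[f] for f in fields) for r in records}
    return sorted(seen)

def estimated_credits(n_combinations, per_credit=100):
    """1 credit per 100 combinations; rounding up is an assumption."""
    return math.ceil(n_combinations / per_credit)

records = [
    {"customer_id": 1, "age_band": "25-34", "sex": "F", "income_band": "high"},
    {"customer_id": 2, "age_band": "25-34", "sex": "F", "income_band": "high"},
    {"customer_id": 3, "age_band": "45-54", "sex": "M", "income_band": "low"},
]

combos = distinct_combinations(records)   # identifiers never leave your side
credits = estimated_credits(len(combos))
```

Only `combos` would be sent; the `customer_id` column and the row count per combination stay local.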

API docs →

Individual-level Data

Bulk individual-level synthetic Australian population records (up to 27.5M), calibrated to ABS Census 2021. For ML training, fine-tuning, and demos.

Free 5,000-row sample on Hugging Face.

Free sample →

Validator (web + library)

Test an ML model for bias against Australian demographics, in the browser wizard or the open-source Python library. Free tier uses the 5k sample.
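
To make "bias against demographics" concrete, here is a minimal sketch of one common fairness metric, the demographic parity difference (the gap in positive-prediction rates between groups). It does not use the validator library, whose API and metric set are not described on this page; the function names and toy data are illustrative only.

```python
from collections import defaultdict

def selection_rates(predictions, groups):
    """Share of positive (1) predictions within each demographic group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_difference(predictions, groups):
    """Largest gap in selection rate between any two groups (0 = parity)."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())

# Toy model outputs and a toy demographic attribute.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]

rates = selection_rates(preds, sex)
gap = demographic_parity_difference(preds, sex)
```

On eight rows a gap like this is mostly noise, which is why the sample size behind a fairness test matters.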

Free · MIT licensed.

Test your model →

Validator Pro

The full fairness audit against the entire national dataset (~27.5M records). Production-grade power; runs asynchronously in 5–15 minutes.

50 credits per run.

Read the docs →

Methodology, briefly

Generated from ABS Census 2021 conditional tables via Bayesian reconstruction with Gibbs sampling, preserving the real cross-tabs between age, sex, income, occupation, education, industry, household and dwelling.
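
The general idea of Gibbs sampling from conditional tables can be sketched with two variables. This is a toy illustration, not the production pipeline: the probabilities are made up (not Census figures), and the real method covers many more variables and a Bayesian reconstruction step not shown here.

```python
import random

# Toy conditional tables with made-up numbers (not Census figures).
P_INCOME_GIVEN_AGE = {
    "young": {"low": 0.8, "high": 0.2},
    "old": {"low": 0.4, "high": 0.6},
}
P_AGE_GIVEN_INCOME = {
    "low": {"young": 2 / 3, "old": 1 / 3},
    "high": {"young": 0.25, "old": 0.75},
}

def draw(dist, rng):
    """Sample one key from a {value: probability} table."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]

def gibbs_sample(n_records, burn_in=200, seed=0):
    """Alternate draws from each conditional; keep states after burn-in."""
    rng = random.Random(seed)
    age, income = "young", "low"  # arbitrary starting state
    records = []
    for step in range(burn_in + n_records):
        income = draw(P_INCOME_GIVEN_AGE[age], rng)
        age = draw(P_AGE_GIVEN_INCOME[income], rng)
        if step >= burn_in:
            records.append((age, income))
    return records

records = gibbs_sample(20_000)
share_young_low = sum(r == ("young", "low") for r in records) / len(records)
```

The two toy tables are consistent with a joint distribution in which 40% of people are young with low income, so `share_young_low` should land close to 0.4 even though the sampler never sees the joint table directly.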

Population marginals match real Census at 99.9%+; cross-tab fidelity sits at median person-level SRMSE 0.05 across 15,343 suburbs, adjusted to 2025–26. Read the full methodology →
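
For readers unfamiliar with SRMSE, the sketch below uses a definition common in population synthesis: the root mean squared error between synthetic and reference cell shares, divided by the mean reference share. The exact formula behind the 0.05 figure is not stated on this page, so treat this as an assumption; the counts are invented.

```python
import math

def srmse(synthetic_counts, census_counts):
    """Standardised RMSE over the cells of one cross-tab.

    Both arguments map a cell (e.g. a (sex, income) tuple) to a count.
    Counts are converted to shares so the two tables are comparable.
    """
    cells = sorted(set(synthetic_counts) | set(census_counts))
    syn_total = sum(synthetic_counts.values())
    cen_total = sum(census_counts.values())
    syn = [synthetic_counts.get(c, 0) / syn_total for c in cells]
    cen = [census_counts.get(c, 0) / cen_total for c in cells]
    rmse = math.sqrt(sum((s - t) ** 2 for s, t in zip(syn, cen)) / len(cells))
    mean_share = sum(cen) / len(cells)
    return rmse / mean_share

# Invented counts for a 2x2 sex-by-income cross-tab.
census = {("F", "low"): 30, ("F", "high"): 20, ("M", "low"): 25, ("M", "high"): 25}
synthetic = {("F", "low"): 31, ("F", "high"): 19, ("M", "low"): 25, ("M", "high"): 25}

score = srmse(synthetic, census)
```

A score of 0 means the synthetic cross-tab reproduces the reference shares exactly; larger values mean larger cell-level deviations.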

Pricing

Free tier (never expires): 5 credits per week, the free 5,000-row Hugging Face sample, and the open-source validator's free tier.

Paid credits start at $10. 1 credit = 500 synthetic records; bundles never expire. See all pricing →

FAQ

Is synthetic data really good enough for ML training?
For population-level features (income, age, occupation, education, family structure) calibrated to a real Census, yes. AUSynth preserves the joint distributions between variables, not just the marginals, so models learn the same correlations they would from the underlying real data. The catch: synthetic data can't substitute for your real customer signal. Use AUSynth for the demographic substrate, your own data for the business outcomes you're modelling.
How does this compare to differential privacy?
Differential privacy adds calibrated noise to query results from a real dataset. AUSynth doesn't query a real dataset at all. Every record is generated from public Census conditionals, so there's no per-individual privacy budget to spend.
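
For contrast, this is roughly what "calibrated noise on a query result" looks like: a minimal sketch of the textbook Laplace mechanism for a counting query. It illustrates differential privacy in general and is not part of AUSynth; the count and epsilon are arbitrary.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) noise, built as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with noise scaled to sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
# Each call would spend privacy budget against the same real dataset.
noisy = [private_count(1000, epsilon=0.5, rng=rng) for _ in range(10_000)]
mean_noisy = sum(noisy) / len(noisy)
```

Every released answer here depends on real individuals' rows, which is why a budget has to be tracked. Records generated from public conditional tables have no such per-query dependence.
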
Can I use this for commercial models?
Yes. The credit bundles include commercial use rights. Cite as: Verosynthea AUSynth v1.0 (2026). verosynthea.com.
Can I use AUSynth to fine-tune LLMs?
Yes. We provide a row-to-text workflow that converts AUSynth records into natural-language descriptions suitable for LoRA fine-tuning, RLHF datasets, and instruction tuning. See /use-cases/llm-fine-tuning for the full pattern.
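
A row-to-text conversion could look something like the sketch below. The field names, sentence template, and prompt wording are all hypothetical (the actual workflow is documented at the path above), and the example record is invented for illustration.

```python
import json

def record_to_text(record):
    """Render one synthetic record as a natural-language description."""
    return (
        f"A {record['age']}-year-old {record['sex']} {record['occupation']} "
        f"living in {record['suburb']}. Household type: {record['household']}."
    )

def to_instruction_example(record):
    """Wrap the description as one JSON line for instruction tuning."""
    return json.dumps({
        "prompt": "Describe this Australian resident.",
        "completion": record_to_text(record),
    })

# Invented record with hypothetical field names.
record = {
    "age": 34,
    "sex": "female",
    "occupation": "registered nurse",
    "suburb": "Parramatta",
    "household": "couple with children",
}

text = record_to_text(record)
jsonl_line = to_instruction_example(record)
```

Writing one such line per record gives a JSONL file that most fine-tuning toolchains can ingest.
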
What's the difference between the Validator free and Pro tiers?
The free tier (web wizard or open-source Python library) tests your model against a 5,000-row sample, enough to catch obvious bias. Pro runs the same fairness metrics against the full national dataset (~27.5M records) for compliance-grade audits; runs are asynchronous (5–15 min) and cost 50 credits each.

Ready to try it?

Start with the free Hugging Face sample (no signup needed), or test a model for bias right now.

Get the free 5k sample →
See the full data product →
