Use case — LLM fine-tuning

Australian-aware AI, without the privacy paperwork

Use AUSynth to generate training corpora for LoRA fine-tuning, RLHF datasets, and instruction tuning, adapting your LLM to Australian demographic reality.

Why fine-tune for Australian context

Most open LLMs default to US demographic patterns. Dollar amounts, occupation distributions, suburb naming, and family composition all skew American because US text dominates the pre-training corpus.

For Australian customer-facing AI (chatbots, financial advisors, customer support, government services), accurate AU demographic reasoning matters. A model that thinks "small business owner" means a US LLC structure, or that quotes US median income, gives answers that are wrong for an Australian user.

Standard fine-tuning needs realistic AU data, and that's where privacy and PII risk usually kill the project. AUSynth removes the PII problem by generating synthetic records from public ABS Census conditional distributions.

The row-to-text workflow

Each AUSynth row is structured demographic data. To feed it into an LLM training pipeline, you convert it to natural-language descriptions:

Row from AUSynth

{
  "age": 34,
  "occupation": "electrician",
  "suburb": "Penrith",
  "income": 82000,
  "family_composition": "couple_with_dependent_children",
  "tenure": "mortgage"
}

Natural-language description

A 34-year-old electrician living in Penrith earns around $82,000 per year, owns their home with a mortgage, and lives with their partner and dependent children.

Repeat across 5,000+ rows and you have a fine-tuning corpus grounded in real ABS Census 2021 statistics, with no real individuals at any step.
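
The conversion step above can be sketched as a small template function. This is a minimal illustration, not AUSynth's own tooling: the field names come from the sample row, while the phrase tables, the fallback wording, and the name row_to_text are assumptions made for this sketch. Only the two category codes shown in the sample row are mapped; any other code falls back to a generic phrase.

```python
# Illustrative row-to-text converter (an assumption, not part of AUSynth).
# Only the category codes that appear in the sample row are mapped here.
TENURE_PHRASES = {
    "mortgage": "owns their home with a mortgage",
}
FAMILY_PHRASES = {
    "couple_with_dependent_children": "lives with their partner and dependent children",
}


def indefinite_article(age: int) -> str:
    """Return 'An' for ages read with a leading vowel sound (8, 11, 18, 80-89)."""
    return "An" if age in (8, 11, 18) or 80 <= age <= 89 else "A"


def row_to_text(row: dict) -> str:
    """Render one structured row as a single natural-language sentence."""
    tenure = TENURE_PHRASES.get(
        row["tenure"], f"has housing tenure '{row['tenure']}'"
    )
    family = FAMILY_PHRASES.get(
        row["family_composition"],
        f"has family composition '{row['family_composition']}'",
    )
    return (
        f"{indefinite_article(row['age'])} {row['age']}-year-old {row['occupation']} "
        f"living in {row['suburb']} earns around ${int(row['income']):,} per year, "
        f"{tenure}, and {family}."
    )


sample_row = {
    "age": 34,
    "occupation": "electrician",
    "suburb": "Penrith",
    "income": 82000,
    "family_composition": "couple_with_dependent_children",
    "tenure": "mortgage",
}
print(row_to_text(sample_row))
```

In practice you would likely vary the sentence template across rows so the model does not overfit to a single phrasing.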

What you can do with it

  • LoRA fine-tuning. Adapter weights for Llama, Mistral, Qwen, etc. The AUSynth corpus shifts demographic defaults without the cost of a full fine-tune.
  • RLHF reward modelling. Generate AU-grounded preference pairs (which of these two descriptions is more plausibly Australian?).
  • Instruction tuning. AU-context Q&A pairs for customer-facing assistants.
  • Evaluation sets. Test whether your model knows AU demographics. AUSynth gives you ground-truth statistics to score against.
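
As a rough sketch of how the instruction-tuning, preference-pair, and evaluation uses above might look in code: the record layouts, prompt wording, and function names below are assumptions for illustration, not an AUSynth API. The caller supplies whatever perturbation makes the rejected description less plausible; the sketch does not decide that itself.

```python
from collections import defaultdict
from statistics import median


def describe(row: dict) -> str:
    """Short description of a row (simplified for this sketch)."""
    return (f"A {row['age']}-year-old {row['occupation']} living in "
            f"{row['suburb']} earns around ${int(row['income']):,} per year.")


def instruction_pair(row: dict) -> dict:
    """One prompt/response record for instruction tuning."""
    return {
        "prompt": (f"What is a plausible annual income for a {row['age']}-year-old "
                   f"{row['occupation']} living in {row['suburb']}?"),
        "response": f"Around ${int(row['income']):,} per year.",
    }


def preference_pair(row: dict, implausible_overrides: dict) -> dict:
    """Pair the row's description with a perturbed copy.

    `implausible_overrides` comes from the caller (for example a non-AU
    place name); it is a hypothetical parameter of this sketch.
    """
    rejected_row = {**row, **implausible_overrides}
    return {
        "prompt": "Which description is more plausibly Australian?",
        "chosen": describe(row),
        "rejected": describe(rejected_row),
    }


def median_income_by_occupation(rows: list[dict]) -> dict:
    """Ground-truth statistic to score a model's answers against."""
    incomes = defaultdict(list)
    for row in rows:
        incomes[row["occupation"]].append(row["income"])
    return {occupation: median(values) for occupation, values in incomes.items()}


def relative_error(predicted: float, truth: float) -> float:
    """Relative distance between a model's numeric answer and the truth (0.0 is exact)."""
    return abs(predicted - truth) / truth


sample_row = {"age": 34, "occupation": "electrician", "suburb": "Penrith", "income": 82000}
print(instruction_pair(sample_row))
print(median_income_by_occupation([sample_row]))
```

For evaluation, you would parse the model's numeric answer to a question such as "What is the median income of an electrician?" and pass it to relative_error alongside the value computed from the AUSynth rows.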

Example notebook

The Hugging Face dataset card includes a working LoRA fine-tuning notebook using PEFT + transformers on the 5k-row sample.

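import pandas as pd
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative converter (an assumption, not shipped with AUSynth).
# Field names follow the sample row shown earlier on this page.
def row_to_text(row: dict) -> str:
    return (f"A {row['age']}-year-old {row['occupation']} living in {row['suburb']} "
            f"earns around ${int(row['income']):,} per year.")

# 1. Load AUSynth sample + convert rows to natural language
df = pd.read_parquet("ausynth_sample_paddington_4064.parquet")
texts = [row_to_text(row) for row in df.to_dict("records")]

# 2. Configure LoRA adapter for your base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor for the adapter update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",                # tells PEFT to wrap the model for causal LM training
)
model = get_peft_model(model, config)

# 3. Standard HF Trainer loop on the AUSynth-derived corpus
# ... train, save adapter, ship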

Pricing

Free 5,000-row sample on Hugging Face. Full national dataset via standard credit bundles. No special licensing or enterprise contract required for commercial fine-tuning.

Honest caveats

AUSynth provides realistic demographic statistics, not behavioural data. It's well-suited to fine-tuning for AU demographic reasoning (who lives where, occupation-by-income distributions, family composition by age). It is not a substitute for behavioural or transactional data when those are the modelling target.

If your goal is "make this LLM sound Australian-aware," AUSynth is a strong fit. If your goal is "predict whether this specific customer will buy," you still need your own customer-behaviour data.
