leadforge

Narrative-grounded synthetic CRM datasets generated from simulated commercial worlds — for teaching, benchmarks, and research.


Install

pip install leadforge

Why leadforge?

Public lead-scoring datasets are too small, too overused, or too shallow to sustain serious teaching or research. leadforge generates datasets that feel like they came from a real CRM.

🌐

Simulated commercial worlds

Data isn't sampled from a distribution — it emerges from a simulated company, product, buyers, and go-to-market motion, making every dataset narratively consistent.

🔀

Variable hidden DGP

Five motif families (fit-dominant, intent-dominant, sales-execution-sensitive, demo/trial-mediated, buying-committee-friction) are stochastically rewired so no two datasets share the same causal structure.

📐

Three difficulty tiers

Intro, intermediate, and advanced — calibrated by signal-to-noise ratio and conversion rate so you can benchmark a novice project, a serious model, or a stress-test in the same framework.

🔍

Full truth for instructors

The instructor companion ships the hidden causal graph, latent registry, mechanism summary, and full-horizon relational tables — everything redacted from the student view.

🔗

9-table relational output

Accounts, contacts, leads, touches, sessions, sales activities, opportunities, customers, and subscriptions — all with deterministic IDs and FK integrity.
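To make "deterministic IDs and FK integrity" concrete, here is a minimal sketch of the idea. The ID scheme (`make_id`), the column names, and the tiny in-memory tables are hypothetical illustrations, not leadforge's actual schema or implementation.

```python
import hashlib

def make_id(prefix: str, seed: int, index: int) -> str:
    """Hypothetical deterministic ID: the same (prefix, seed, index) always gives the same ID."""
    digest = hashlib.sha256(f"{prefix}:{seed}:{index}".encode()).hexdigest()
    return f"{prefix}_{digest[:12]}"

def orphaned_rows(child_rows, fk_column, parent_rows, pk_column):
    """Return child rows whose foreign key matches no parent primary key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_keys]

seed = 42
# Illustrative two-table slice: each lead points at one account.
accounts = [{"account_id": make_id("acct", seed, i)} for i in range(3)]
leads = [
    {"lead_id": make_id("lead", seed, i), "account_id": accounts[i % 3]["account_id"]}
    for i in range(6)
]

# An empty list means every lead references an existing account.
print(orphaned_rows(leads, "account_id", accounts, "account_id"))
# Regenerating an ID with the same seed reproduces it exactly.
print(make_id("lead", seed, 0) == leads[0]["lead_id"])
```

The same check can be repeated for each parent/child pair in a bundle once you know its real key columns.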

🔒

Leakage-free by construction

Every public feature is snapshot-safe: no post-anchor events, no terminal-stage columns, no conversion-conditional tables. The redaction contract is code, not convention.
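As a rough sketch of what snapshot safety means, the function below keeps only events up to an anchor date and strips columns that would reveal the outcome. The event fields, the `TERMINAL_COLUMNS` set, and the "at or before the anchor" rule are assumptions made for illustration; leadforge's actual redaction logic may differ.

```python
from datetime import datetime

# Hypothetical names for terminal-stage fields that would reveal the outcome.
TERMINAL_COLUMNS = {"closed_won_at", "final_stage"}

def snapshot_view(events, anchor):
    """Keep events at or before the anchor and drop terminal-stage columns."""
    safe = []
    for event in events:
        if event["timestamp"] > anchor:
            continue  # post-anchor event: it would leak the future
        safe.append({k: v for k, v in event.items() if k not in TERMINAL_COLUMNS})
    return safe

anchor = datetime(2024, 3, 1)
events = [
    {"lead_id": "L1", "timestamp": datetime(2024, 2, 10), "kind": "web_session"},
    {"lead_id": "L1", "timestamp": datetime(2024, 2, 20), "kind": "demo", "final_stage": "won"},
    {"lead_id": "L1", "timestamp": datetime(2024, 3, 15), "kind": "contract_signed"},
]

public = snapshot_view(events, anchor)
print([e["kind"] for e in public])  # ['web_session', 'demo']
```

Encoding the rule as a function rather than a guideline is what lets a validator enforce it on every feature.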

CLI

# Generate a full bundle
leadforge generate \
  --recipe b2b_saas_procurement_v1 \
  --seed 42 --mode student_public \
  --difficulty intermediate \
  --n-leads 5000 --out ./out/bundle

# Inspect & validate
leadforge inspect ./out/bundle
leadforge validate ./out/bundle

Python API

from leadforge.api import Generator

gen = Generator.from_recipe(
    "b2b_saas_procurement_v1",
    seed=42,
    exposure_mode="student_public",
)
bundle = gen.generate(
    n_leads=5000,
    difficulty="intermediate",
)
bundle.save("./out/bundle")

Three difficulty tiers, one dataset family

All tiers share the same fictional company and causal structure. Only signal strength, noise, and missingness differ.
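The toy simulation below illustrates why changing only signal strength and noise moves a model's achievable AUC while the underlying latent variable stays the same. The tier parameters and the logistic outcome model are invented for this sketch; they are not leadforge's calibration and will not reproduce its published numbers.

```python
import math
import random

def simulate(signal, noise, bias, n=5000, seed=42):
    """Draw a latent score, a noisy observed feature, and a binary outcome."""
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n):
        latent = rng.gauss(0, 1)
        p_convert = 1 / (1 + math.exp(-(signal * latent + bias)))
        labels.append(1 if rng.random() < p_convert else 0)
        features.append(latent + noise * rng.gauss(0, 1))  # what a model would see
    return features, labels

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical tier settings: weaker signal, more noise, rarer positives.
TIERS = {
    "intro": dict(signal=3.0, noise=0.3, bias=-1.5),
    "intermediate": dict(signal=1.5, noise=0.8, bias=-2.0),
    "advanced": dict(signal=0.8, noise=1.5, bias=-3.0),
}

results = {}
for name, params in TIERS.items():
    features, labels = simulate(**params)
    results[name] = auc(features, labels)
    print(f"{name}: AUC={results[name]:.2f}, conversion={sum(labels) / len(labels):.1%}")
```

In this sketch the observed feature gets less informative from tier to tier, so the printed AUC should fall as the settings get harder.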

intro: Strong signal, low noise. Good for first-time learners and sanity-checking pipelines. AUC ≈ 0.89 · ~28% conversion

intermediate: Realistic noise, moderate signal. The canonical benchmark tier. AUC ≈ 0.79 · ~18% conversion

advanced: High noise, weak signal, rare positive class. Challenges experienced practitioners. AUC ≈ 0.68 · ~8% conversion

Each tier ships 5,000 leads · 70 / 15 / 15 train/valid/test Parquet splits · 9-table relational bundle
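For a sense of the split sizes, this is a minimal sketch of a seeded 70 / 15 / 15 partition over 5,000 lead indices. The `split_ids` helper is hypothetical and uses a plain shuffled split; the shipped bundles may assign leads to splits differently.

```python
import random

def split_ids(ids, seed=42, fractions=(0.70, 0.15, 0.15)):
    """Shuffle with a fixed seed, then cut into train/valid/test by fraction."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # remainder goes to test
    )

train, valid, test = split_ids(range(5000))
print(len(train), len(valid), len(test))  # 3500 750 750
```

Fixing the seed keeps the partition reproducible, which matters when several people benchmark against the same test split.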

Ready to use it?

Download the v1 dataset on HuggingFace or Kaggle, or generate your own with the Python package.
