leadforge

Narrative-grounded synthetic CRM datasets generated from simulated commercial worlds, for teaching, benchmarks, and research.

Install: pip install leadforge

Why leadforge?

Public lead-scoring datasets are too small, too overused, or too shallow to sustain serious teaching or research. leadforge generates datasets that feel like they came from a real CRM.

Simulated commercial worlds

Data isn't sampled from a distribution; it emerges from a simulated company, product, buyers, and go-to-market motion, making every dataset narratively consistent.

Variable hidden DGP

Five motif families (fit-dominant, intent-dominant, sales-execution-sensitive, demo/trial-mediated, buying-committee-friction) are stochastically rewired so no two datasets share the same causal structure.

Three difficulty tiers

Intro, intermediate, and advanced: calibrated by signal-to-noise ratio and conversion rate so you can benchmark a novice project, a serious model, or a stress test in the same framework.

Full truth for instructors

The instructor companion ships the hidden causal graph, latent registry, mechanism summary, and full-horizon relational tables: everything redacted from the student view.

9-table relational output

Accounts, contacts, leads, touches, sessions, sales activities, opportunities, customers, and subscriptions, all with deterministic IDs and FK integrity.
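
As a rough illustration of what FK integrity means for a relational bundle, the sketch below checks that every child row points at an existing parent row. The miniature tables, the ID format, and the column names (`account_id`, `contact_id`) are hypothetical stand-ins, not leadforge's documented schema.

```python
def orphaned_keys(child_rows, fk_column, parent_rows, pk_column):
    """Return the sorted foreign-key values in child_rows with no matching parent row."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return sorted({row[fk_column] for row in child_rows} - parent_keys)


# Hypothetical miniature tables; a real bundle would be loaded from disk.
accounts = [{"account_id": "acct_0001"}, {"account_id": "acct_0002"}]
contacts = [
    {"contact_id": "cont_0001", "account_id": "acct_0001"},
    {"contact_id": "cont_0002", "account_id": "acct_0002"},
    {"contact_id": "cont_0003", "account_id": "acct_0002"},
]

missing = orphaned_keys(contacts, "account_id", accounts, "account_id")
print(missing)  # an empty list means every contact references an existing account
```

The same helper can be pointed at any parent/child pair, such as leads against contacts, once the real key columns are known.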

Leakage-free by construction

Every public feature is snapshot-safe: no post-anchor events, no terminal-stage columns, no conversion-conditional tables. The redaction contract is code, not convention.
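
One way to read "snapshot-safe" is that no feature may be derived from an event timestamped after a lead's anchor (snapshot) time. The following is a minimal sketch of that filter, assuming hypothetical field names (`lead_id`, `occurred_at`, `kind`) and a per-lead anchor mapping. It is not leadforge's actual redaction code.

```python
from datetime import datetime


def pre_anchor_events(events, anchors):
    """Keep only events that occurred at or before their lead's anchor time."""
    return [e for e in events if e["occurred_at"] <= anchors[e["lead_id"]]]


# Hypothetical anchor (snapshot) time per lead.
anchors = {"lead_001": datetime(2024, 3, 1)}

events = [
    {"lead_id": "lead_001", "kind": "web_session", "occurred_at": datetime(2024, 2, 20)},
    {"lead_id": "lead_001", "kind": "demo_booked", "occurred_at": datetime(2024, 3, 15)},
]

safe = pre_anchor_events(events, anchors)
print([e["kind"] for e in safe])  # the post-anchor demo booking is dropped
```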

CLI

# Generate a full bundle
leadforge generate \
  --recipe b2b_saas_procurement_v1 \
  --seed 42 --mode student_public \
  --difficulty intermediate \
  --n-leads 5000 --out ./out/bundle

# Inspect & validate
leadforge inspect ./out/bundle
leadforge validate ./out/bundle

Python API

from leadforge.api import Generator

gen = Generator.from_recipe(
    "b2b_saas_procurement_v1",
    seed=42,
    exposure_mode="student_public",
)
bundle = gen.generate(
    n_leads=5000,
    difficulty="intermediate",
)
bundle.save("./out/bundle")

Three difficulty tiers, one dataset family

All tiers share the same fictional company and causal structure. Only signal strength, noise, and missingness differ.

intro: Strong signal, low noise. Good for first-time learners and sanity-checking pipelines. AUC ≈ 0.89 · ~28% conversion
intermediate: Realistic noise, moderate signal. The canonical benchmark tier. AUC ≈ 0.79 · ~18% conversion
advanced: High noise, weak signal, rare positive class. Challenges experienced practitioners. AUC ≈ 0.68 · ~8% conversion
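
For readers comparing their own models against the AUC figures quoted per tier, here is a small pure-Python sketch of ROC AUC as the probability that a randomly chosen converted lead outscores a randomly chosen non-converted one. The labels and scores are toy values. The page does not say which model or evaluation code produced the quoted numbers, so this is only an illustration of the metric.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = converted) and model scores, purely illustrative.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 5 of 6 positive/negative pairs are ranked correctly
```

The pairwise loop is quadratic, which is fine for a sanity check. For 5,000 leads a rank-based implementation from an established library is the more practical choice.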

Each tier ships 5,000 leads · 70 / 15 / 15 train/valid/test Parquet splits · 9-table relational bundle
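
To make the split arithmetic concrete: 70 / 15 / 15 of 5,000 leads works out to 3,500 / 750 / 750 rows. The sketch below reproduces those sizes with a seeded shuffle. The ID format and the plain random split are assumptions, since the page does not state how leadforge assigns rows to splits (random, stratified, or time-based).

```python
import random


def split_ids(ids, seed, train_frac=0.70, valid_frac=0.15):
    """Shuffle ids with a fixed seed and cut them into train/valid/test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


lead_ids = [f"lead_{i:05d}" for i in range(5000)]  # hypothetical ID format
train, valid, test = split_ids(lead_ids, seed=42)
print(len(train), len(valid), len(test))  # 3500 750 750
```

Because the shuffle uses its own seeded `random.Random` instance, repeating the call with the same seed yields the same partition.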

Ready to use it?

Download the v1 dataset on HuggingFace or Kaggle, or generate your own with the Python package.