Generation method — leadforge-lead-scoring-v1
A standalone summary of how the dataset is generated, written for external readers. Read this before opening the bundle if you want to know what the data is and how much you can trust each piece of it; for the full architecture, see the Architecture spec.
What the dataset is
leadforge-lead-scoring-v1 is a synthetic mid-market B2B SaaS
lead-scoring dataset generated by
leadforge, an
open-source Python framework. Every row, event, and edge is produced
by code in this repository — there is no real CRM behind the data.
The generator is deterministic given a fixed
(recipe, configuration, seed, package version) tuple, and the
recipe and seed are recorded in each bundle's manifest.json.
The published family contains three difficulty tiers — intro,
intermediate, and advanced — sharing one fictional company
narrative ("Veridian Procure", a procurement / AP automation SaaS).
The tiers differ only in noise, missingness, and signal strength,
modulated by a difficulty profile that the simulator consumes; the
underlying causal structure is identical. A separate
*_instructor companion ships the full hidden truth (causal graph,
latent registry, mechanism summary, full-horizon relational tables).
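One common way to get this kind of reproducibility is to derive every random stream from a single root seed plus a stream name. The sketch below illustrates that idea only; it is not leadforge's code, and the `substream` helper and the stream names are hypothetical.

```python
import hashlib
import random


def substream(root_seed: int, name: str) -> random.Random:
    """Return an RNG whose state depends only on (root_seed, name)."""
    digest = hashlib.sha256(f"{root_seed}:{name}".encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a 64-bit integer seed.
    return random.Random(int.from_bytes(digest[:8], "big"))


# Hypothetical stream names; the real generator may name them differently.
graph_rng = substream(42, "world_graph")
population_rng = substream(42, "population")

# Re-deriving a stream by name reproduces its draws exactly,
# regardless of how many draws other streams have made in between.
first_draw = graph_rng.random()
replayed_draw = substream(42, "world_graph").random()
```

Because each stream is keyed by name, adding draws to one layer does not shift the random numbers seen by any other layer, which is what makes layers testable in isolation.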
Generation pipeline at a glance
Generation runs in five layers, top to bottom. Every layer is deterministic, every layer is seeded from a single root via named substreams, and every layer is testable in isolation.
- Hidden world structure. A directed acyclic graph (DAG) of
latent traits, lead states, sales-process states, and the
Converted within 90 days outcome node, sampled from one of five motif families and then perturbed by stochastic rewiring. The motif families are intentionally non-uniform: fit_dominant, intent_dominant, sales_execution_sensitive, demo_trial_mediated, buying_committee_friction. Two independently sampled bundles share neither the exact graph nor the edge weights, but they share the constraints that the graph is acyclic, every node is reachable from a root, and the outcome node is reachable from every non-root subgraph.
- Mechanism layer. Every node in the sampled graph receives a
concrete mechanism — a logistic latent score, a Poisson intensity
for touch counts, a recency-decayed engagement intensity for
sessions, a categorical influence for source channel, a stage
transition hazard, a conversion hazard, etc. Mechanisms are
assigned by motif family, so a
fit_dominant graph and an intent_dominant graph end up with materially different behavior at simulation time. Mechanism parameters are calibrated so each tier hits its target conversion-rate band; the intermediate tier is the canonical difficulty profile.
- Population layer. Accounts (1,500), contacts (4,200), and
leads (5,000) are drawn with deterministic foreign keys and
ID-stable namespaces (acct_000001, lead_000001, …). Each entity carries a vector of latent traits seeded from the world graph: account fit, process maturity, contact authority, problem awareness, urgency, etc. Industry, region, employee band, role, and seniority are all drawn from the recipe's narrative spec; firmographic correlations come from motif-family latent biases applied during sampling.
- Simulation engine. A 90-day discrete-time simulator
advances every lead day-by-day from MQL through the funnel
(mql → sal → sql → demo_scheduled → demo_completed → proposal_sent → negotiation → closed_won/closed_lost). Each day, hazards from the mechanism layer fire: stage transitions, touches (inbound vs outbound, recency-decayed), web sessions (pricing-page views, demo-page views), sales activities, churn, and direct conversion for unusual fast paths. Once a lead reaches closed_won, opportunities, customers, and subscriptions materialise with deterministic foreign keys. converted_within_90_days is event-derived: it is true iff a closed_won event occurred within the configured label window, never sampled directly.
- Snapshot rendering. For every lead, the renderer freezes a
feature snapshot at
snapshot_day (30 days for v1). Aggregates such as touch_count, session_count, pricing_page_views, expected_acv, and days_since_last_touch only see events on days [0, snapshot_day]; the label resolves over the full 90-day horizon. The deliberate exception is total_touches_all, which counts the full-horizon touch history and is flagged as a pedagogical leakage trap in the feature dictionary.
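The snapshot and label rules in the last two layers can be made concrete with a small sketch. It assumes a simplified event record (day offset plus kind); the `Event` class, the `render_snapshot` helper, the inclusive window bounds, and the definition of `days_since_last_touch` are illustrative assumptions, not the renderer's actual code.

```python
from dataclasses import dataclass

SNAPSHOT_DAY = 30   # v1 snapshot day, as stated above
LABEL_WINDOW = 90   # label horizon in days


@dataclass(frozen=True)
class Event:
    day: int    # days since the lead was created
    kind: str   # illustrative kinds: "touch", "closed_won"


def render_snapshot(events: list[Event]) -> dict:
    # Snapshot features only see events on days [0, SNAPSHOT_DAY].
    visible = [e for e in events if 0 <= e.day <= SNAPSHOT_DAY]
    touch_days = [e.day for e in visible if e.kind == "touch"]
    return {
        "touch_count": len(touch_days),
        "days_since_last_touch": (
            SNAPSHOT_DAY - max(touch_days) if touch_days else None
        ),
        # Leakage trap: counts touches over the full horizon.
        "total_touches_all": sum(e.kind == "touch" for e in events),
        # The label is derived from events, never sampled directly.
        "converted_within_90_days": any(
            e.kind == "closed_won" and e.day <= LABEL_WINDOW for e in events
        ),
    }


events = [
    Event(3, "touch"),
    Event(12, "touch"),
    Event(41, "touch"),       # after the snapshot, invisible to touch_count
    Event(57, "closed_won"),  # inside the label window
]
row = render_snapshot(events)
```

In this example `touch_count` is 2 while `total_touches_all` is 3: the extra touch happened after the snapshot, so a model that uses the full-horizon count is peeking at the future.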
Bundle output
Each bundle writes a fixed directory layout — a manifest, dataset
card, feature dictionary, relational tables, and the
converted_within_90_days task split. The manifest records the
recipe, seed, package version, exposure mode, snapshot day, label
window, schema version, table inventory with row counts, SHA-256
hashes for every file, and the exact set of redacted columns. Two
runs with the same (recipe, seed, version) produce byte-identical
bundles modulo the wall-clock generation_timestamp field;
scripts/verify_hash_determinism.py enforces this.
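A determinism check of this kind can be sketched as follows. This is not the contents of scripts/verify_hash_determinism.py; the helper names are hypothetical, and the sketch assumes that generation_timestamp is a top-level key of manifest.json.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large tables do not load into memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def bundle_fingerprint(
    bundle_dir: Path, volatile: tuple[str, ...] = ("generation_timestamp",)
) -> dict[str, str]:
    """Map each relative file path to a hash, ignoring volatile manifest keys."""
    fingerprint = {}
    for path in sorted(p for p in bundle_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(bundle_dir).as_posix()
        if rel == "manifest.json":
            manifest = json.loads(path.read_text(encoding="utf-8"))
            for key in volatile:
                manifest.pop(key, None)  # assumed to be a top-level key
            canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
            fingerprint[rel] = hashlib.sha256(canonical).hexdigest()
        else:
            fingerprint[rel] = file_sha256(path)
    return fingerprint


def bundles_match(a: Path, b: Path) -> bool:
    """True if both bundles hold the same files with the same content."""
    return bundle_fingerprint(a) == bundle_fingerprint(b)
```

Calling `bundles_match` on two output directories generated from the same (recipe, seed, version) should return True if the generator is deterministic in the sense described above.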
The public (student_public) bundle and the instructor companion
share the same generator run; they differ only in what is
published. Filtering happens during rendering, not during
simulation:
- Public bundles route relational tables through
to_dataframes_snapshot_safe, which (a) filters event tables per lead by lead_created_at + snapshot_day, (b) drops terminal-state columns from leads and opportunities, and (c) omits customers and subscriptions entirely (their presence is conversion-conditional).
- Instructor companions skip the snapshot-safe writer and ship
full-horizon tables plus a
metadata/ directory containing the hidden world graph, latent registry, mechanism summary, and full world spec. They are not appropriate input for the student-facing task.
The exact column lists are pinned by BANNED_LEAD_COLUMNS,
BANNED_OPP_COLUMNS, BANNED_TABLES, and
SNAPSHOT_FILTERED_TABLES in
leadforge/validation/leakage_probes.py; the validator imports the
same constants the writer uses, so the contract is single-sourced.
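The three public-bundle rules (filter events per lead, drop terminal-state columns, omit conversion-conditional tables) can be sketched on plain Python rows. This is not to_dataframes_snapshot_safe itself: the `snapshot_safe` function, the `lead_id` and `timestamp` column names, the `touches` table, and the constant values are illustrative stand-ins, and the real lists live in leakage_probes.py.

```python
from datetime import datetime, timedelta

SNAPSHOT_DAY = 30
# Illustrative stand-ins only; not the real constant values.
BANNED_LEAD_COLUMNS = {"final_stage"}          # hypothetical column name
BANNED_TABLES = {"customers", "subscriptions"}
SNAPSHOT_FILTERED_TABLES = {"touches"}         # hypothetical table name


def snapshot_safe(tables: dict[str, list[dict]]) -> dict[str, list[dict]]:
    # Per-lead cutoff: lead_created_at + snapshot_day.
    cutoff = {
        lead["lead_id"]: lead["lead_created_at"] + timedelta(days=SNAPSHOT_DAY)
        for lead in tables["leads"]
    }
    out = {}
    for name, rows in tables.items():
        if name in BANNED_TABLES:
            continue  # (c) omit conversion-conditional tables entirely
        if name == "leads":
            # (b) drop terminal-state columns
            rows = [
                {k: v for k, v in r.items() if k not in BANNED_LEAD_COLUMNS}
                for r in rows
            ]
        elif name in SNAPSHOT_FILTERED_TABLES:
            # (a) keep only events at or before each lead's own cutoff
            rows = [r for r in rows if r["timestamp"] <= cutoff[r["lead_id"]]]
        out[name] = rows
    return out
```

Keeping the banned lists as module-level constants mirrors the single-sourcing idea: a validator can import the same sets and assert that nothing in them appears in the written bundle.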
Calibration and validation
Difficulty calibration is empirical, not analytic: the intermediate tier is sampled, the conversion-rate band is checked, and the signal-strength multiplier is tuned until five seeds (42–46) hit the target band with stable variance. The intro and advanced tiers reuse the same mechanism assignments with different distortion parameters (Gaussian noise on float features, MCAR missingness, outlier injection) calibrated the same way.
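The tune-and-check loop can be pictured as a bisection over the multiplier. Everything in the sketch below is a toy under stated assumptions: `simulate_rate` stands in for a full simulator run, the target band is invented for illustration, and the variance-stability check mentioned above is not modeled.

```python
import random

SEEDS = (42, 43, 44, 45, 46)
TARGET_BAND = (0.08, 0.12)  # illustrative band, not the published one


def simulate_rate(multiplier: float, seed: int, n_leads: int = 5000) -> float:
    """Toy stand-in for one simulator run; returns the conversion rate."""
    rng = random.Random(seed)
    converted = 0
    for _ in range(n_leads):
        base = 0.05 + 0.10 * rng.random()  # per-lead baseline propensity
        p = min(1.0, multiplier * base)
        converted += rng.random() < p
    return converted / n_leads


def calibrate(lo: float = 0.1, hi: float = 3.0, max_iter: int = 30):
    """Bisect the multiplier until every seed lands inside the band."""
    low_band, high_band = TARGET_BAND
    centre = (low_band + high_band) / 2
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        rates = [simulate_rate(mid, s) for s in SEEDS]
        if all(low_band <= r <= high_band for r in rates):
            return mid, rates
        # Rates grow with the multiplier, so move toward the band centre.
        if sum(rates) / len(rates) < centre:
            lo = mid
        else:
            hi = mid
    raise RuntimeError("no multiplier puts every seed inside the band")


multiplier, rates = calibrate()
```

Bisection works here because, for a fixed seed, the toy conversion rate never decreases as the multiplier grows; a real simulator would need that monotonicity checked rather than assumed.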
Every claim made about realism, calibration, or difficulty is
backed by release/validation/validation_report.md, which is
regenerated by scripts/validate_release_candidate.py. The driver
runs the full release-quality panel — per-tier ROC-AUC, PR-AUC, log
loss, Brier, calibration bins, lift, P@K, top-decile rate,
expected-ACV capture, model-family deltas, cross-seed bands,
random-vs-cohort split degradation, and the full leakage probe
taxonomy — and exits non-zero if anything falls outside the bands
declared in docs/release/v1_acceptance_gates_bands.yaml.
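The gate logic amounts to comparing each measured metric against a declared band and failing the run if any falls outside. A minimal sketch, assuming bands are simple (low, high) pairs; the metric names, band values, and the `failed_gates` helper are hypothetical and do not reflect the actual YAML schema.

```python
def failed_gates(
    metrics: dict[str, float], bands: dict[str, tuple[float, float]]
) -> list[str]:
    """Return one message per metric that is missing or outside its band."""
    failures = []
    for name, (low, high) in bands.items():
        value = metrics.get(name)
        if value is None or not (low <= value <= high):
            failures.append(f"{name}={value} outside [{low}, {high}]")
    return failures


# Hypothetical bands and measurements, for illustration only.
bands = {"roc_auc": (0.70, 0.85), "brier": (0.00, 0.20)}
metrics = {"roc_auc": 0.78, "brier": 0.23}

failures = failed_gates(metrics, bands)
exit_code = 1 if failures else 0  # a driver script would pass this to sys.exit
```

Treating a missing metric as a failure keeps a silently skipped probe from passing the release gate.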
What this is not
- Not a substitute for real CRM data. The vertical, narrative, and motif families are deliberate fictions chosen to teach lead-scoring patterns without exposing real customer data.
- Not a benchmark. The difficulty tiers are calibrated for pedagogy, not for cross-paper comparability.
- Not a temporally rich dataset. The simulator runs in
daily steps over a 90-day horizon. Sales-cycle distributions
are whatever falls out of the daily hazards, not log-normal /
Weibull tails. Demographic strings are clean (no
free-text-job-title messiness). Both are tracked as post-v1
scope in
docs/release/post_v1_roadmap.md.
Further reading
For the deeper design rationale (why a DAG, why motif families, why event-derived labels, why public-vs-instructor), see the Design doc and the Architecture spec. Both target contributors and document the package internals; this doc stays at the conceptual level external readers need.