Generation method — leadforge-lead-scoring-v1

A standalone summary of how the dataset is generated, written for external readers. Read this before opening the bundle if you want to know what the data is and how much you can trust each piece of it; for the full architecture, see the Architecture spec.

What the dataset is

leadforge-lead-scoring-v1 is a synthetic mid-market B2B SaaS lead-scoring dataset generated by leadforge, an open-source Python framework. Every row, event, and edge is produced by code in this repository — there is no real CRM behind the data. The generator is deterministic given a fixed (recipe, configuration, seed, package version) tuple, and the recipe and seed are recorded in each bundle's manifest.json.

The published family contains three difficulty tiers — intro, intermediate, and advanced — sharing one fictional company narrative ("Veridian Procure", a procurement / AP automation SaaS). The tiers differ only in noise, missingness, and signal strength, modulated by a difficulty profile that the simulator consumes; the underlying causal structure is identical. A separate *_instructor companion ships the full hidden truth (causal graph, latent registry, mechanism summary, full-horizon relational tables).

Generation pipeline at a glance

Generation runs in five layers, top to bottom. Every layer is deterministic, every layer is seeded from a single root via named substreams, and every layer is testable in isolation.
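The "single root, named substreams" idea can be illustrated with a minimal sketch. It assumes a hash-based derivation; the function name substream and the stream names "world_graph" and "population" are hypothetical and are not taken from the leadforge codebase, which may derive its streams differently.

```python
import hashlib
import random


def substream(root_seed: int, name: str) -> random.Random:
    """Derive an independent, reproducible RNG for one named layer.

    Hypothetical helper: hashing the root seed together with the stream
    name means that adding or reordering layers does not shift the draws
    of any existing layer.
    """
    digest = hashlib.sha256(f"{root_seed}:{name}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


root_seed = 42
world_rng = substream(root_seed, "world_graph")      # illustrative stream name
population_rng = substream(root_seed, "population")  # illustrative stream name

# The same (seed, name) pair always reproduces the same first draw.
assert substream(42, "population").random() == population_rng.random()
```

Because each layer owns its own stream, a layer can be re-run or tested in isolation without replaying the random draws of the layers before it.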

  1. Hidden world structure. A directed acyclic graph (DAG) of latent traits, lead states, sales-process states, and the Converted within 90 days outcome node, sampled from one of five motif families and then perturbed by stochastic rewiring. The motif families are intentionally non-uniform: fit_dominant, intent_dominant, sales_execution_sensitive, demo_trial_mediated, buying_committee_friction. Two independently sampled bundles share neither the exact graph nor the edge weights, but both satisfy the same constraints: the graph is acyclic, every node is reachable from a root, and the outcome node is reachable from every non-root subgraph.
  2. Mechanism layer. Every node in the sampled graph receives a concrete mechanism — a logistic latent score, a Poisson intensity for touch counts, a recency-decayed engagement intensity for sessions, a categorical influence for source channel, a stage transition hazard, a conversion hazard, etc. Mechanisms are assigned by motif family, so a fit_dominant graph and an intent_dominant graph end up with materially different behavior at simulation time. Mechanism parameters are calibrated so each tier hits its target conversion-rate band; the intermediate tier is the canonical difficulty profile.
  3. Population layer. Accounts (1,500), contacts (4,200), and leads (5,000) are drawn with deterministic foreign keys and ID-stable namespaces (acct_000001, lead_000001, …). Each entity carries a vector of latent traits seeded from the world graph: account fit, process maturity, contact authority, problem awareness, urgency, etc. Industry, region, employee band, role, and seniority are all drawn from the recipe's narrative spec; firmographic correlations come from motif-family latent biases applied during sampling.
  4. Simulation engine. A 90-day discrete-time simulator advances every lead day-by-day from MQL through the funnel (mql → sal → sql → demo_scheduled → demo_completed → proposal_sent → negotiation → closed_won/closed_lost). Each day, hazards from the mechanism layer fire: stage transitions, touches (inbound vs outbound, recency-decayed), web sessions (pricing-page views, demo-page views), sales activities, churn, and direct conversion for unusual fast paths. Once a lead reaches closed_won, opportunities, customers, and subscriptions materialise with deterministic foreign keys. converted_within_90_days is event-derived: it is true iff a closed_won event occurred within the configured label window, never sampled directly.
  5. Snapshot rendering. For every lead, the renderer freezes a feature snapshot at snapshot_day (30 days for v1). Aggregates such as touch_count, session_count, pricing_page_views, expected_acv, and days_since_last_touch only see events on days [0, snapshot_day]; the label resolves over the full 90-day horizon. The deliberate exception is total_touches_all, which counts the full-horizon touch history and is flagged as a pedagogical leakage trap in the feature dictionary.
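The split between snapshot-windowed features and the full-horizon label in steps 4 and 5 can be sketched as follows. This is an illustration only: the Event class, the event kind strings, the use of integer day offsets, and the exact definition of days_since_last_touch are assumptions, not the renderer's actual implementation.

```python
from dataclasses import dataclass

SNAPSHOT_DAY = 30   # v1 snapshot day, as stated above
LABEL_WINDOW = 90   # label horizon in days, as stated above


@dataclass(frozen=True)
class Event:
    day: int    # days since the lead was created (assumed representation)
    kind: str   # hypothetical kinds: "touch", "closed_won"


def render_row(events: list[Event]) -> dict:
    # Snapshot features only see events on days [0, SNAPSHOT_DAY].
    visible = [e for e in events if 0 <= e.day <= SNAPSHOT_DAY]
    touch_days = [e.day for e in visible if e.kind == "touch"]
    return {
        "touch_count": len(touch_days),
        "days_since_last_touch": (
            SNAPSHOT_DAY - max(touch_days) if touch_days else None
        ),
        # Leakage trap: counts touches over the full horizon.
        "total_touches_all": sum(e.kind == "touch" for e in events),
        # Event-derived label: true iff a closed_won event falls in the window.
        "converted_within_90_days": any(
            e.kind == "closed_won" and e.day <= LABEL_WINDOW for e in events
        ),
    }


events = [
    Event(3, "touch"),
    Event(12, "touch"),
    Event(45, "touch"),       # after the snapshot: invisible to touch_count
    Event(61, "closed_won"),  # inside the label window
]
row = render_row(events)
print(row)
```

In this toy lead, touch_count is 2 while total_touches_all is 3; the extra touch happened after the snapshot, which is exactly the kind of post-snapshot information a model should not be allowed to use.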

Bundle output

Each bundle writes a fixed directory layout — a manifest, dataset card, feature dictionary, relational tables, and the converted_within_90_days task split. The manifest records the recipe, seed, package version, exposure mode, snapshot day, label window, schema version, table inventory with row counts, SHA-256 hashes for every file, and the exact set of redacted columns. Two runs with the same (recipe, seed, version) produce byte-identical bundles modulo the wall-clock generation_timestamp field; scripts/verify_hash_determinism.py enforces this.
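A determinism check of this kind can be sketched by comparing two manifests after dropping the wall-clock field. This is not the contents of scripts/verify_hash_determinism.py; the manifest keys other than generation_timestamp, the file name leads.csv, and the timestamp values are placeholders chosen for illustration.

```python
import hashlib


def sha256_bytes(payload: bytes) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(payload).hexdigest()


def comparable(manifest: dict) -> dict:
    """Drop the wall-clock field before comparing two manifests."""
    return {k: v for k, v in manifest.items() if k != "generation_timestamp"}


def same_bundle(manifest_a: dict, manifest_b: dict) -> bool:
    return comparable(manifest_a) == comparable(manifest_b)


# Placeholder table contents standing in for a generated file.
table = b"lead_id,touch_count\nlead_000001,2\n"

run_a = {
    "seed": 42,
    "files": {"leads.csv": sha256_bytes(table)},     # hypothetical layout
    "generation_timestamp": "2024-01-01T00:00:00Z",  # placeholder value
}
run_b = {
    "seed": 42,
    "files": {"leads.csv": sha256_bytes(table)},
    "generation_timestamp": "2024-01-02T09:30:00Z",  # differs, but is ignored
}

assert same_bundle(run_a, run_b)
```

Comparing per-file hashes rather than the files themselves keeps the check cheap and gives a precise pointer to whichever file diverged.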

The public (student_public) bundle and the instructor companion share the same generator run; they differ only in what is published. Filtering happens during rendering, not during simulation:

  • Public bundles route relational tables through to_dataframes_snapshot_safe, which (a) filters event tables per-lead by lead_created_at + snapshot_day, (b) drops terminal-state columns from leads and opportunities, and (c) omits customers and subscriptions entirely (their presence is conversion-conditional).
  • Instructor companions skip the snapshot-safe writer and ship full-horizon tables plus a metadata/ directory containing the hidden world graph, latent registry, mechanism summary, and full world spec. They are not appropriate input for the student-facing task.

The exact column lists are pinned by BANNED_LEAD_COLUMNS, BANNED_OPP_COLUMNS, BANNED_TABLES, and SNAPSHOT_FILTERED_TABLES in leadforge/validation/leakage_probes.py; the validator imports the same constants the writer uses, so the contract is single-sourced.
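The three snapshot-safe steps described above (per-lead event cutoff, dropped terminal-state columns, omitted tables) can be sketched in plain Python. This is not the to_dataframes_snapshot_safe implementation: the EXAMPLE_* constants, the column names final_stage, closed_at, and event_date, the touches table, and the inclusive cutoff are all assumptions made for illustration. The real lists live in leadforge/validation/leakage_probes.py.

```python
from datetime import date, timedelta

SNAPSHOT_DAY = 30

# Hypothetical stand-ins for the pinned constants.
EXAMPLE_BANNED_LEAD_COLUMNS = {"final_stage", "closed_at"}
EXAMPLE_BANNED_TABLES = {"customers", "subscriptions"}


def snapshot_safe(tables: dict[str, list[dict]]) -> dict[str, list[dict]]:
    # Each lead gets its own cutoff: lead_created_at + snapshot_day.
    cutoff = {
        lead["lead_id"]: lead["lead_created_at"] + timedelta(days=SNAPSHOT_DAY)
        for lead in tables["leads"]
    }
    public = {}
    for name, rows in tables.items():
        if name in EXAMPLE_BANNED_TABLES:
            continue  # conversion-conditional tables are omitted entirely
        if name == "leads":
            public[name] = [
                {k: v for k, v in row.items()
                 if k not in EXAMPLE_BANNED_LEAD_COLUMNS}
                for row in rows
            ]
        else:
            # For simplicity, every other table is treated as an event table.
            public[name] = [
                row for row in rows
                if row["event_date"] <= cutoff[row["lead_id"]]
            ]
    return public


tables = {
    "leads": [{"lead_id": "lead_000001",
               "lead_created_at": date(2024, 1, 1),
               "final_stage": "closed_won"}],
    "touches": [{"lead_id": "lead_000001", "event_date": date(2024, 1, 10)},
                {"lead_id": "lead_000001", "event_date": date(2024, 3, 1)}],
    "customers": [{"lead_id": "lead_000001"}],
}
public = snapshot_safe(tables)
print(sorted(public))  # table names that survive filtering
```

Defining the banned sets once and importing them from both the writer and the validator, as the text describes, means the two cannot drift apart silently.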

Calibration and validation

Difficulty calibration is empirical, not analytic: the intermediate tier is sampled, the conversion-rate band is checked, and the signal-strength multiplier is tuned until five seeds (42–46) hit the target band with stable variance. The intro and advanced tiers reuse the same mechanism assignments with different distortion parameters (Gaussian noise on float features, MCAR missingness, outlier injection) calibrated the same way.
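The empirical tuning loop can be illustrated with a toy example. Everything here is a stand-in: simulate_rate is a hypothetical one-line simulator in which the multiplier scales a latent score, TARGET_BAND is an invented band (the real bands are declared in the acceptance-gates file), and bisection is one possible search strategy rather than the project's documented procedure. Only the seed range 42 to 46 comes from the text.

```python
import math
import random

SEEDS = range(42, 47)        # seeds 42-46, as in the text
TARGET_BAND = (0.18, 0.22)   # hypothetical conversion-rate band


def simulate_rate(multiplier: float, seed: int, n_leads: int = 2000) -> float:
    """Toy simulator: conversion probability is a logistic of a latent score."""
    rng = random.Random(seed)
    converted = 0
    for _ in range(n_leads):
        latent = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(multiplier * latent - 2.0)))
        converted += rng.random() < p
    return converted / n_leads


def mean_rate(multiplier: float) -> float:
    return sum(simulate_rate(multiplier, s) for s in SEEDS) / len(SEEDS)


def calibrate(lo: float = 0.0, hi: float = 3.0, max_iter: int = 20):
    """Bisect on the multiplier until the cross-seed mean rate is in band."""
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        rate = mean_rate(mid)
        if TARGET_BAND[0] <= rate <= TARGET_BAND[1]:
            return mid, rate
        if rate < TARGET_BAND[0]:
            lo = mid   # rate too low: increase the multiplier
        else:
            hi = mid   # rate too high: decrease the multiplier
    raise RuntimeError("no multiplier in the search range hit the target band")


multiplier, rate = calibrate()
print(f"multiplier={multiplier:.3f} mean_rate={rate:.3f}")
```

This sketch checks only the mean across seeds; a stricter version in the spirit of "stable variance" would also bound the spread of the per-seed rates.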

Every claim made about realism, calibration, or difficulty is backed by release/validation/validation_report.md, which is regenerated by scripts/validate_release_candidate.py. The driver runs the full release-quality panel — per-tier ROC-AUC, PR-AUC, log loss, Brier, calibration bins, lift, P@K, top-decile rate, expected-ACV capture, model-family deltas, cross-seed bands, random-vs-cohort split degradation, and the full leakage probe taxonomy — and exits non-zero if anything falls outside the bands declared in docs/release/v1_acceptance_gates_bands.yaml.

What this is not

  • Not a substitute for real CRM data. The vertical, narrative, and motif families are deliberate fictions chosen to teach lead-scoring patterns without exposing real customer data.
  • Not a benchmark. The difficulty tiers are calibrated for pedagogy, not for cross-paper comparability.
  • Not a temporally rich dataset. The simulator runs in daily steps over a 90-day horizon. Sales-cycle distributions are whatever falls out of the daily hazards, not log-normal or Weibull tails. Demographic strings are clean (no messy free-text job titles). Both limitations are tracked as post-v1 scope in docs/release/post_v1_roadmap.md.

Further reading

For the deeper design rationale (why a DAG, why motif families, why event-derived labels, why public-vs-instructor), see the Design doc. The Design doc and Architecture spec target contributors and document the package internals; this doc stays at the conceptual level external readers need.