How leadforge works

leadforge generates datasets by simulating a commercial world, not by sampling rows from a distribution. This distinction matters:

A distribution-sampler can reproduce the statistical shape of a CRM dataset.
A world-simulator produces rows that have reasons — leads convert because they fit the ICP, have high urgency, and were engaged by a persistent SDR; leads don't convert because they stalled in technical review, or because the champion left the company.

That structure is what makes the data useful for teaching: there is something real to find, and it can be found with the right feature engineering and model choices.

The generation pipeline

Generation runs in five sequential layers, each deterministic given the same seed:

1. Hidden world structure   ← sample motif family, rewire DAG
         ↓
2. Mechanism layer          ← assign mechanisms to every node
         ↓
3. Population layer         ← create accounts, contacts, leads with latent traits
         ↓
4. Simulation               ← run 90-day daily event loop
         ↓
5. Rendering                ← snapshot-safe feature extraction + relational export
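
One common way to keep sequential layers deterministic is to derive a separate random stream for each layer from a single master seed. The sketch below illustrates that idea only; the text does not describe leadforge's actual seeding scheme, and the names `layer_rng` and `run_pipeline` are hypothetical.

```python
import hashlib
import random

# Layer names mirror the five stages above; the identifiers are illustrative.
LAYERS = ["world_structure", "mechanisms", "population", "simulation", "rendering"]


def layer_rng(seed: int, layer: str) -> random.Random:
    """Derive a reproducible RNG for one layer from the master seed."""
    digest = hashlib.sha256(f"{seed}:{layer}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def run_pipeline(seed: int) -> dict:
    """Run the layers in order; each one draws only from its own stream."""
    # A real layer would build graph, population, or event data here.
    # A single draw per layer is enough to show the determinism property.
    return {layer: layer_rng(seed, layer).random() for layer in LAYERS}
```

With per-layer streams, changing how many random draws one layer makes does not shift the draws seen by later layers, which is one reason this pattern is often preferred over a single shared generator.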

1. Hidden world structure

A directed acyclic graph (DAG) of latent traits, pipeline states, and the conversion outcome is sampled from one of five motif families and then stochastically rewired. The motif families are:

Family	What drives conversion
`fit_dominant`	Account/ICP fit is the primary signal
`intent_dominant`	Buying intent signals (sessions, demo requests) dominate
`sales_execution_sensitive`	SDR and AE behaviour is the strongest lever
`demo_trial_mediated`	Conversion is gated on a demo or trial event
`buying_committee_friction`	Multi-stakeholder dynamics create the main noise
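
As a rough illustration of "sample a motif family, then rewire", the sketch below picks one of the five families and randomly toggles edges of a small template graph. The node names, the chain-shaped template, and the `rewire_prob` parameter are assumptions made for this example, not leadforge internals. Acyclicity is preserved by only ever allowing edges that point forward in a fixed node ordering.

```python
import random

MOTIF_FAMILIES = [
    "fit_dominant",
    "intent_dominant",
    "sales_execution_sensitive",
    "demo_trial_mediated",
    "buying_committee_friction",
]

# Hypothetical nodes, listed in a fixed topological order.
NODES = ["icp_fit", "intent", "engagement", "demo_held", "converted"]


def sample_world(seed: int, rewire_prob: float = 0.2):
    """Pick a motif family and stochastically rewire a template DAG."""
    rng = random.Random(seed)
    family = rng.choice(MOTIF_FAMILIES)
    # Template: a simple chain. A real template would depend on the family.
    edges = {(NODES[i], NODES[i + 1]) for i in range(len(NODES) - 1)}
    # Toggle forward edges only, so the result can never contain a cycle.
    for i in range(len(NODES)):
        for j in range(i + 1, len(NODES)):
            if rng.random() < rewire_prob:
                edges ^= {(NODES[i], NODES[j])}
    return family, sorted(edges)
```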

2. Mechanism layer

Every node in the sampled graph gets a concrete mechanism — a logistic latent score, Poisson intensity, recency-decayed engagement intensity, categorical channel influence, stage transition hazard, or conversion hazard. Parameters are calibrated per difficulty tier.
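
To make three of these mechanism types concrete, here is a minimal sketch of a logistic latent score, a recency-decayed intensity, and a Poisson count sampler. The function names, the half-life parameterisation, and the default values are assumptions for illustration; the calibrated per-tier parameters are not given in this section.

```python
import math
import random


def logistic_score(traits: dict, weights: dict, bias: float = 0.0) -> float:
    """Squash a weighted sum of latent traits into a score in (0, 1)."""
    z = bias + sum(weights[name] * traits[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


def decayed_intensity(base_rate: float, days_since_touch: float,
                      half_life_days: float = 7.0) -> float:
    """Engagement rate that halves every `half_life_days` without contact."""
    return base_rate * 0.5 ** (days_since_touch / half_life_days)


def sample_poisson(rng: random.Random, lam: float) -> int:
    """Draw a Poisson count with mean `lam` using Knuth's algorithm."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count
```

For example, a lead's daily session count could be drawn as `sample_poisson(rng, decayed_intensity(base_rate, days_since_touch))`, so that engagement fades when outreach stops.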

3. Population layer

Accounts (1,500), contacts (4,200), and leads (5,000) are instantiated with deterministic IDs (acct_000001, lead_000001) and latent trait vectors drawn from the world graph.
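
The ID examples above suggest a prefix followed by a zero-padded six-digit counter. The sketch below builds accounts and leads in that style with seeded latent traits. Only the ID format and the default counts come from the text; the trait names (`icp_fit`, `urgency`), the Gaussian draws, and the way leads are attached to accounts are assumptions, and contacts are omitted for brevity.

```python
import random


def make_id(prefix: str, index: int) -> str:
    """Format IDs like acct_000001 or lead_000001."""
    return f"{prefix}_{index:06d}"


def build_population(seed: int, n_accounts: int = 1500, n_leads: int = 5000):
    """Create accounts and leads with deterministic IDs and latent traits."""
    rng = random.Random(seed)
    accounts = [
        {"account_id": make_id("acct", i), "icp_fit": rng.gauss(0.0, 1.0)}
        for i in range(1, n_accounts + 1)
    ]
    leads = []
    for i in range(1, n_leads + 1):
        account = rng.choice(accounts)  # each lead belongs to one account
        leads.append({
            "lead_id": make_id("lead", i),
            "account_id": account["account_id"],
            "urgency": rng.gauss(0.0, 1.0),
        })
    return accounts, leads
```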

4. Simulation

A hybrid discrete-time simulator runs a 90-day daily loop. Each day, each active lead may:

receive a touch (email, call, demo, etc.)
generate a session
receive a sales activity
advance or stall in the pipeline stage sequence
convert (via a calibrated hazard function)

Everything is event-derived: the converted_within_90_days label emerges from simulated events rather than from a directly sampled Bernoulli variable.
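
The difference between an event-derived label and a sampled one can be shown with a toy daily loop. In this sketch, a lead accumulates touch events, each day's conversion hazard depends on whether a touch happened, and the label is computed afterwards by inspecting the event log. The probabilities and the event schema are made up for illustration and are far simpler than the mechanisms described above.

```python
import random


def simulate_lead(rng: random.Random, daily_touch_prob: float = 0.3,
                  base_hazard: float = 0.002, touch_boost: float = 0.01,
                  horizon_days: int = 90) -> list:
    """Run a daily loop for one lead and return its event log."""
    events = []
    for day in range(1, horizon_days + 1):
        touched = rng.random() < daily_touch_prob
        if touched:
            events.append({"day": day, "type": "touch"})
        # The hazard is evaluated per day and reacts to that day's events.
        hazard = base_hazard + (touch_boost if touched else 0.0)
        if rng.random() < hazard:
            events.append({"day": day, "type": "conversion"})
            break  # a converted lead leaves the active pool
    return events


def label_from_events(events: list, horizon_days: int = 90) -> bool:
    """Derive the label from the log; nothing is sampled at this point."""
    return any(e["type"] == "conversion" and e["day"] <= horizon_days
               for e in events)
```

Because the label is read off the log, every positive row has a traceable history (here, the touches that raised the hazard), which is what a directly sampled Bernoulli label would lack.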

5. Rendering

The simulation state is projected into:

9 relational tables — snapshot-filtered to ≤ anchor day for public bundles
A flat ML-ready task table (the train/valid/test splits)
Metadata files (manifest, feature dictionary, dataset card)

The exposure mode controls which of these outputs are written to a bundle.
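
Snapshot filtering to "≤ anchor day" can be read as: drop every event that happens after the anchor before computing features, so that no feature leaks information from the future. The sketch below shows that idea on a hand-written event list. The event schema, the function names, and the sample rows are illustrative assumptions, not leadforge's table layout.

```python
def snapshot_filter(events: list, anchor_day: int) -> list:
    """Keep only events observed on or before the anchor day."""
    return [e for e in events if e["day"] <= anchor_day]


def touch_count_feature(events: list, lead_id: str, anchor_day: int) -> int:
    """Count a lead's touches using snapshot-visible events only."""
    visible = snapshot_filter(events, anchor_day)
    return sum(1 for e in visible
               if e["lead_id"] == lead_id and e["type"] == "touch")


# Illustrative rows only; not real leadforge output.
events = [
    {"lead_id": "lead_000001", "day": 3, "type": "touch"},
    {"lead_id": "lead_000001", "day": 12, "type": "touch"},
    {"lead_id": "lead_000001", "day": 40, "type": "conversion"},
]
```

With an anchor day of 10, `touch_count_feature(events, "lead_000001", 10)` sees only the day-3 touch, and the day-40 conversion stays hidden from the feature table.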

Reproducibility

All generation is deterministic given (recipe, config, seed, package version). The seed is recorded in manifest.json along with the package version, so any bundle can be exactly reproduced.
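
A simple way to reason about this guarantee is to treat the four inputs as a key: two bundles with the same key should be identical. The sketch below builds a manifest-like dictionary and a stable fingerprint of it. The text confirms only that the seed and package version are stored in manifest.json; the other field names and the `fingerprint` helper are hypothetical.

```python
import hashlib
import json


def build_manifest(recipe: str, config: dict, seed: int,
                   package_version: str) -> dict:
    """Collect the inputs that jointly determine a generated bundle."""
    return {
        "recipe": recipe,
        "config": config,
        "seed": seed,
        "package_version": package_version,
    }


def fingerprint(manifest: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Comparing fingerprints before regenerating a bundle is a cheap check that all four inputs, not just the seed, match the original run.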
