v1 Acceptance Gates

Concrete, machine-checkable criteria for "v1 ready". A release candidate that satisfies every gate below can be tagged and published.

This file is the human-readable contract. Numeric bands are tuned in the companion YAML (v1_acceptance_gates_bands.yaml) — that file is loaded by scripts/validate_release_candidate.py and is the single source of truth for the per-band numbers. This document records the medians and rationale.

Initial calibration: 2026-05-06 from the PR 3.3 N=5 sweep on the regenerated PR 2.2 bundles (BUNDLE_SCHEMA_VERSION 5; see release/validation/validation_report.json). Re-tune when the recipe, mechanism layer, or difficulty profiles change.

Naming and versioning gate

G1.1 Dataset release name: leadforge-lead-scoring-v1. Locked in Phase 1 (PR #61 milestone rename + roadmap edits; reaffirmed in PR 1.1's docs/release/v1_current_state_audit.md).
G1.2 Kaggle slug: leadforge-lead-scoring-v1.
G1.3 Hugging Face repo: leadforge-lead-scoring-v1 (public family) and leadforge-lead-scoring-v1-instructor (companion).
G1.4 Bundle package_version reflects the leadforge package at build time.
G1.5 Bundle bundle_schema_version == 5.

Reproducibility gate

G2.1 Two independent builds with the same --generation-timestamp produce byte-identical bundles modulo timestamp-derived fields. Verified by scripts/verify_hash_determinism.py.
G2.2 All file SHA-256 hashes recorded in manifest.json match the actual files at validation time.
G2.3 A clean-environment regeneration on a different machine produces identical bundles to the developer's build (if not literally identical, deviations must be explainable solely by the timestamp field).

Structural gate

G3.1 Every bundle in the family contains manifest.json, dataset_card.md, feature_dictionary.csv, tables/, tasks/.
G3.2 Every required relational table for the bundle's mode is present and non-empty.
G3.3 All foreign-key constraints in ALL_CONSTRAINTS hold.
G3.4 All task splits (train, valid, test) are non-empty and disjoint.

Relational leakage gate (the v1 critical gate)

This is the gate that motivates the v1 release. Failures here are blockers.

G4.1 Public tables/leads.parquet does not contain converted_within_90_days or conversion_timestamp.
G4.2 Public tables/opportunities.parquet does not contain close_outcome or closed_at.
G4.3 Public bundles do not contain tables/customers.parquet or tables/subscriptions.parquet.
G4.4 Public event tables contain no rows past the snapshot: no touches row with touch_timestamp > lead_created_at + snapshot_day, no sessions row with session_timestamp > lead_created_at + snapshot_day, no sales_activities row with activity_timestamp > lead_created_at + snapshot_day. Public opportunities rows must satisfy created_at <= lead_created_at + snapshot_day.
G4.5 Probabilistic relational reconstruction probe: a model trained using only public relational features (joined on lead_id/account_id/contact_id) achieves AUC ≤ 0.65 against converted_within_90_days. Threshold matches the existing scripts/probe_relational_leakage.py --max-accuracy 0.65 posture used for the structural sweep on the alpha bundles; honest relational features (per-lead opportunity counts and ACV aggregates) carry signal but should not solo-dominate the task.
G4.6 Manifest field relational_snapshot_safe == true for student_public bundles; false for research_instructor.

Direct leakage gate

G5.1 Models trained using only post-snapshot aggregate features (total_touches_all, the v1 leakage trap) achieve AUC ≤ 0.95 on the test split. Observed median across seeds: ~0.54–0.55 per tier (max ~0.62). The trap is meant to be predictive — the band only flags total-domination scenarios.
G5.2 Models trained using only suspect-stage columns (current_stage, is_sql) achieve AUC ≤ 0.95 when present. Both columns are redacted under the student_public exposure mode; the gate is therefore effectively skipped on public bundles, but the band is declared for the instructor companion's full-horizon export.
G5.3 ID-only models (using only lead_id/account_id/contact_id) achieve AUC ≤ 0.60. Observed median per tier ~0.49–0.51 (max ~0.56); the 0.60 ceiling admits stratified-CV variance without green-lighting genuine ID-encoded leakage.
G5.4 No public feature derives from events with timestamp > lead_created_at + snapshot_day (audited at the FeatureSpec level — recipe must declare provenance).

Split leakage gate

G6.1 Account-overlap audit: same account_id in train + test is documented as intentional or absent.
G6.2 Contact-overlap audit: same contact_id in train + test is documented as intentional or absent.
G6.3 Near-duplicate row detection: no rows with feature-vector cosine similarity > 0.99 across splits.
G6.4 Cohort-time-shift split exists: AUC degradation under cohort split lies within [-0.05, 0.10]. Observed range across tiers is roughly [-0.02, 0.02] — v1's bundles are roughly IID-balanced over the 90-day horizon (no time-of-year drift baked in), so the gate is informational in v1 rather than discriminating. v2 will explicitly inject seasonality / quarterly close cycles to make the gate bite; the lower bound stays loose for v1.

Performance gates (per tier)

Bands fitted to the PR 3.3 N=5 sweep on release/{intro,intermediate,advanced}/. All numeric bands live in v1_acceptance_gates_bands.yaml; medians and rationale follow.

These bands are regression fences, not realism thresholds. They are calibrated to the observed five-seed spread for this DGP and recipe configuration. A band being "wide" does not mean any value within it is equally realistic — it means the validator will not flag a new bundle as broken unless a metric drifts outside that window. The medians in each gate note are the meaningful targets; bands only fire on substantial unintended regressions. Tightening the bands is expected work when the DGP is redesigned for v2.

Intro tier

G7.1.1 Conversion rate within [0.24, 0.61]. Median 0.4267.
G7.1.2 LR AUC within [0.82, 0.94]. Median 0.8788.
G7.1.3 GBM AUC within [0.82, 0.92]. Median 0.8729.
G7.1.4 GBM-vs-LR AUC delta within [-0.05, 0.05]. Median -0.0045. See G7.4.4 for the cross-tier sign concern.
G7.1.5 Average Precision (LR) within [0.62, 0.90]. Median 0.7608.
G7.1.6 P@100 within [0.65, 0.95]. Median 0.80.
G7.1.7 Brier score ≤ 0.17. Median 0.1301.
G7.1.8 Calibration max-bin error ≤ 0.65. Median 0.2497. Calibration metrics are noisy at small per-bin n; the band reflects observed spread, not a tightness claim.

Intermediate tier

G7.2.1 Conversion rate within [0.12, 0.31]. Median 0.2160.
G7.2.2 LR AUC within [0.84, 0.93]. Median 0.8859.
G7.2.3 GBM AUC within [0.82, 0.93]. Median 0.8755.
G7.2.4 GBM-vs-LR AUC delta within [-0.04, 0.03]. Median -0.0072.
G7.2.5 Average Precision (LR) within [0.40, 0.75]. Median 0.5752.
G7.2.6 P@100 within [0.45, 0.75]. Median 0.59.
G7.2.7 Brier score ≤ 0.14. Median 0.1096.
G7.2.8 Calibration max-bin error ≤ 0.90. Median 0.2490.

Advanced tier

G7.3.1 Conversion rate within [0.04, 0.12]. Median 0.0840.
G7.3.2 LR AUC within [0.81, 0.97]. Median 0.8861.
G7.3.3 GBM AUC within [0.84, 0.91]. Median 0.8726.
G7.3.4 GBM-vs-LR AUC delta within [-0.06, 0.04]. Median -0.0133.
G7.3.5 Average Precision (LR) within [0.19, 0.52]. Median 0.3514.
G7.3.6 P@100 within [0.20, 0.55]. Median 0.34.
G7.3.7 Brier score ≤ 0.09. Median 0.0611.
G7.3.8 Calibration max-bin error ≤ 1.0. Median 0.5234. Class imbalance inflates per-bin variance; the band admits the observed range without green-lighting total miscalibration.

Cross-tier ordering

G7.4.1 AP ordering: intro > intermediate > advanced. Holds.
G7.4.2 P@K ordering: intro > intermediate > advanced. Holds.
G7.4.3 Conversion-rate ordering: intro > intermediate > advanced. Holds.
G7.4.4 GBM-vs-LR delta is positive in every tier (sophistication is rewarded). Known finding (v1 → v2). Observed median delta is slightly negative in every tier (intro -0.0045, intermediate -0.0072, advanced -0.0133): v1's snapshot is dominated by linear features (engagement aggregates + firmographics) and a HistGBM does not consistently beat a regularised logistic regression at this signal level. The PR 3.3 driver gates on the per-tier gbm_minus_lr_auc bands (G7.1.4 / G7.2.4 / G7.3.4) rather than the cross-tier sign check; v2 will introduce non-linear interactions in the simulator (saturation curves, threshold effects) so the gate bites. Tracked in the post-v1 roadmap.

Cross-seed stability gate

G8.1 Run N=5 seeds per tier; the max-min spread of each headline metric stays under the per-metric ceiling: LR/GBM AUC ≤ 0.06; GBM−LR delta ≤ 0.05; LR Average Precision ≤ 0.13; Brier score ≤ 0.04; conversion rate ≤ 0.15. Calibration max-bin error is intentionally not bounded here — its per-bin-n noise dominates the cross-seed signal at v1's class balances.
G8.2 No degenerate seeds (conversion rate < 1% or > 99% in any seed).

Public/instructor diff gate

G9.1 Every public/instructor difference is intentional and listed in release/EXPOSURE_DELTA.md.
G9.2 Manifest redacted_columns field matches the actual public bundle's column omissions.
G9.3 Instructor-companion-only artifacts (metadata/, leakage-trap features, full-horizon tables) are absent from public bundles.

Documentation gate

G10.1 release/README.md (the dataset card) passes a Datasheets-for-Datasets / Data Cards Playbook checklist:
- Provenance (who, when, why)
- Motivation
- Composition (entities, features, label, splits)
- Collection / generation method
- Preprocessing and transformations
- Recommended uses
- Out-of-scope uses
- Known limitations and biases
- Maintenance plan
G10.2 docs/release/generation_method.md exists and is readable as a standalone document.
G10.3 docs/release/feature_dictionary.md covers every feature in the snapshot CSV with description, dtype, source, leakage flag, and recommended-for-modeling flag.
G10.4 docs/release/break_me_guide.md exists and links from release/README.md.
G10.5 docs/release/v1_release_notes.md exists and is human-readable.
G10.6 Every claim made in the dataset card about realism, calibration, or difficulty has a backing reference in release/validation/validation_report.md.

Platform packaging gate

Kaggle

G11.1 release/kaggle/dataset-metadata.json exists and validates against current Kaggle schema:
- title length 6-50 chars
- subtitle length 20-80 chars
- id slug 3-50 chars
- exactly one entry in licenses
- expectedUpdateFrequency from approved values (never for v1)
- all resources[].schema.fields listed in column order
G11.2 release/dataset-cover-image.png exists with dimensions ≥ 560 × 280.
G11.3 Kaggle dry-run package builds without error: kaggle datasets create -p release/kaggle --dir-mode zip (in --dry-run if available, or shape-validate without).

Hugging Face

G12.1 release/huggingface/README.md exists with valid YAML metadata: pretty_name, license, language: en, task_categories: [tabular-classification], size_categories, tags, configs.
G12.2 Exactly one config has default: true.
G12.3 Local load_dataset(release/huggingface, "intro") succeeds; same for intermediate, advanced.
G12.4 Companion repo (leadforge-lead-scoring-v1-instructor) packages independently and loads via load_dataset() for at least one config.

Notebook gate

G13.1 All four notebooks in release/notebooks/ execute top-to-bottom from a clean environment without errors.
G13.2 Each notebook's printed metrics match the validation report within tolerance ±0.05 on AUC / AP / P@K and ±0.05 on Brier (out of scope for PR 3.3; set when notebooks land in Phase 6).
G13.3 Each notebook explicitly distinguishes the public path from the instructor companion path; instructor-only artifacts are not loaded by the public notebooks.

LLM critique gate

G14.1 scripts/run_llm_critique.py runs successfully when credentials are present.
G14.2 The critique produces a structured findings JSON conforming to the schema in v1_release_design.md §"LLM critique".
G14.3 No unresolved high-severity findings remain. Each high-severity finding is either:
- resolved in code (with a backing PR), or
- documented in docs/release/v2_decision_log.md as intentional-and-accepted with rationale.
G14.4 Raw LLM outputs are archived under release/validation/llm_critique_raw_*.json for audit.

Adversarial framing gate

G15.1 GitHub issue templates (dataset_breakage_report.yml, realism_feedback.yml) render correctly.
G15.2 docs/release/break_me_guide.md is linked from release/README.md, the Kaggle description, and the HF README.
G15.3 docs/release/v2_decision_log.md exists (may be empty at launch).

Out-of-scope acknowledgment

The following are explicitly NOT release blockers for v1; they live in post_v1_roadmap.md:

Channel-conditional MQL→SQL rates (audit only in v1).
Log-normal sales-cycle distributions.
Demographic noise injection.
Quantitative semantic-diversity validator.
Multi-provider LLM critique CI integration.
LTV labels as first-class outputs.
Second vertical / per-vertical calibration.
Leaderboard mini-site.

Definition of green

A release candidate is green (ready to publish) when:

All gates G1–G15 pass.
The validation report explicitly cites the gate that justifies each metric band.
A human signs off on v2_decision_log.md entries for any accepted-with-rationale findings.

A release candidate is blocked if any of:

G4.* relational leakage gate fails.
G5.* direct leakage gate fails.
G7.4.4 GBM-vs-LR delta is non-positive in every tier and the per-tier gbm_minus_lr_auc bands have not been re-tuned to fit the new dataset (i.e. the dataset has degraded; v1's known-finding posture is not a free pass for future regressions).
G14.3 has unresolved high-severity findings.

Naming and versioning gate​

Reproducibility gate​

Structural gate​

Relational leakage gate (the v1 critical gate)​

Direct leakage gate​

Split leakage gate​

Performance gates (per tier)​

Intro tier​

Intermediate tier​

Advanced tier​

Cross-tier ordering​

Cross-seed stability gate​

Public/instructor diff gate​

Documentation gate​

Platform packaging gate​

Kaggle​

Hugging Face​

Notebook gate​

LLM critique gate​

Adversarial framing gate​

Out-of-scope acknowledgment​

Definition of green​