Bundle schema reference

The bundle schema version is stamped in manifest.json as bundle_schema_version. The current version is 5.

manifest.json schema

{
  "bundle_schema_version": 5,
  "package_version": "1.0.0",
  "recipe_id": "b2b_saas_procurement_v1",
  "seed": 42,
  "generation_timestamp": "2026-05-27T00:00:00Z",
  "exposure_mode": "student_public",
  "difficulty_profile": "intermediate",
  "table_inventory": {
    "accounts": 1500,
    "contacts": 4200,
    "leads": 5000,
    "touches": 38421,
    "sessions": 19847,
    "sales_activities": 12034,
    "opportunities": 1421
  },
  "file_hashes": {
    "tables/leads.parquet": "sha256:abc123..."
  }
}
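A consumer may want to check a bundle against its manifest before loading any tables. The following is a minimal sketch, not part of the bundle tooling. It assumes that manifest.json sits at the bundle root, that the keys of file_hashes are paths relative to that root, and that each value is the prefix "sha256:" followed by the lowercase hex digest of the raw file bytes. The names verify_bundle and sha256_of are hypothetical.

```python
import hashlib
import json
from pathlib import Path

EXPECTED_SCHEMA_VERSION = 5  # current version stated in this reference


def sha256_of(path: Path) -> str:
    """Return 'sha256:<hex>' for a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def verify_bundle(bundle_dir) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    bundle_dir = Path(bundle_dir)
    manifest = json.loads((bundle_dir / "manifest.json").read_text(encoding="utf-8"))
    problems = []

    version = manifest.get("bundle_schema_version")
    if version != EXPECTED_SCHEMA_VERSION:
        problems.append(f"unexpected bundle_schema_version: {version!r}")

    # Assumption: hash keys are paths relative to the bundle root.
    for rel_path, expected in manifest.get("file_hashes", {}).items():
        target = bundle_dir / rel_path
        if not target.is_file():
            problems.append(f"missing file: {rel_path}")
        elif sha256_of(target) != expected:
            problems.append(f"hash mismatch: {rel_path}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every mismatch in a single pass.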

task_manifest.json schema

{
  "task_id": "converted_within_90_days",
  "label_column": "converted_within_90_days",
  "label_window_days": 90,
  "primary_table": "leads",
  "split": { "train": 0.7, "valid": 0.15, "test": 0.15 },
  "description": "..."
}
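A few sanity checks on a task manifest can be sketched as follows. This reference does not define validation rules, so the checks here are assumptions: split fractions should sum to 1, label_window_days should be a positive integer, and primary_table should name a table listed in the manifest's table_inventory. The function name validate_task_manifest is hypothetical.

```python
import math


def validate_task_manifest(task: dict, table_names) -> list[str]:
    """Return a list of problems found in a task manifest (empty if none)."""
    problems = []

    # Assumption: the three split fractions are meant to cover all rows.
    split = task.get("split", {})
    total = math.fsum(split.values())
    if not math.isclose(total, 1.0):
        problems.append(f"split fractions sum to {total}, expected 1.0")

    days = task.get("label_window_days")
    if not (isinstance(days, int) and days > 0):
        problems.append(f"label_window_days must be a positive integer, got {days!r}")

    # Assumption: primary_table refers to a key of table_inventory.
    if task.get("primary_table") not in set(table_names):
        problems.append(f"unknown primary_table: {task.get('primary_table')!r}")
    return problems


example_task = {
    "task_id": "converted_within_90_days",
    "label_column": "converted_within_90_days",
    "label_window_days": 90,
    "primary_table": "leads",
    "split": {"train": 0.7, "valid": 0.15, "test": 0.15},
    "description": "...",
}
example_tables = ["accounts", "contacts", "leads", "touches",
                  "sessions", "sales_activities", "opportunities"]
problems = validate_task_manifest(example_task, example_tables)
```

The sum is compared with math.isclose rather than ==, because fractions such as 0.15 are not exactly representable in binary floating point.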

Entity ID format

All entity IDs are deterministic strings: a type prefix, an underscore, and a six-digit, zero-padded number (NNNNNN):

Entity          Format         Example
Account         acct_NNNNNN    acct_000001
Contact         cont_NNNNNN    cont_000042
Lead            lead_NNNNNN    lead_002501
Touch           touch_NNNNNN   touch_019844
Session         sess_NNNNNN    sess_005002
Sales activity  sact_NNNNNN    sact_007311
Opportunity     oppt_NNNNNN    oppt_000893

IDs are stable: the same (recipe, seed, entity index) always produces the same ID.
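The ID format can be illustrated with a small formatter and parser. This is a sketch rather than the generator's own code. It assumes that the numeric part is the entity index itself and that it is always exactly six digits wide. The helper names and the dictionary keys are hypothetical.

```python
import re

# Prefixes taken from the table above; the dictionary keys are illustrative.
PREFIXES = {
    "account": "acct",
    "contact": "cont",
    "lead": "lead",
    "touch": "touch",
    "session": "sess",
    "sales_activity": "sact",
    "opportunity": "oppt",
}

ID_PATTERN = re.compile(r"(acct|cont|lead|touch|sess|sact|oppt)_(\d{6})")


def format_entity_id(entity: str, index: int) -> str:
    """Build an ID such as 'lead_002501' from an entity kind and an index."""
    if not 0 <= index <= 999_999:
        raise ValueError(f"index out of range for six digits: {index}")
    return f"{PREFIXES[entity]}_{index:06d}"


def parse_entity_id(entity_id: str) -> tuple[str, int]:
    """Split an ID into (prefix, number); raise ValueError if malformed."""
    match = ID_PATTERN.fullmatch(entity_id)
    if match is None:
        raise ValueError(f"not a valid entity ID: {entity_id!r}")
    return match.group(1), int(match.group(2))
```

Because the numeric part has a fixed width, sorting IDs of one entity type as strings gives the same order as sorting them by number.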

Parquet conventions

  • All tables use snappy compression.
  • Timestamps are stored as datetime64[us, UTC] (microsecond resolution, timezone-aware UTC).
  • Nullable integers use Int64 (pandas nullable dtype), not float64.
  • Boolean columns use bool, not int.
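The conventions above can be sketched in pandas as follows. The column names created_at, employee_count, and is_converted are hypothetical and not taken from the actual table schemas. The example assumes pandas 2.0 or later, which supports non-nanosecond timestamp units, and a Parquet engine such as pyarrow for the write step.

```python
import pandas as pd

# Hypothetical columns, chosen only to show one column per convention.
df = pd.DataFrame({
    "lead_id": ["lead_000001", "lead_000002"],
    "created_at": pd.to_datetime(
        ["2026-01-05T09:30:00Z", "2026-01-06T14:00:00Z"], utc=True
    ),
    # Nullable integer: a missing value stays <NA> instead of forcing float64.
    "employee_count": pd.array([250, None], dtype="Int64"),
    "is_converted": [True, False],
})

# Convert timestamps to microsecond resolution while keeping the UTC timezone.
df["created_at"] = df["created_at"].dt.as_unit("us")


def write_table(frame: pd.DataFrame, path: str) -> None:
    """Write a table with snappy compression (requires a Parquet engine)."""
    frame.to_parquet(path, compression="snappy", index=False)
```

Using Int64 keeps integer columns with missing values as integers. With plain NumPy dtypes, a single missing value would turn the whole column into float64.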