PE-301a · Module 1
Building the Training Dataset
The training dataset is the historical closed-deal data the model learns from. Each row is a deal, each column is a feature, and the target variable is the outcome: 1 for closed-won, 0 for closed-lost. The quality of this dataset sets the ceiling on the quality of the model: garbage data produces a garbage model, however sophisticated the algorithm.
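As a minimal sketch of the target encoding described above, the snippet below turns a CRM stage label into the binary outcome and drops deals that are still open. The `stage` field, its labels ("Closed Won", "Closed Lost", "Negotiation"), the deal D-900, and the helper `label_closed_deals` are all hypothetical names chosen for illustration; a real CRM export will likely use different ones.

```python
# Hypothetical CRM export rows; the "stage" field and its labels are
# assumptions for illustration, not names taken from the course material.
raw_deals = [
    {"deal_id": "D-001", "stage": "Closed Won"},
    {"deal_id": "D-002", "stage": "Closed Lost"},
    {"deal_id": "D-900", "stage": "Negotiation"},  # still open, no outcome yet
]

# Target encoding: 1 for closed-won, 0 for closed-lost.
STAGE_TO_OUTCOME = {"Closed Won": 1, "Closed Lost": 0}


def label_closed_deals(deals):
    """Return only closed deals, each with a binary `outcome` target added."""
    labeled = []
    for deal in deals:
        outcome = STAGE_TO_OUTCOME.get(deal["stage"])
        if outcome is None:
            continue  # open deals have no outcome to learn from
        labeled.append({**deal, "outcome": outcome})
    return labeled


training_rows = label_closed_deals(raw_deals)
print(training_rows)
```

Filtering before labeling keeps open deals out of the training data, so every remaining row has a known 0 or 1 outcome.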
Training Dataset Structure
deal_id │ deal_size │ industry    │ source   │ meetings_14d │ email_resp │ dm_engaged │ days_in_stage │ pushes │ outcome
────────┼───────────┼─────────────┼──────────┼──────────────┼────────────┼────────────┼───────────────┼────────┼────────
D-001   │ 85000     │ Healthcare  │ Referral │ 3            │ 0.67       │ Yes        │ 12            │ 0      │ 1
D-002   │ 42000     │ Technology  │ Inbound  │ 1            │ 0.33       │ No         │ 28            │ 2      │ 0
D-003   │ 120000    │ Finance     │ Outbound │ 4            │ 0.80       │ Yes        │ 8             │ 0      │ 1
D-004   │ 35000     │ Retail      │ Inbound  │ 0            │ 0.15       │ No         │ 45            │ 3      │ 0
Minimum: 200 rows (100+ won, 100+ lost)
Target: 500+ rows for robust patterns
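The size guidelines above can be checked mechanically before any training run. The following is a rough sketch, not a prescribed procedure: the function name `check_dataset_size` and the constant names are hypothetical, and treating the 500-row target as also requiring the per-class minimum is an assumption made here.

```python
from collections import Counter

MIN_ROWS = 200        # minimum total rows
MIN_PER_CLASS = 100   # 100+ won and 100+ lost
TARGET_ROWS = 500     # target size for robust patterns


def check_dataset_size(outcomes):
    """Summarize whether a list of 0/1 outcomes meets the size guidelines."""
    counts = Counter(outcomes)
    won, lost = counts[1], counts[0]
    total = won + lost  # values other than 0 or 1 are not counted
    meets_minimum = (
        total >= MIN_ROWS and won >= MIN_PER_CLASS and lost >= MIN_PER_CLASS
    )
    return {
        "total": total,
        "won": won,
        "lost": lost,
        "meets_minimum": meets_minimum,
        # Assumption: the target also requires the per-class minimum.
        "meets_target": meets_minimum and total >= TARGET_ROWS,
    }


# The four sample rows from the table: far too small to train on.
sample_report = check_dataset_size([1, 0, 1, 0])
print(sample_report)

# A synthetic outcome list that clears the minimum but not the target.
synthetic_report = check_dataset_size([1] * 120 + [0] * 100)
print(synthetic_report)
```

Counting won and lost deals separately matters because a dataset can pass the 200-row total while still holding too few examples of one outcome.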