PE-301a · Module 1

Building the Training Dataset

The training dataset is the historical closed-deal data from which the model learns. Each row is one deal. Apart from the deal_id identifier and the outcome target, each column is a feature. The target variable is the outcome: 1 for closed-won, 0 for closed-lost. The quality of this dataset sets the ceiling on the quality of the model: garbage data produces a garbage model, regardless of how sophisticated the algorithm is.

Training Dataset Structure

deal_id | deal_size | industry    | source   | meetings_14d | email_resp | dm_engaged | days_in_stage | pushes | outcome
────────┼───────────┼─────────────┼──────────┼──────────────┼────────────┼────────────┼───────────────┼────────┼────────
D-001   | 85000     | Healthcare  | Referral | 3            | 0.67       | Yes        | 12            | 0      | 1
D-002   | 42000     | Technology  | Inbound  | 1            | 0.33       | No         | 28            | 2      | 0
D-003   | 120000    | Finance     | Outbound | 4            | 0.80       | Yes        | 8             | 0      | 1
D-004   | 35000     | Retail      | Inbound  | 0            | 0.15       | No         | 45            | 3      | 0

Minimum: 200 rows (100+ won, 100+ lost)
Target: 500+ rows for robust patterns
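
The size and balance guidance above can be checked before any training starts. The following is a minimal sketch in Python, not a definitive implementation. It assumes the deals are exported as CSV with the column names from the table. The names load_deals, check_dataset, SAMPLE_CSV, and the threshold constants are hypothetical, introduced only for illustration.

```python
import csv
import io
from collections import Counter

# Thresholds taken from the guidance above (constant names are hypothetical).
MIN_ROWS = 200        # minimum total rows
MIN_PER_CLASS = 100   # minimum rows for each outcome (won and lost)
TARGET_ROWS = 500     # target size for more robust patterns

# The four example rows from the table, written as CSV.
# A real export would contain hundreds of closed deals.
SAMPLE_CSV = """deal_id,deal_size,industry,source,meetings_14d,email_resp,dm_engaged,days_in_stage,pushes,outcome
D-001,85000,Healthcare,Referral,3,0.67,Yes,12,0,1
D-002,42000,Technology,Inbound,1,0.33,No,28,2,0
D-003,120000,Finance,Outbound,4,0.80,Yes,8,0,1
D-004,35000,Retail,Inbound,0,0.15,No,45,3,0
"""


def load_deals(text):
    """Parse CSV text into a list of dicts with typed values."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            "deal_id": raw["deal_id"],
            "deal_size": float(raw["deal_size"]),
            "industry": raw["industry"],
            "source": raw["source"],
            "meetings_14d": int(raw["meetings_14d"]),
            "email_resp": float(raw["email_resp"]),
            "dm_engaged": raw["dm_engaged"].strip().lower() == "yes",
            "days_in_stage": int(raw["days_in_stage"]),
            "pushes": int(raw["pushes"]),
            "outcome": int(raw["outcome"]),  # 1 = closed-won, 0 = closed-lost
        })
    return rows


def check_dataset(rows):
    """Compare row counts per outcome against the minimum and target sizes."""
    counts = Counter(row["outcome"] for row in rows)
    won, lost = counts.get(1, 0), counts.get(0, 0)

    problems = []
    unexpected = sorted(set(counts) - {0, 1})
    if unexpected:
        problems.append(f"unexpected outcome values: {unexpected}")
    if len(rows) < MIN_ROWS:
        problems.append(f"only {len(rows)} rows, need at least {MIN_ROWS}")
    if won < MIN_PER_CLASS:
        problems.append(f"only {won} won deals, need at least {MIN_PER_CLASS}")
    if lost < MIN_PER_CLASS:
        problems.append(f"only {lost} lost deals, need at least {MIN_PER_CLASS}")

    return {
        "rows": len(rows),
        "won": won,
        "lost": lost,
        "meets_minimum": not problems,
        "meets_target": not problems and len(rows) >= TARGET_ROWS,
        "problems": problems,
    }


if __name__ == "__main__":
    report = check_dataset(load_deals(SAMPLE_CSV))
    print(report)
```

Run against the four sample rows, the check reports that the minimum is not met, which is expected: the table only illustrates the structure. Counting won and lost deals separately matters because a dataset can reach 200 rows and still contain too few examples of one outcome for the model to learn from.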