CDX-301e · Module 2
Dataset Processing & Pipelines
3 min read
Dataset processing uses Codex Cloud as a data pipeline engine. Each microVM processes a partition of the data — transforming, validating, enriching, or analyzing it — and produces structured output that is aggregated in the reduce step. This pattern is useful for code-adjacent data tasks: processing configuration files across environments, analyzing log files for patterns, generating test fixtures from production data schemas, or validating data migrations.
The key constraint is data ingestion. Cloud microVMs start with a repository clone, not arbitrary data. To process external datasets, the data must either live in the repository (checked in or via LFS), be fetchable from an allowlisted endpoint, or be generated synthetically from a schema. For large datasets, the recommended pattern is pre-processing: split the dataset into partitions, check each partition into a branch, and submit tasks that process their assigned partition. The results are collected as structured output files in each branch.
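As a sketch, that pre-processing step might look like the following, assuming a newline-delimited dataset; the file names, partition size, and directory layout are illustrative, and a real pipeline would create and commit a branch per partition rather than just a directory:

```shell
#!/usr/bin/env bash
# Sketch: split a large newline-delimited dataset into fixed-size
# partitions, one directory per partition, ready to be checked into
# per-partition branches.
set -euo pipefail

DATASET="records.ndjson"          # hypothetical input file
LINES_PER_PARTITION=1000

# Generate a toy dataset so the sketch is self-contained.
seq 1 2500 | awk '{print "{\"id\": " $1 "}"}' > "$DATASET"

# GNU split -l -d writes numbered chunk files: part-00, part-01, ...
split -l "$LINES_PER_PARTITION" -d "$DATASET" part-

# One directory per partition; a real pipeline would `git checkout -b`
# a partition branch here and commit the data before submitting a task.
for chunk in part-*; do
  mkdir -p "partitions/${chunk}"
  mv "$chunk" "partitions/${chunk}/data.ndjson"
done

ls partitions/
```

Each resulting partition directory then becomes the working set for one cloud task.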
# Dataset pipeline: validate and transform config files
# Step 1: partition configs by environment
ENVS=(dev staging prod us-east eu-west ap-south)
for env in "${ENVS[@]}"; do
  codex cloud "validate all config files in config/${env}/. \
    Check for: missing required fields, type mismatches, \
    deprecated keys, and cross-reference inconsistencies. \
    Output results as JSON to validation-report-${env}.json" &
done
wait
# Step 2: aggregate reports (assumes each task's report has been
# merged back into the working branch first)
codex cloud "read all validation-report-*.json files. \
Produce a summary table: environment, total configs, \
errors found, error categories, severity counts. \
Output as markdown to VALIDATION-SUMMARY.md"
Do This
- Use structured output formats (JSON, CSV) so the reduce step can parse programmatically
- Partition datasets along natural boundaries (environment, region, module) to maintain context
- Validate each partition's output schema before aggregation to catch parsing errors early
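The schema pre-check from the last point can be sketched as a local gate before aggregation. This assumes reports follow the validation-report-<env>.json naming from the pipeline above, and uses Python's stdlib json.tool purely as a well-formedness check; a real gate would also verify required fields:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy reports standing in for per-branch task output; the staging
# report is deliberately malformed to exercise the gate.
mkdir -p reports
printf '{"env": "dev", "errors": []}\n' > reports/validation-report-dev.json
printf '{"env": "prod", "errors": ["missing field"]}\n' > reports/validation-report-prod.json
printf 'not json at all\n' > reports/validation-report-staging.json

bad=0
for report in reports/validation-report-*.json; do
  # json.tool exits non-zero on malformed JSON, so it doubles as a
  # cheap parse check before any aggregation runs.
  if ! python3 -m json.tool "$report" > /dev/null 2>&1; then
    echo "SCHEMA FAIL: $report"
    bad=$((bad + 1))
  fi
done
echo "invalid reports: $bad"
```

Failing fast here is cheaper than discovering a corrupt partition halfway through the reduce step.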
Avoid This
- Process multi-gigabyte datasets in Codex Cloud — it is optimized for code, not data volume
- Use unstructured text output from parallel tasks — aggregation becomes string parsing hell
- Skip the aggregation step — raw partition outputs without synthesis are not actionable
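To make the first two points concrete, here is a minimal local reduce over structured partition outputs. It assumes each task emitted CSV lines of the form "env,file,errors" — an illustrative column layout, not a fixed contract — and folds them into a markdown summary with awk:

```shell
#!/usr/bin/env bash
set -euo pipefail

mkdir -p out
# Toy partition outputs standing in for per-branch task results.
cat > out/report-dev.csv <<'EOF'
dev,app.yaml,0
dev,db.yaml,2
EOF
cat > out/report-prod.csv <<'EOF'
prod,app.yaml,1
EOF

# Reduce: per environment, count configs and sum reported errors.
{
  echo "| environment | configs | errors |"
  echo "|---|---|---|"
  awk -F, '{ configs[$1]++; errors[$1] += $3 }
           END { for (env in configs)
                   printf "| %s | %d | %d |\n", env, configs[env], errors[env] }' \
    out/report-*.csv | sort
} > SUMMARY.md

cat SUMMARY.md
```

Because the partition outputs are structured, the reduce is a few lines of awk rather than ad hoc text scraping.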