CDX-301e · Module 2
Dataset Processing & Pipelines
3 min read
Dataset processing uses Codex Cloud as a data pipeline engine. Each microVM processes a partition of the data — transforming, validating, enriching, or analyzing it — and produces structured output that is aggregated in the reduce step. This pattern is useful for code-adjacent data tasks: processing configuration files across environments, analyzing log files for patterns, generating test fixtures from production data schemas, or validating data migrations.
The key constraint is data ingestion. Cloud microVMs start with a repository clone, not arbitrary data. To process external datasets, the data must either live in the repository (checked in or via LFS), be fetchable from an allowlisted endpoint, or be generated synthetically from a schema. For large datasets, the recommended pattern is pre-processing: split the dataset into partitions, check each partition into a branch, and submit tasks that process their assigned partition. The results are collected as structured output files in each branch.
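As a sketch, that pre-processing step might look like the following, assuming a newline-delimited dataset; the file names, partition size, and directory layout are illustrative, and a real pipeline would create and commit a branch per partition rather than just a directory:

```shell
#!/usr/bin/env bash
# Sketch: split a large newline-delimited dataset into fixed-size
# partitions, one directory per partition, ready to be checked into
# per-partition branches.
set -euo pipefail

DATASET="records.ndjson"          # hypothetical input file
LINES_PER_PARTITION=1000

# Generate a toy dataset so the sketch is self-contained.
seq 1 2500 | awk '{print "{\"id\": " $1 "}"}' > "$DATASET"

# GNU split -l -d writes numbered chunk files: part-00, part-01, ...
split -l "$LINES_PER_PARTITION" -d "$DATASET" part-

# One directory per partition; a real pipeline would `git checkout -b`
# a partition branch here and commit the data before submitting a task.
for chunk in part-*; do
  mkdir -p "partitions/${chunk}"
  mv "$chunk" "partitions/${chunk}/data.ndjson"
done

ls partitions/
```

Each resulting partition directory then becomes the working set for one cloud task.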
# Dataset pipeline: validate and transform config files
# Step 1: partition configs by environment
ENVS=(dev staging prod us-east eu-west ap-south)
for env in "${ENVS[@]}"; do
  codex cloud "validate all config files in config/${env}/. \
    Check for: missing required fields, type mismatches, \
    deprecated keys, and cross-reference inconsistencies. \
    Output results as JSON to validation-report-${env}.json" &
done
wait
# Step 2: aggregate reports (assumes each task's report has been
# merged back into the working branch first)
codex cloud "read all validation-report-*.json files. \
Produce a summary table: environment, total configs, \
errors found, error categories, severity counts. \
Output as markdown to VALIDATION-SUMMARY.md"
Do This
- Use structured output formats (JSON, CSV) so the reduce step can parse programmatically
- Partition datasets along natural boundaries (environment, region, module) to maintain context
- Validate each partition's output schema before aggregation to catch parsing errors early
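The schema pre-check from the last point can be sketched as a local gate before aggregation. This assumes reports follow the validation-report-<env>.json naming from the pipeline above, and uses Python's stdlib json.tool purely as a well-formedness check; a real gate would also verify required fields:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy reports standing in for per-branch task output; the staging
# report is deliberately malformed to exercise the gate.
mkdir -p reports
printf '{"env": "dev", "errors": []}\n' > reports/validation-report-dev.json
printf '{"env": "prod", "errors": ["missing field"]}\n' > reports/validation-report-prod.json
printf 'not json at all\n' > reports/validation-report-staging.json

bad=0
for report in reports/validation-report-*.json; do
  # json.tool exits non-zero on malformed JSON, so it doubles as a
  # cheap parse check before any aggregation runs.
  if ! python3 -m json.tool "$report" > /dev/null 2>&1; then
    echo "SCHEMA FAIL: $report"
    bad=$((bad + 1))
  fi
done
echo "invalid reports: $bad"
```

Failing fast here is cheaper than discovering a corrupt partition halfway through the reduce step.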
Avoid This
- Process multi-gigabyte datasets in Codex Cloud — it is optimized for code, not data volume
- Use unstructured text output from parallel tasks — aggregation becomes string parsing hell
- Skip the aggregation step — raw partition outputs without synthesis are not actionable
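To make the first two points concrete, here is a minimal local reduce over structured partition outputs. It assumes each task emitted CSV lines of the form "env,file,errors" — an illustrative column layout, not a fixed contract — and folds them into a markdown summary with awk:

```shell
#!/usr/bin/env bash
set -euo pipefail

mkdir -p out
# Toy partition outputs standing in for per-branch task results.
cat > out/report-dev.csv <<'EOF'
dev,app.yaml,0
dev,db.yaml,2
EOF
cat > out/report-prod.csv <<'EOF'
prod,app.yaml,1
EOF

# Reduce: per environment, count configs and sum reported errors.
{
  echo "| environment | configs | errors |"
  echo "|---|---|---|"
  awk -F, '{ configs[$1]++; errors[$1] += $3 }
           END { for (env in configs)
                   printf "| %s | %d | %d |\n", env, configs[env], errors[env] }' \
    out/report-*.csv | sort
} > SUMMARY.md

cat SUMMARY.md
```

Because the partition outputs are structured, the reduce is a few lines of awk rather than ad hoc text scraping.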