MP-301f · Module 2

Data Lineage & Classification

Data lineage tracks where data came from, how it was transformed, and where it went. In an MCP federation, lineage answers questions like: "The AI told a customer their contract expires on March 15 — which system did that date come from, and who last updated it?" Without lineage, you cannot audit AI-generated outputs, investigate errors, or satisfy regulatory inquiries. Every MCP resource response should carry lineage metadata — source system, source record ID, timestamp of the source read, and any transformations applied.

Data classification labels each resource and field by sensitivity level. A common four-tier scheme is Public (anyone can see), Internal (employees only), Confidential (need-to-know basis), and Restricted (regulated data such as PII, PHI, and payment card data). Classification drives access control, masking, and retention decisions. The MCP server applies classification at two levels: the resource level (this entire resource is Confidential) and the field level (the SSN field within this resource is Restricted, everything else is Internal).
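The interaction between resource-level and field-level labels can be sketched as follows. This is a minimal illustration, assuming the tiers are strictly ordered and that a field-level label overrides the resource default; `TIER_RANK`, `effectiveClassification`, and `canRead` are hypothetical names, not part of any MCP SDK.

```typescript
type Classification = "public" | "internal" | "confidential" | "restricted";

// Assumption: tiers form a strict order from least to most sensitive.
const TIER_RANK: Record<Classification, number> = {
  public: 0,
  internal: 1,
  confidential: 2,
  restricted: 3,
};

// A field-level label, when present, overrides the resource-level default.
function effectiveClassification(
  field: string,
  resourceLevel: Classification,
  fieldLevel: Record<string, Classification> = {}
): Classification {
  const labels = new Map(Object.entries(fieldLevel));
  return labels.get(field) ?? resourceLevel;
}

// Assumption: a caller may read data at or below their own clearance tier.
function canRead(clearance: Classification, label: Classification): boolean {
  return TIER_RANK[clearance] >= TIER_RANK[label];
}

// Resource is Internal overall, but its SSN field is Restricted.
const fieldLevel: Record<string, Classification> = { ssn: "restricted" };
const ssnLabel = effectiveClassification("ssn", "internal", fieldLevel);
const nameLabel = effectiveClassification("name", "internal", fieldLevel);

console.log(ssnLabel, canRead("confidential", ssnLabel));   // restricted false
console.log(nameLabel, canRead("confidential", nameLabel)); // internal true
```

A real deployment might not use a strict linear order (for example, PHI and PCI data can have different audiences), so treat the numeric ranking as a simplification.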

Lineage and classification work together to create a complete governance picture. When an AI model reads a Restricted resource, the lineage log records who accessed what, when, and through which MCP server. When an auditor asks "did anyone access PII through the AI system last quarter?", the classification tags identify the relevant resources and the lineage logs identify the access events. This combination supplies the access-logging and audit-trail evidence that SOC 2, HIPAA, and GDPR assessments typically ask for; it supports compliance but does not by itself guarantee it.
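The auditor's question above can be answered with a simple filter over the access log. The sketch below assumes each log entry carries a classification tag and an ISO 8601 timestamp; the `AuditEntry` shape, the `restrictedReads` helper, and the sample rows are hypothetical and for illustration only.

```typescript
type Classification = "public" | "internal" | "confidential" | "restricted";

// Hypothetical audit entry shape: one row per resource read.
interface AuditEntry {
  event: "resource_read";
  source: string;
  recordId: string;
  classification: Classification;
  timestamp: string; // ISO 8601, UTC
  userId: string;
}

// Return Restricted-tier reads whose timestamp falls in [from, to).
function restrictedReads(log: AuditEntry[], from: Date, to: Date): AuditEntry[] {
  return log.filter((entry) => {
    const at = new Date(entry.timestamp).getTime();
    return (
      entry.classification === "restricted" &&
      at >= from.getTime() &&
      at < to.getTime()
    );
  });
}

// Illustrative data only; not real access records.
const sampleLog: AuditEntry[] = [
  { event: "resource_read", source: "salesforce", recordId: "A1",
    classification: "restricted", timestamp: "2025-02-10T09:30:00Z", userId: "u-17" },
  { event: "resource_read", source: "salesforce", recordId: "A2",
    classification: "internal", timestamp: "2025-02-11T14:00:00Z", userId: "u-17" },
  { event: "resource_read", source: "servicenow", recordId: "B7",
    classification: "restricted", timestamp: "2025-04-02T08:00:00Z", userId: "u-42" },
];

// "Last quarter" here is Q1 2025: January 1 inclusive to April 1 exclusive.
const q1Reads = restrictedReads(
  sampleLog,
  new Date("2025-01-01T00:00:00Z"),
  new Date("2025-04-01T00:00:00Z")
);
console.log(q1Reads.map((e) => `${e.userId} read ${e.source}/${e.recordId}`));
// [ 'u-17 read salesforce/A1' ]
```

Note that this filter only sees the resource-level tag. If a Restricted field can sit inside a resource labeled Confidential or lower, the audit entry would also need to record the highest field-level tier that was returned, or such reads would be missed.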

import { ResourceTemplate } from "@modelcontextprotocol/sdk/server/mcp.js";

type Classification = "public" | "internal" | "confidential" | "restricted";

interface LineageMetadata {
  source: string;           // "salesforce", "servicenow"
  sourceRecordId: string;   // Native record ID
  readTimestamp: string;    // ISO timestamp of the source read
  transformations: string[]; // ["field_mapping", "pii_masking"]
  classification: Classification;
  fieldClassifications?: Record<string, Classification>;
}

// Attach lineage to every resource response
function withLineage<T>(
  data: T,
  lineage: LineageMetadata
): { data: T; _lineage: LineageMetadata } {
  // Log the access for the audit trail.
  // auditLog and getCurrentUserId are assumed to be provided by the application.
  auditLog.write({
    event: "resource_read",
    source: lineage.source,
    recordId: lineage.sourceRecordId,
    classification: lineage.classification,
    timestamp: new Date().toISOString(),
    userId: getCurrentUserId(),
  });

  return { data, _lineage: lineage };
}

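// Usage in a resource handler
server.resource("customer",
  new ResourceTemplate("sf://customers/{id}", { list: undefined }),
  async (uri, { id }) => {
    // The id comes from the client-supplied URI, so validate it before
    // interpolating it into the query (Salesforce IDs are 15 or 18 alphanumerics).
    const recordId = Array.isArray(id) ? id[0] : id;
    if (!/^[a-zA-Z0-9]{15}([a-zA-Z0-9]{3})?$/.test(recordId)) {
      throw new Error("Invalid Salesforce record ID");
    }
    const raw = await salesforce.query(`SELECT ... FROM Account WHERE Id = '${recordId}'`);
    const readTimestamp = new Date().toISOString(); // captured right after the source read
    const masked = applyMasking(raw, "confidential");
    const response = withLineage(masked, {
      source: "salesforce",
      sourceRecordId: recordId,
      readTimestamp,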
      transformations: ["field_mapping", "pii_masking"],
      classification: "confidential",
      fieldClassifications: {
        ssn: "restricted",
        email: "confidential",
        name: "internal",
        industry: "public",
      },
    });
    return {
      contents: [{ uri: uri.href, mimeType: "application/json",
        text: JSON.stringify(response) }],
    };
  }
);
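The handler above calls `applyMasking` without showing it. One possible shape for such a helper is sketched below, assuming masking means replacing any field whose label is more sensitive than the caller's maximum tier with a fixed placeholder. The name `maskFields`, its parameters, and the `"[REDACTED]"` placeholder are assumptions for illustration, not the course's actual implementation.

```typescript
type Classification = "public" | "internal" | "confidential" | "restricted";

// Assumption: tiers are strictly ordered from least to most sensitive.
const RANK: Record<Classification, number> = {
  public: 0,
  internal: 1,
  confidential: 2,
  restricted: 3,
};

// Replace every field whose label exceeds maxTier with a placeholder.
// Fields without a label fall back to defaultTier.
function maskFields(
  record: Record<string, unknown>,
  labels: Record<string, Classification>,
  maxTier: Classification,
  defaultTier: Classification
): Record<string, unknown> {
  const labelMap = new Map(Object.entries(labels));
  const out: Record<string, unknown> = {};
  for (const [field, value] of Object.entries(record)) {
    const label = labelMap.get(field) ?? defaultTier;
    out[field] = RANK[label] > RANK[maxTier] ? "[REDACTED]" : value;
  }
  return out;
}

// Illustrative record; the values are made up.
const account = {
  name: "Acme Corp",
  email: "ops@acme.example",
  ssn: "000-00-0000",
  industry: "Manufacturing",
};
const accountLabels: Record<string, Classification> = {
  ssn: "restricted",
  email: "confidential",
  name: "internal",
  industry: "public",
};

// Allow up to Confidential: only the Restricted ssn field is masked.
const maskedAccount = maskFields(account, accountLabels, "confidential", "confidential");
console.log(maskedAccount);
```

Falling back to a default tier for unlabeled fields is a design choice; a stricter variant could treat every unlabeled field as Restricted so that newly added fields are masked until someone classifies them.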

1. Define classification tiers. Establish a three- or four-tier classification scheme (Public, Internal, Confidential, Restricted). Map each tier to access control rules, masking requirements, and retention periods.
2. Classify every resource and field. Walk through each MCP resource and assign classifications at the resource and field level. Start with resources that expose customer data, financial data, or authentication credentials.
3. Attach lineage to responses. Add a _lineage metadata block to every resource response with source, source record ID, read timestamp, transformations, and classification. Log every access to an append-only audit store.
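When classifying every resource and field, a small coverage check can help catch fields that nobody has labeled yet. The sketch below compares a sample record against its field-level labels; `findUnclassifiedFields` and the sample data are hypothetical, and the check assumes a flat record with no nested objects.

```typescript
type Classification = "public" | "internal" | "confidential" | "restricted";

// List fields present in a sample record that have no field-level label yet.
function findUnclassifiedFields(
  sample: Record<string, unknown>,
  labels: Record<string, Classification>
): string[] {
  const labelled = new Set(Object.keys(labels));
  return Object.keys(sample)
    .filter((field) => !labelled.has(field))
    .sort();
}

// Illustrative record and labels; "phone" has not been classified.
const sampleCustomer = {
  name: "Acme Corp",
  email: "ops@acme.example",
  phone: "555-0100",
  industry: "Manufacturing",
};
const customerLabels: Record<string, Classification> = {
  name: "internal",
  email: "confidential",
  industry: "public",
};

console.log(findUnclassifiedFields(sampleCustomer, customerLabels)); // [ 'phone' ]
```

Running a check like this in CI against representative records is one way to keep classification coverage from drifting as source schemas change.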