AS-301i · Module 2

Model Behavior Reconstruction

4 min read

The central question in AI forensics is: why did the model produce this output? The model itself cannot explain its reasoning — it produces plausible-sounding explanations, not factual accounts of its processing. Behavior reconstruction works from the observable evidence: the input, the context, the configuration, and the output. By analyzing what went in and what came out, the forensic analyst reconstructs the most likely explanation for the model's behavior without relying on the model's own account.

  1. Input-Output Mapping Lay out the complete input — system prompt, conversation history, retrieved context, and user message — alongside the model's output. Identify which elements of the input influenced which elements of the output. If the output contains information from a retrieved document, the retrieval is part of the behavior chain. If the output follows instructions from the user message that conflict with the system prompt, the injection succeeded.
  2. Counterfactual Analysis To understand why the model behaved as it did, construct counterfactual scenarios: what would the model have produced without the suspected injection? Without the retrieved document? Without the conversation history? Counterfactual analysis isolates the specific input element that caused the anomalous behavior. [RECOMMEND]: Use a clean model instance for counterfactual testing — do not reuse the potentially compromised instance.
  3. Configuration Analysis Compare the model's configuration during the incident — system prompt version, temperature setting, tool permissions, guardrail configuration — against the documented baseline. A configuration change that coincides with the incident may be the root cause. Unauthorized prompt modifications, relaxed guardrail settings, or expanded tool permissions explain behavioral changes without requiring an external attack.