GenAI Foundations / Advanced Track Module 13 / 15
GenAI Foundations Advanced ⏱ 45 min
DEV · QA · BA · PM

AI Governance: Guardrails, Prompt-Leak Defense, and Oversight

Implement governance controls that prevent data leaks, unsafe actions, and silent policy violations in agentic systems.

How to Use This Lesson

  • Start with the user problem, then map the pattern to architecture and failure modes.
  • If a code or design example is included, change one assumption and reason through the impact.
  • Use role callouts, checklists, and Q&A sections as implementation or interview prep notes.

Prerequisites: advanced/04-security-prompt-injection, advanced/12-agent-evaluation-harness-trace-grading

Governance Is a Runtime Architecture

AI governance is not a paragraph in the system prompt. It is the combination of policy, controls, evidence, accountability, and review. For enterprise agents, governance must be enforced before input reaches the model, before tools execute, before output leaves the system, and after incidents occur.

Defense-in-Depth Guardrails

flowchart LR
I[Input] --> P1[Input Policy Filter]
P1 --> R[Runtime Policy Engine]
R --> M[Model Call]
M --> T[Tool Call Gate]
T --> O[Output Safety Filter]
O --> A[Audit and Evidence Store]
A --> G[Governance Review]
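The staged flow above can be sketched as a small pipeline in which every stage can block the request and every decision is written to an audit log. This is a minimal illustration, not a reference design: `GuardedPipeline`, the check functions, and the reason codes are hypothetical names, and the tool-call gate is left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A check returns None to pass, or a short reason code to block.
Check = Callable[[str], Optional[str]]


@dataclass
class GuardedPipeline:
    input_checks: List[Check]
    output_checks: List[Check]
    model: Callable[[str], str]  # stand-in for the real model call
    audit_log: List[Dict[str, Optional[str]]] = field(default_factory=list)

    def _run_checks(self, stage: str, checks: List[Check], text: str) -> Optional[str]:
        """Run checks in order; record one audit entry for the stage."""
        for check in checks:
            reason = check(text)
            if reason is not None:
                self.audit_log.append({"stage": stage, "decision": "deny", "reason": reason})
                return reason
        self.audit_log.append({"stage": stage, "decision": "allow", "reason": None})
        return None

    def handle(self, user_input: str) -> str:
        # Enforce policy before the input reaches the model.
        if self._run_checks("input", self.input_checks, user_input) is not None:
            return "Request blocked by input policy."
        draft = self.model(user_input)
        # Enforce policy again before the output leaves the system.
        if self._run_checks("output", self.output_checks, draft) is not None:
            return "Response withheld by output policy."
        return draft


def block_injection(text: str) -> Optional[str]:
    """Hypothetical input check based on a single phrase."""
    if "ignore previous instructions" in text.lower():
        return "prompt_injection_suspected"
    return None


def block_key_material(text: str) -> Optional[str]:
    """Hypothetical output check for an obvious secret marker."""
    return "secret_in_output" if "BEGIN PRIVATE KEY" in text else None


def echo_model(prompt: str) -> str:
    """Placeholder model used only for this sketch."""
    return f"echo: {prompt}"
```

The audit log here stores only the stage, decision, and reason code, which is the kind of evidence a governance review can consume without retaining the raw prompt.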

Governance Frameworks to Know

Interview-ready answers should reference practical frameworks without turning the answer into legal advice:

| Framework | Why it matters |
|---|---|
| NIST AI RMF | Risk management organized around four functions: Govern, Map, Measure, Manage |
| ISO/IEC 42001 | AI management system expectations |
| EU AI Act | Risk-based controls for AI systems in the EU |
| SOC 2 / ISO 27001 | Security and operational controls around AI systems |
| OWASP LLM Top 10 | Common LLM application security failure modes |

Use these to structure product requirements: risk classification, documentation, human oversight, monitoring, incident response, and change management.

Policy-as-Code

Put non-negotiable rules in deterministic code. The model can explain and reason, but the runtime decides whether an action is allowed.

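// Request the agent runtime submits before any tool executes.
type ToolRequest = {
  actorId: string;
  tenantId: string;
  tool: string;
  args: Record<string, unknown>;
  dataClasses: Array<"public" | "internal" | "pii" | "secret" | "regulated">;
};

// Outcome the runtime enforces; the model cannot override it.
type PolicyDecision =
  | { decision: "allow" }
  | { decision: "deny"; reason: string }
  | { decision: "approval_required"; reason: string; approverGroup: string };

export function decide(req: ToolRequest): PolicyDecision {
  // Secrets never enter the LLM path, whatever the tool.
  if (req.dataClasses.includes("secret")) {
    return { decision: "deny", reason: "secret_data_not_allowed_in_llm_path" };
  }

  if (req.tool === "refund.issue") {
    const amountUsd = req.args.amountUsd;
    // Fail closed: Number(undefined) is NaN, and NaN > 500 is false, so a
    // missing or malformed amount would otherwise fall through to "allow".
    if (typeof amountUsd !== "number" || !Number.isFinite(amountUsd) || amountUsd < 0) {
      return { decision: "deny", reason: "invalid_refund_amount" };
    }
    if (amountUsd > 500) {
      return {
        decision: "approval_required",
        reason: "high_value_refund",
        approverGroup: "finance_ops"
      };
    }
  }

  // Any destructive tool needs a human approver.
  if (req.tool.endsWith(".delete")) {
    return { decision: "approval_required", reason: "destructive_action", approverGroup: "admin" };
  }

  return { decision: "allow" };
}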
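A policy decision only matters if the runtime refuses to execute without it. The sketch below shows one way a tool-call gate might enforce the three outcomes, written in Python to match the lesson's other runtime example. It mirrors the rules above and, as an assumption of this sketch, also denies refunds whose amount is missing or not a number. `execute_tool`, `approved_ids`, `PolicyDenied`, and `ApprovalPending` are hypothetical names.

```python
import math
from typing import Any, Callable, Dict, List, Set


class PolicyDenied(Exception):
    """Raised when policy forbids the tool call outright."""


class ApprovalPending(Exception):
    """Raised when human approval is required but not yet recorded."""


def decide(tool: str, args: Dict[str, Any], data_classes: List[str]) -> Dict[str, str]:
    """Illustrative Python mirror of the policy rules."""
    if "secret" in data_classes:
        return {"decision": "deny", "reason": "secret_data_not_allowed_in_llm_path"}
    if tool == "refund.issue":
        amount = args.get("amountUsd")
        valid = (
            isinstance(amount, (int, float))
            and not isinstance(amount, bool)
            and math.isfinite(amount)
            and amount >= 0
        )
        if not valid:
            # Assumption of this sketch: malformed amounts fail closed.
            return {"decision": "deny", "reason": "invalid_refund_amount"}
        if amount > 500:
            return {
                "decision": "approval_required",
                "reason": "high_value_refund",
                "approverGroup": "finance_ops",
            }
    if tool.endswith(".delete"):
        return {
            "decision": "approval_required",
            "reason": "destructive_action",
            "approverGroup": "admin",
        }
    return {"decision": "allow"}


def execute_tool(
    request_id: str,
    tool: str,
    args: Dict[str, Any],
    data_classes: List[str],
    handlers: Dict[str, Callable[..., Any]],
    approved_ids: Set[str],
) -> Any:
    """Run the tool handler only if policy allows it or approval is on record."""
    verdict = decide(tool, args, data_classes)
    if verdict["decision"] == "deny":
        raise PolicyDenied(verdict["reason"])
    if verdict["decision"] == "approval_required" and request_id not in approved_ids:
        raise ApprovalPending(f"{verdict['reason']}:{verdict['approverGroup']}")
    return handlers[tool](**args)
```

In this sketch the handler is unreachable except through `execute_tool`, so the model's output can request an action but cannot perform one. A real system would also verify who recorded the approval and that it matches the approver group.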

Prompt-Leak Defense

Prompt leaks happen when users or retrieved documents coax the model into revealing system instructions, hidden policies, credentials, or internal chain-of-thought. Good defenses are layered:

  • Never put secrets in prompts.
  • Keep system prompts short and non-sensitive.
  • Treat retrieved documents as untrusted input: any instructions embedded in them are data, not commands.
  • Use output filters for prompt disclosure patterns.
  • Store sensitive policy in code or server-side configuration, not natural language prompts.
  • Return concise reasoning summaries instead of hidden chain-of-thought.
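
from __future__ import annotations  # keeps the type hints below valid on Python 3.7-3.9

# Lowercase phrases suggesting the response is echoing hidden instructions
# or going along with an extraction attempt.
LEAK_PATTERNS = [
    "system prompt",
    "developer message",
    "hidden instructions",
    "ignore previous instructions",
    "print your policy",
]

def screen_output(text: str) -> tuple[bool, str | None]:
    """Return (ok, reason); ok is False when a leak pattern appears in the text."""
    lower = text.lower()
    for pattern in LEAK_PATTERNS:
        if pattern in lower:
            return False, f"possible_prompt_leak:{pattern}"
    return True, None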

Output screening is not sufficient by itself, but it catches common failures and creates evidence for tuning.
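Phrase lists miss leaks that never mention the words "system prompt". One complementary check, sketched here as an assumption rather than something the lesson prescribes, is to flag any response that repeats a long run of words from the system prompt verbatim. The function name `leaks_prompt` and the default window of 8 words are illustrative choices.

```python
def leaks_prompt(output: str, system_prompt: str, window: int = 8) -> bool:
    """Return True if `output` repeats `window` consecutive words of the prompt.

    Comparison is case-insensitive and whitespace-normalized. Prompts shorter
    than `window` words are never flagged by this check.
    """
    prompt_words = system_prompt.lower().split()
    # Pad with spaces so matches always align on whole words.
    normalized_output = " " + " ".join(output.lower().split()) + " "
    for start in range(len(prompt_words) - window + 1):
        chunk = " " + " ".join(prompt_words[start:start + window]) + " "
        if chunk in normalized_output:
            return True
    return False
```

A smaller window catches shorter quotes but raises false positives on common phrasing, and paraphrased or translated leaks pass this check entirely, so it belongs alongside the other layers rather than in place of them.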

Guardrail Placement

| Layer | Example control |
|---|---|
| Input | Prompt-injection classifier, PII detector, file type allowlist |
| Retrieval | Source trust ranking, document sanitization, tenant filtering |
| Planning | Policy-aware tool selection and approval prediction |
| Tool execution | Authz, schema validation, idempotency, rate limits |
| Output | PII redaction, citation checks, refusal templates |
| Monitoring | Drift alerts, incident review, audit exports |
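As one concrete instance of an output-layer control, the sketch below redacts two PII shapes with regular expressions and reports how many of each it found. The patterns are deliberately simple and are assumptions of this example: they cover common email addresses and US-style SSNs only, and a production detector would need broader coverage and evaluation.

```python
import re
from typing import Dict, Tuple

# Illustrative patterns only; real PII detection needs far more coverage.
PII_PATTERNS = (
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
)


def redact_pii(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace matches with labeled placeholders and count them per label."""
    counts: Dict[str, int] = {}
    for label, pattern in PII_PATTERNS:
        text, found = pattern.subn(f"[REDACTED_{label}]", text)
        if found:
            counts[label] = found
    return text, counts
```

Returning the counts alongside the redacted text lets the monitoring layer track redaction rates without ever logging the values that were removed.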

OWASP LLM Top 10 Mapping

Common enterprise risks include prompt injection, sensitive information disclosure, insecure output handling, excessive agency, overreliance, vector-store poisoning, and supply-chain risk. Map each risk to a control and an eval case.

risk_register:
  - risk: prompt_injection_indirect
    control: retrieval_sanitization_and_instruction_hierarchy
    eval_suite: evals/security/indirect_injection.yaml
  - risk: excessive_agency
    control: policy_engine_and_human_approval
    eval_suite: evals/security/high_risk_tools.yaml
  - risk: pii_leakage
    control: data_classification_and_output_redaction
    eval_suite: evals/security/pii_redaction.yaml
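A register like this is only useful if every entry stays complete. The check below is a small sketch of a CI-style lint that reports risks lacking a control or an eval suite. It operates on a list of dictionaries, which is what the YAML above would typically parse into, and `find_gaps` is a hypothetical helper name.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("risk", "control", "eval_suite")


def find_gaps(register: List[Dict[str, Any]]) -> List[str]:
    """Return one message per missing or empty required field."""
    gaps: List[str] = []
    for index, entry in enumerate(register):
        # Fall back to a positional name when the risk itself is missing.
        name = entry.get("risk") or f"entry_{index}"
        for field_name in REQUIRED_FIELDS:
            value = entry.get(field_name)
            if not isinstance(value, str) or not value.strip():
                gaps.append(f"{name}: missing {field_name}")
    return gaps


RISK_REGISTER = [
    {
        "risk": "prompt_injection_indirect",
        "control": "retrieval_sanitization_and_instruction_hierarchy",
        "eval_suite": "evals/security/indirect_injection.yaml",
    },
    {
        "risk": "excessive_agency",
        "control": "policy_engine_and_human_approval",
        "eval_suite": "evals/security/high_risk_tools.yaml",
    },
    {
        "risk": "pii_leakage",
        "control": "data_classification_and_output_redaction",
        "eval_suite": "evals/security/pii_redaction.yaml",
    },
]
```

Failing the build when `find_gaps` returns anything keeps a newly added risk from shipping without both a control and a test for it.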

Governance Evidence

For audits and incident response, retain evidence without retaining unnecessary sensitive content:

  • Prompt template version and model version.
  • Tool name, risk tier, and approver (when approval was required).
  • Policy decision and reason code.
  • Eval suite version that approved the release.
  • Redacted trace IDs and incident links.
  • Data classification labels, not raw secrets.
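The list above can be captured as a single audit record that carries versions, decisions, and labels but no raw payload. The sketch below stores a SHA-256 fingerprint of the tool arguments instead of the arguments themselves. The field names and the helpers `fingerprint` and `to_audit_json` are assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class EvidenceRecord:
    trace_id: str
    prompt_template_version: str
    model_version: str
    eval_suite_version: str
    tool: str
    risk_tier: str
    decision: str
    reason: Optional[str]
    approver: Optional[str]
    data_classes: List[str]  # labels such as "pii", never the values
    args_sha256: str         # fingerprint of the arguments, not the arguments


def fingerprint(args: Dict[str, Any]) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def to_audit_json(record: EvidenceRecord) -> str:
    """Serialize the record for an append-only audit store."""
    return json.dumps(asdict(record), sort_keys=True)
```

A digest lets reviewers confirm that two records refer to the same arguments without storing them. Low-entropy values can still be guessed from an unsalted hash, so a keyed hash may be a better fit where that matters.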

⚙️ For Developers

Build guardrails as runtime middleware and policy services. Prompts can describe policy, but code must enforce policy.

🧪 For QA Engineers

Maintain adversarial suites for prompt leaks, cross-tenant data access, indirect prompt injection, unsafe tool calls, and output redaction failures.

🎯 For Product Managers

Define critical-action taxonomies with legal, compliance, and operations before launch. Governance failures are product failures.

Production Gotcha

If governance can be disabled by a feature flag on high-risk paths, delivery pressure will eventually bypass it. Make core controls non-bypassable.
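One way to make a control non-bypassable is to let the feature flag apply only to an explicit allowlist of low-risk tiers and to ignore it everywhere else. The sketch below assumes hypothetical tier names and a flag called `guardrails_enabled`; unknown tiers are treated as high risk so that a typo cannot switch enforcement off.

```python
from typing import Mapping

# Only these tiers may have guardrails toggled by a flag (assumed tier names).
FLAG_CONTROLLED_TIERS = frozenset({"low"})


def guardrails_enabled(risk_tier: str, flags: Mapping[str, bool]) -> bool:
    """Return whether guardrails run for this path.

    High-risk and unrecognized tiers always return True; the flag is only
    consulted for tiers explicitly listed as flag-controlled.
    """
    if risk_tier not in FLAG_CONTROLLED_TIERS:
        return True
    return bool(flags.get("guardrails_enabled", True))
```

With this shape, turning the flag off during an incident or a launch crunch changes behavior on low-risk paths only, and removing enforcement from a high-risk path requires a reviewed code change.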

Interview Practice

  1. Why is AI governance more than a system prompt?
  2. How would you map OWASP LLM risks to concrete runtime controls?
  3. What belongs in policy-as-code instead of prompt instructions?
  4. How do you defend against prompt leaks without storing secrets in prompts?
  5. What governance evidence should be retained for an audit?
  6. How should human approval integrate with guardrails?
  7. What is excessive agency, and how do you constrain it?
  8. How do frameworks like NIST AI RMF or ISO 42001 influence product requirements?