Field Guide
AI & Governance

Chapter 13: Beyond the Hype — Practical AI in Legal

Generative versus Agentic AI, the RAG architecture for building closed-loop data ecosystems that maximise accuracy, and the practical deployment of AI in 2026 legal operations.

The AI Landscape in 2026

Generative AI is table stakes. By 2026, every major legal technology vendor has embedded large language model (LLM) capabilities into its platform. Document summarisation, clause extraction, first-draft generation, and natural language search are commodity features. The conversation has moved well beyond "whether to adopt AI" — the strategic question is now how to deploy it for maximum operational impact.

The cutting edge has moved to Agentic AI — AI systems that do not merely generate content in response to a prompt but autonomously execute multi-step workflows, make routing decisions, interact with external systems, and adapt their approach based on intermediate results. Understanding the distinction between generative and agentic AI, and deploying each appropriately, is the defining competency of Legal Ops in the intelligence era.

Generative vs. Agentic AI

Generative AI: The Content Engine

Generative AI takes an input (a prompt, a document, a dataset) and produces an output (a summary, a draft, a classification). The interaction is single-turn: human provides input, AI produces output, human evaluates. Every instance requires human initiation and human review.

Legal applications of generative AI (mature, widely deployed in 2026):

  • Document summarisation: Condensing lengthy contracts, judgments, or regulatory filings into structured summaries highlighting key terms, obligations, and risks
  • First-draft generation: Producing initial drafts of standard agreements, memos, and correspondence based on templates, playbooks, and matter context
  • Clause extraction and classification: Identifying and categorising specific clauses across a portfolio of contracts for analytics, migration, or compliance review
  • Research assistance: Synthesising legal research across jurisdictions, identifying relevant precedents, and drafting preliminary legal analysis
  • Translation and localisation: Converting legal documents between languages with domain-specific accuracy

Agentic AI: The Workflow Orchestrator

Agentic AI operates autonomously across multiple steps, making decisions at each stage based on predefined rules, contextual data, and intermediate results. The interaction is multi-turn and autonomous: human defines the objective and constraints, the AI executes a sequence of actions, and human actively oversees execution at defined checkpoints and supports resolution of edge cases.

Legal applications of agentic AI (emerging, high-impact in 2026):

  • End-to-end NDA processing: An AI agent receives an incoming NDA, compares it against the organisation's playbook, classifies it by risk tier, generates redlines for non-standard terms, routes it for the appropriate level of review (or auto-approves if within parameters), and sends the response to the counterparty — with human review only for Amber and Red classifications
  • Regulatory change management: An agent monitors regulatory sources across defined jurisdictions, identifies changes relevant to the organisation, assesses the impact against existing policies and contracts, generates a preliminary impact report, and routes it to the responsible lawyer for review and action
  • Due diligence orchestration: An agent ingests a data room, classifies documents by type, extracts material terms from each category using specialised extraction models, flags anomalies and risks, compiles a preliminary due diligence report, and identifies items requiring human review
  • Compliance monitoring: An agent continuously scans contract obligations, regulatory deadlines, and policy requirements, identifies items approaching their due date, escalates overdue items, and generates compliance status reports
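The NDA-processing workflow above can be sketched as a rule-driven agent loop. This is an illustrative skeleton, not a production system: the playbook comparison is reduced to a toy set difference, and the clause names are hypothetical.

```python
# Illustrative sketch of an agentic NDA triage loop. The playbook check
# is reduced to a toy deviation count; a real agent would call clause-
# comparison and redlining services at each step.

STANDARD_CLAUSES = {"mutual confidentiality", "2-year term", "no assignment"}

def classify(nda_clauses: set[str]) -> str:
    """Tier the NDA by how far it deviates from the playbook."""
    deviations = nda_clauses - STANDARD_CLAUSES
    if not deviations:
        return "Green"   # standard — auto-process
    if len(deviations) <= 2:
        return "Amber"   # minor deviations — junior counsel review
    return "Red"         # material issues — senior counsel escalation

def process_nda(nda_clauses: set[str]) -> dict:
    tier = classify(nda_clauses)
    actions = {
        "Green": "auto-approve and respond to counterparty",
        "Amber": "generate redlines, route to junior counsel",
        "Red": "escalate to senior counsel",
    }
    return {"tier": tier, "next_action": actions[tier]}

print(process_nda({"mutual confidentiality", "2-year term", "no assignment"}))
```

The point of the sketch is the shape of the loop — classify, decide, route — with the human entering only on the Amber and Red paths.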
Dimension | Generative AI | Agentic AI
Interaction model | Single-turn: prompt → response | Multi-turn: objective → autonomous execution
Human involvement | Required at every step | Required at defined checkpoints
Decision-making | Human decides; AI produces content | AI decides within parameters; human oversees
Error profile | Hallucination in output content | Incorrect routing, missed edge cases, cascading errors
Data requirement | Input document(s) + prompt | Normalised data ecosystem + integration layer + rules engine
Maturity in legal (2026) | Mature, widely deployed | Emerging, high-potential, requires careful governance

Warning

Agentic AI amplifies both capability and risk. An agent that processes 500 NDAs per month with 98% accuracy also makes 10 errors per month at production speed — errors that propagate downstream if the governance framework does not catch them. The higher the autonomy, the more critical the Human-in-the-Loop (HITL) checkpoints described in Chapter 14.

The RAG Architecture: Building the Closed-Loop Ecosystem

The Hallucination Problem

A critical challenge with LLMs in legal contexts is hallucination — the generation of plausible but factually incorrect content. An LLM asked to summarise a contract may "invent" a clause that does not exist. An LLM asked about a regulatory requirement may confidently cite a provision that was repealed three years ago. In legal work, where accuracy determines outcomes, hallucination requires active mitigation through architectural design.

Retrieval-Augmented Generation (RAG) is the architectural pattern that addresses hallucination by constraining the AI's responses to information retrieved from a curated, authoritative data source.

How RAG Works

The RAG architecture has three components:

1. The Knowledge Base (Retrieval Source). A curated corpus of authoritative documents — the organisation's contracts, playbooks, policies, templates, legal memos, and regulatory texts. These documents are processed (chunked, embedded, and indexed) into a vector database that enables semantic search.

2. The Retrieval Engine. When a user poses a query, the retrieval engine searches the knowledge base for the most relevant document chunks. Relevance is determined by semantic similarity — the engine finds content that is conceptually related to the query, not just keyword matches.

3. The Generation Model. The LLM receives the user's query along with the retrieved document chunks as context. It generates its response based on this retrieved context rather than its general training data. Critically, the response includes citations — references to the specific source documents that informed the answer.
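The three components can be sketched end to end. This is a minimal, self-contained illustration: word-overlap scoring stands in for a vector database, the "generation" step merely assembles retrieved context, and the document IDs are hypothetical. A real deployment would use embeddings and an LLM call at those two points.

```python
# Minimal RAG sketch: naive word-overlap retrieval stands in for semantic
# search over a vector database, and generation is constrained to the
# retrieved chunks, each cited back to its source document.

KNOWLEDGE_BASE = {
    "playbook-4.2": "Limitation of liability is capped at 12 months of fees.",
    "playbook-7.1": "Governing law must be England and Wales.",
    "policy-DP-3":  "Personal data may not be transferred outside the EEA.",
}

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Rank knowledge-base chunks by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer(query: str) -> dict:
    chunks = retrieve(query)
    # The response is built only from retrieved context, and every chunk
    # used is cited — the closed-loop principle in miniature.
    return {
        "context": [text for _, text in chunks],
        "citations": [doc_id for doc_id, _ in chunks],
    }

result = answer("what is the cap on limitation of liability")
print(result["citations"])
```

Swapping the overlap scorer for embedding similarity and the assembly step for a context-constrained LLM prompt yields the production pattern; the control flow stays the same.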

The Closed-Loop Principle

The "closed loop" means the AI draws exclusively on data that the legal team has validated and curated. The system references only clauses that exist in the contract repository, only regulations in the regulatory database, and only negotiation positions in the playbook. This constraint is the entire value proposition of RAG for legal: bounded accuracy within trusted data sources.

The system's quality directly reflects its knowledge base. A RAG system with a comprehensive, current knowledge base delivers complete, accurate answers. A RAG system with gaps in its data will reveal those gaps accurately. This is why the data normalisation work in Chapter 12 is the prerequisite for RAG deployment — building the knowledge base directly builds the system's reliability.

RAG Implementation Priorities

Priority 1: Define the knowledge base scope. What documents should the RAG system draw from? Start narrow — a playbook, a policy set, a contract portfolio — and expand as the system proves reliable. A narrow, high-quality knowledge base outperforms a broad, inconsistent one.

Priority 2: Invest in chunking and embedding quality. How documents are divided into chunks and how those chunks are numerically represented (embedded) determines retrieval accuracy. Poor chunking (splitting a clause mid-sentence, separating a definition from its context) produces poor retrieval. This is a technical investment that has outsized impact on system quality.
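One way to avoid splitting a clause mid-sentence is to chunk on the document's own structure rather than at fixed character offsets. The sketch below assumes contracts with numbered clause headings (e.g. "2.1 ") at line starts — an assumption about formatting, not a universal rule.

```python
import re

# Clause-aware chunking sketch: split on numbered clause headings
# (e.g. "2.1 ") rather than at fixed offsets, so a clause is never
# severed mid-sentence. The heading pattern is an assumed convention.

CLAUSE_HEADING = re.compile(r"(?m)^(?=\d+(?:\.\d+)*\s)")

def chunk_by_clause(contract_text: str) -> list[str]:
    chunks = [c.strip() for c in CLAUSE_HEADING.split(contract_text)]
    return [c for c in chunks if c]

contract = """\
1 Definitions. "Fees" means the charges in Schedule 1.
2 Term. This Agreement runs for 24 months.
2.1 Renewal. It renews automatically unless notice is given.
"""
for clause in chunk_by_clause(contract):
    print(clause)
```

Each chunk now carries a whole clause, so a definition is embedded together with its context rather than separated from it.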

Priority 3: Build evaluation and feedback loops. Every RAG deployment should include a mechanism for users to rate response quality and flag errors. This feedback drives continuous improvement — identifying knowledge base gaps, retrieval failures, and generation issues that the team can address iteratively.

Priority 4: Implement citation requirements. Every AI-generated response must cite its source documents. This enables human reviewers to verify the response against the original source, catch retrieval errors, and maintain the accountability that legal work product demands.
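The citation requirement can be enforced mechanically before a response ever reaches a reviewer: reject any output that cites nothing, or that cites a document not in the knowledge base. The document IDs below are hypothetical.

```python
# Citation gate sketch: a response is accepted only if it cites at least
# one source and every cited document resolves to the knowledge base.
# Document IDs are illustrative.

KNOWN_SOURCES = {"playbook-4.2", "policy-DP-3", "template-NDA-v7"}

def passes_citation_gate(response: dict) -> bool:
    citations = response.get("citations", [])
    return bool(citations) and all(c in KNOWN_SOURCES for c in citations)

assert passes_citation_gate({"text": "...", "citations": ["playbook-4.2"]})
assert not passes_citation_gate({"text": "...", "citations": []})
assert not passes_citation_gate({"text": "...", "citations": ["made-up-doc"]})
```

A response that fails the gate is a retrieval or generation error by definition — it never needs a human to diagnose it.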

Strategic Insight

RAG shifts human review from generative work to verification work. Instead of "generate the answer from scratch," human reviewers focus on "verify the AI's answer against cited sources." Verification is a fundamentally more efficient cognitive task, and the shift unlocks significant leverage: a lawyer reviewing an AI-generated contract summary with cited clause references verifies accuracy in 5 minutes, instead of spending 45 minutes drafting the summary from scratch.

Practical Deployment: The AI Pilot Framework

Selecting the Pilot Use Case

The first AI deployment should meet four criteria:

High volume. The use case must involve enough transactions to generate meaningful data on AI performance. A use case that occurs three times per year does not produce sufficient volume for evaluation.

Low risk. Choose use cases where the consequences of an AI error can be contained. A flawed contract summary is caught and corrected at human review; an auto-executed contract with incorrect terms binds the organisation before anyone can intervene, so such use cases demand far higher proven accuracy before deployment.

Measurable baseline. The current manual process must have quantifiable performance metrics (cycle time, cost per unit, error rate) against which AI performance can be measured. Without a baseline, you cannot demonstrate improvement.

Enthusiastic users. The pilot user group must include individuals who are willing to use the tool and provide feedback. Engaging willing users creates momentum, generates valuable feedback, and builds organisational confidence in AI initiatives.
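The four criteria can be applied as a simple screen over candidate use cases. The thresholds below (minimum volume, minimum pilot group size) are illustrative assumptions, not prescriptions.

```python
# Simple screen for pilot candidates against the four criteria.
# Thresholds are illustrative, not prescriptive.

def is_pilot_ready(use_case: dict) -> bool:
    return (
        use_case["monthly_volume"] >= 50   # high volume
        and use_case["risk"] == "low"      # contained error consequences
        and use_case["has_baseline"]       # measurable manual baseline
        and use_case["willing_users"] >= 3 # enthusiastic pilot group
    )

nda_review = {"monthly_volume": 200, "risk": "low",
              "has_baseline": True, "willing_users": 5}
print(is_pilot_ready(nda_review))  # True
```

A use case that fails any one criterion is not disqualified forever — it simply names the foundation work (baseline measurement, risk containment, user recruitment) required first.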

The Pilot Evaluation Framework

Metric | Definition | Target
Accuracy | Percentage of AI outputs that require no human correction | >90% for initial pilot
Efficiency gain | Reduction in time per task compared to manual baseline | >40%
User satisfaction | Pilot user rating of tool usefulness and reliability | >7/10
Error severity | Classification of errors by consequence (cosmetic, material, critical) | Zero critical errors
Adoption rate | Percentage of eligible tasks processed through the AI tool | >70% by end of pilot
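Most of these metrics fall straight out of a task log. The record fields below (`used_ai`, `corrected`, `minutes`, `severity`) are an assumed logging schema, sketched to show the calculations rather than prescribe a format.

```python
# Computing pilot metrics from a log of task records. The record fields
# are an assumed logging schema; user satisfaction comes from surveys
# and is not derivable from the log.

def evaluate(records: list[dict], baseline_minutes: float) -> dict:
    ai_tasks = [r for r in records if r["used_ai"]]
    accuracy = sum(not r["corrected"] for r in ai_tasks) / len(ai_tasks)
    avg_minutes = sum(r["minutes"] for r in ai_tasks) / len(ai_tasks)
    return {
        "accuracy": accuracy,                                   # target > 0.90
        "efficiency_gain": 1 - avg_minutes / baseline_minutes,  # target > 0.40
        "critical_errors": sum(r["severity"] == "critical" for r in ai_tasks),
        "adoption_rate": len(ai_tasks) / len(records),          # target > 0.70
    }

log = [
    {"used_ai": True,  "corrected": False, "minutes": 10, "severity": None},
    {"used_ai": True,  "corrected": True,  "minutes": 15, "severity": "cosmetic"},
    {"used_ai": False, "corrected": False, "minutes": 45, "severity": None},
]
print(evaluate(log, baseline_minutes=45))
```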

Scaling from Pilot to Production

A successful pilot (meeting or exceeding targets across all five metrics) proceeds to production deployment through a phased rollout:

Phase 1: Expand user base. Extend access from the pilot group to the full target audience, with intensified training and support.

Phase 2: Expand scope. Add adjacent use cases that share the same knowledge base and architecture. If the pilot was NDA review, expand to standard MSA review.

Phase 3: Increase autonomy. Build confidence in the system by progressively expanding the scope of autonomous decisions. Shift from "human reviews every output" to "human reviews flagged outputs" to "human reviews a statistical sample" as accuracy holds consistently above target.
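That progression can be encoded as a review policy keyed to trailing accuracy. The 95% threshold and 10% sample rate below are illustrative assumptions.

```python
import random

# Progressive review policy sketch for Phase 3: review everything until
# accuracy holds above threshold, then only flagged outputs, plus a
# random statistical sample. Threshold and rate are illustrative.

ACCURACY_THRESHOLD = 0.95
SAMPLE_RATE = 0.10

def needs_review(output: dict, trailing_accuracy: float,
                 rng: random.Random) -> bool:
    if trailing_accuracy < ACCURACY_THRESHOLD:
        return True                      # review every output
    if output["flagged"]:
        return True                      # review flagged outputs
    return rng.random() < SAMPLE_RATE    # 10% statistical sample

rng = random.Random(0)
print(needs_review({"flagged": False}, trailing_accuracy=0.90, rng=rng))  # True
```

The policy is deliberately monotone: a drop in trailing accuracy automatically returns the system to full review, so autonomy is earned continuously rather than granted once.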

In the Trenches

The Agent That Processed 200 NDAs Per Month

A global professional services firm deployed an agentic AI system for NDA processing in early 2025. The agent's workflow: receive incoming NDA from business development, compare against the firm's standard NDA playbook, classify as Green (standard — auto-process), Amber (minor deviations — generate redlines and route for junior counsel review), or Red (material issues — escalate to senior counsel).

The first three months revealed a problem: the agent misclassified 12% of NDAs — mostly false Greens (non-standard NDAs classified as standard). The diagnosis: the playbook data used to train the agent needed enrichment with additional clause variants that counterparties commonly used.

The team invested six weeks in playbook enrichment — adding 85 additional clause variants with their risk classifications. They also added a "confidence threshold" to the agent: if the agent's classification confidence fell below 85%, the NDA was automatically escalated to Amber regardless of the classification.
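The confidence guardrail amounts to a single rule, sketched here as the case study describes it: any classification the agent is less than 85% sure of is forced to Amber, putting a human in the loop regardless of the tier the model picked.

```python
# Confidence-floor guardrail from the case study: classifications below
# 85% confidence are escalated to Amber (human review) regardless of
# the tier the model chose.

CONFIDENCE_FLOOR = 0.85

def final_tier(model_tier: str, confidence: float) -> str:
    if confidence < CONFIDENCE_FLOOR:
        return "Amber"
    return model_tier

print(final_tier("Green", 0.80))  # Amber — low confidence overrides the tier
print(final_tier("Green", 0.97))  # Green — confident classification stands
```

This is the cheap half of the fix; the expensive half was the six weeks of playbook enrichment that made the agent's confident classifications trustworthy.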

After enrichment, misclassification dropped to 2.1%. The agent processed an average of 200 NDAs per month, with 68% classified as Green (auto-processed), 26% as Amber (junior counsel review, average 20 minutes), and 6% as Red (senior counsel review). The firm estimated that the system freed approximately 140 hours of lawyer time per month — time that was redeployed to client-facing advisory work.

The critical success factor was not the AI technology. It was the quality of the playbook data — confirming, once again, that data quality determines AI effectiveness.

The Monday Morning Checklist

  • Classify your AI readiness. For your top 5 legal workflows by volume, assess: (1) Is the underlying data normalised and structured? (2) Is there a clear, documented process that the AI would follow? (3) Is there a measurable baseline for current performance? Workflows that answer "yes" to all three are AI-ready. Workflows that require foundation work become clear priority targets.
  • Identify your RAG knowledge base starting point. What is the most curated, authoritative, and well-maintained corpus of legal documents in your organisation? This — whether it is a playbook, a policy set, or a template library — is your RAG knowledge base seed.
  • Select one pilot use case. Apply the four selection criteria: high volume, low risk, measurable baseline, enthusiastic users. Commit to a 90-day pilot with defined evaluation metrics.
  • Establish your error management framework. For the selected pilot use case, determine: what is the consequence of an AI error? If the consequence is manageable (a summary that needs correction), proceed with appropriate HITL checkpoints. If the consequence is significant (an auto-executed contract with incorrect terms), maintain human review checkpoints across the full cycle until accuracy proves itself reliably.