Classifying, retaining, and retrieving legal information — the data foundations that enable AI, analytics, and compliance to work reliably.
## What Information Governance Actually Covers
Information Governance (IG) is the framework that defines how an organisation creates, classifies, retains, accesses, and disposes of its information assets. In legal operations, IG is the discipline that connects data quality, regulatory compliance, risk management, and technology enablement into a unified system.
A mature IG function encompasses:
**Classification and access control.** Every piece of legal information — contracts, advice, matter files, communications — has a classification level (public, internal, confidential, privileged) that determines who can access it, how it can be used, and how long it must be kept.
**Records management.** The systematic management of documents and data throughout their lifecycle: creation, active use, retention according to legal and business requirements, and eventual disposition. Legal departments face unique records management obligations — documents may need to be retained for litigation holds, regulatory investigations, or historical precedent long after a matter closes.
**Data quality and normalisation.** The technical foundation that makes every other capability work. Clean, structured, consistently formatted data is the prerequisite for compliance reporting, analytics, and AI-powered systems. Without data quality discipline, legal operations cannot reliably answer questions like “do we have a liability cap with this vendor?” or “what is our total contractual exposure?”.
**Metadata management.** The capture and maintenance of structured information about legal documents — who created it, when, for what purpose, with what classification, binding which parties. Metadata turns a document repository into an analytical asset.
**Integration and interoperability.** The middleware that connects source systems (CLM, matter management, e-billing) into a unified data environment where information flows automatically, consistently, and with appropriate controls.
**Audit and compliance.** The ability to demonstrate, at any point, that information is stored in the right place, classified correctly, accessible only to authorised users, and retained in accordance with legal and regulatory requirements.
Information Governance is not compliance theatre or a burden imposed by risk-averse IT functions. It is the foundational discipline that makes AI deployment reliable, enables confident decision-making through analytics, unlocks the value locked in legacy contracts and matters, and protects the organisation from data breaches, regulatory sanction, and litigation risk.
## The Legal Data Landscape
Legal data is disproportionately unstructured, inconsistently formatted, and distributed across systems for reasons that are structural rather than the result of any particular team’s shortcomings:
**Document-centric work product.** The primary output of legal work is documents — contracts, memos, opinions, filings — rich repositories of data embedded in natural language. Key contract terms (value, duration, governing law, liability cap) can be extracted from these documents and structured into queryable database fields through deliberate metadata capture and the AI-assisted extraction tools that legal teams are increasingly adopting.
**System fragmentation.** A typical legal department uses 5–8 distinct technology tools (matter management, document management, e-billing, CLM, e-signature, communication, calendar). Each system stores data in its own format, with its own field definitions, and its own identifiers. “Acme Corporation” in the CLM appears as “ACME Corp.” in the e-billing system and “Acme Corp Pty Ltd” in the matter management system. Without middleware and entity resolution, these separate records remain disconnected.
**Historical accumulation.** Legal departments carry decades of legacy data — paper files, scanned PDFs, obsolete matter management exports, departed lawyers’ email archives. This data has institutional value (precedents, historical positions, legacy obligations) and can be unlocked through structured indexing and AI-assisted metadata extraction.
**Data skills gap.** Lawyers are trained to work with text and legal reasoning rather than data management. Investing in data literacy across the legal function — through training, tools, and dedicated data roles — unlocks the analytical potential of legal department data and aligns legal operations with modern corporate functions.
**Regulatory and evidentiary requirements.** Legal information must often be retained for extended periods to satisfy litigation hold obligations, regulatory requirements, or corporate governance needs. The obligation to produce documents on demand means every piece of legal information may become material evidence. This creates a duty to maintain information in a state where its authenticity, integrity, and chain of custody can be demonstrated.
## Records Management Fundamentals
Records management is the practical discipline of applying retention schedules, managing legal holds, and ensuring documents are disposed of appropriately when their retention period expires.
### Document Retention Schedules
A retention schedule specifies how long each category of document must be kept, and the business or legal rationale for that period. Unlike general business records, legal documents often require extended retention periods.
**Typical retention tiers in legal:**
- **Permanent retention.** Corporate governance records (board minutes, constitutional documents), precedent-setting advice, significant client relationship documentation, and important transactional records (executed contracts of ongoing commercial significance).
- **7+ years.** Contracts with active or recently closed matters, billing and financial records (tax and audit requirements), employment and contractor agreements, litigation files. The “7 years” horizon is driven by statutes of limitations for contract claims and by tax record-keeping requirements.
- **3–5 years.** General matter files, routine correspondence, work-in-progress matter documentation. Retention extends beyond the matter close date to account for potential follow-on disputes or related matters.
- **1–2 years.** Email and temporary communications, draft documents, administrative records, routine scheduling and logistical information.
- **Disposal on closure.** Ephemeral communications, meeting notes without binding content, temporary working documents.
A retention schedule must be:
- **Documented and communicated.** The schedule exists as a reference document, not an implicit assumption. All staff understand why documents are kept and when they can be deleted.
- **Applied consistently.** Automated workflows or checklists ensure documents are classified to the correct retention tier at the point of creation. Metadata in the document management system captures retention classification and triggers disposition processes automatically.
- **Defensible in litigation.** The schedule must reflect genuine business and legal requirements, not arbitrary decisions to minimise storage costs. If a document has been destroyed, discovery requests will ask why it was destroyed and when. A documented, reasonable retention schedule is your defence.
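The schedule-plus-hold logic above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the tier names, retention periods, and helper function are all hypothetical assumptions.

```python
from datetime import date
from typing import Optional

# Hypothetical retention schedule: tier -> retention period in years.
# None means permanent retention (never eligible for disposition).
RETENTION_SCHEDULE = {
    "permanent": None,
    "financial": 7,
    "matter_file": 5,
    "correspondence": 2,
}

def disposition_eligible(tier: str, closed_on: date, on_hold: bool,
                         today: Optional[date] = None) -> bool:
    """A document may be disposed of only when its retention period
    has expired AND it is not subject to an active legal hold."""
    if on_hold:
        return False  # legal holds override the retention schedule
    years = RETENTION_SCHEDULE[tier]
    if years is None:
        return False  # permanent retention
    today = today or date.today()
    expiry = closed_on.replace(year=closed_on.year + years)
    return today >= expiry
```

Note how the legal hold check comes first: a hold always trumps schedule expiry, which mirrors the rule stated in the next section.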
### Legal Hold Management
A legal hold is an instruction to preserve documents and information relevant to anticipated or ongoing litigation, an investigation, or a regulatory matter. A legal hold overrides normal retention schedules — documents subject to hold must be kept until the hold is released.
**The hold process:**
1. **Identify custodians.** Determine which employees, contractors, and former staff likely hold information relevant to the hold matter. Include people who may have been involved in the underlying events, even if they are no longer in active roles.
2. **Issue hold notices.** Communicate the hold obligation to custodians in writing. The notice must clearly explain what information is subject to hold, the business reason, the consequences of destruction, and instructions for identifying and preserving information (whether it remains in-place or is transferred to a centralised location).
3. **Preserve information.** Custodians suspend their normal deletion practices. Email is not automatically deleted. Documents are not pruned from file servers. If information is stored across multiple systems (local drives, cloud storage, phone backups), all locations are included in the preservation obligation.
4. **Track and verify.** The legal team or records manager tracks which custodians have acknowledged the hold, verifies that information is being preserved, and addresses any gaps (custodians who didn’t receive notice, information in systems not initially identified).
5. **Release and dispose.** When the litigation or investigation concludes, hold notices are released. Custodians resume normal deletion practices. Documents subject to hold but not to ongoing retention schedules are disposed of appropriately.
A well-managed hold process is essential because failure to preserve documents requested in discovery can result in sanctions, adverse inference instructions (the court instructs the jury to presume the destroyed documents would have been unfavourable to the destroying party), or case dismissal. For regulatory investigations, document destruction can itself become a separate violation.
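The five steps above can be tracked with a simple data structure. The `LegalHold` class and custodian names below are hypothetical; a production system adds notice templates, reminders, release workflow, and audit logging.

```python
from dataclasses import dataclass, field

@dataclass
class LegalHold:
    """Minimal hold-tracking record (illustrative only)."""
    matter: str
    custodians: set = field(default_factory=set)   # step 1: identified custodians
    acknowledged: set = field(default_factory=set)  # step 4: confirmed preservation
    released: bool = False                          # step 5: hold released

    def acknowledge(self, custodian: str) -> None:
        if custodian not in self.custodians:
            raise ValueError(f"{custodian} is not a custodian on this hold")
        self.acknowledged.add(custodian)

    def outstanding(self) -> set:
        """Custodians who have not yet confirmed preservation --
        the gap the records manager must chase in step 4."""
        return self.custodians - self.acknowledged

hold = LegalHold("Dispute-2024-017", custodians={"a.lee", "r.patel", "j.wu"})
hold.acknowledge("a.lee")
assert sorted(hold.outstanding()) == ["j.wu", "r.patel"]
```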
### Disposition Processes
Disposition is the controlled, documented destruction or archival of information when its retention period expires. Disposition is not “hit delete” — it is a formal process.
**Sound disposition practice:**
- **Retention status is confirmed.** Before destruction, verify that the document’s retention period has actually expired and no legal hold is active.
- **Destruction is documented.** A disposition schedule or certificate records what was destroyed, when, and why. This creates an audit trail that demonstrates the organisation acted in good faith if destruction is later questioned.
- **Appropriate destruction method.** Documents are destroyed in a manner appropriate to their sensitivity (physical shredding for paper, certified data destruction for standard digital media, and higher-assurance secure erasure for sensitive electronic records).
- **Verification and sign-off.** Destruction is verified by independent personnel and documented with sign-off.
- **Archived materials are retained separately.** Materials of permanent or long-term value (corporate records, significant precedents) may be transferred to an archives function rather than destroyed, with appropriate access controls and metadata maintained.
Organisations that lack a disciplined disposition process tend to accumulate vast volumes of historical data. Beyond the practical problems (storage cost, system performance, backup complexity), this creates discovery risk — “we have everything, we’re not sure what is responsive” is not a safe litigation posture.
## Data Classification
Data classification defines who can access each piece of information and how it can be used. A typical four-tier scheme:
**Public.** Information intended for external communication. Published guidance, annual reports, press releases. No restriction on access or use.
**Internal.** Information used internally but not confidential. Policies, procedures, unclassified work product. Accessible to employees and contractors; not shared externally without approval.
**Confidential.** Information that would cause business harm if disclosed. Client information, commercial terms, business strategy, financial performance. Restricted to employees with a business need to know. Not shared externally without contractual confidentiality protections.
**Privileged / Legal Hold.** Information protected by legal privilege (solicitor–client privilege, litigation privilege, work product doctrine). Treated with the highest security. Disclosure may waive the privilege, so access is tightly controlled and disclosure practices are carefully managed. Often flagged with a legend (“Privileged and Confidential”) to signal the protection.
### Implementing Classification
Classification is most effective when it is:
**Automatic where possible.** Documents created in certain systems (CLM, matter management) can be auto-classified based on the context where they are created (documents in a litigation matter are confidential and may be privileged; documents in the templates library are internal). Templates can default to appropriate classifications.
**Enforced through workflow.** When documents are uploaded to a repository, classification is a required field. Users cannot proceed without selecting a classification level.
**Visible at point of use.** Classification is displayed prominently on documents so users understand the restrictions. Email headers, document headers, or metadata views show classification status.
**Enforced through access control.** The classification system is integrated with identity and access management. Privileged documents can only be accessed by explicitly authorised staff. Confidential documents are visible only to staff with a relevant business relationship.
**Auditable.** Access to classified documents is logged. If a sensitive document is accessed, the log record shows who accessed it, when, and what they did (viewed, downloaded, printed). Unusual access patterns (a user accessing hundreds of confidential documents outside their normal role) can be flagged for investigation.
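Classification-aware access control with an audit trail can be sketched as follows. The tier ordering, role clearances, and logger name are illustrative assumptions, not a prescribed design.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("doc_access")

# Hypothetical tier ordering (least to most restricted) and role clearances.
TIERS = ["public", "internal", "confidential", "privileged"]
CLEARANCE = {"paralegal": "confidential", "gc": "privileged", "vendor": "public"}

def can_access(role: str, doc_classification: str) -> bool:
    """Allow access only if the role's clearance meets or exceeds the
    document's classification; log every decision for the audit trail."""
    allowed = (TIERS.index(CLEARANCE.get(role, "public"))
               >= TIERS.index(doc_classification))
    audit_log.info("%s role=%s class=%s allowed=%s",
                   datetime.now(timezone.utc).isoformat(), role,
                   doc_classification, allowed)
    return allowed

assert can_access("gc", "privileged")
assert not can_access("paralegal", "privileged")
```

The point of the log line is the audit property described above: every access decision, allowed or denied, leaves a timestamped record that can be reviewed for unusual patterns.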
### Practical Implementation Steps
1. **Develop a classification policy.** Define the tiers, the criteria for each tier, examples of documents that fall into each category, and the handling rules for each tier.
2. **Apply to existing data.** The challenge is often the legacy corpus. Apply classification to high-value documents first (executed contracts, key advice) and progressively extend to the broader repository. AI-assisted classification can speed this process, though human review is usually required for sensitive materials.
3. **Build classification into workflow.** New documents are classified at creation. Templates default to appropriate classifications. Matter creation or contract uploads trigger classification workflows.
4. **Connect to access control.** The classification system is integrated with the document management system, email, and identity management so that access rules are automatically enforced.
5. **Communicate and train.** Staff understand what each classification tier means, how it affects their work, and how to handle documents they encounter. Training is especially important around privilege — inadvertent disclosure can waive the privilege.
## Data Normalisation: The AI Prerequisite
Data normalisation is the technical foundation that transforms raw, inconsistent data into structured, consistent, analytically usable information. It is the essential prerequisite for AI deployment in legal.
### The Five-Step Normalisation Programme
**Step 1: Entity Resolution.** Establish a single, authoritative naming convention for every entity (client, counterparty, vendor, firm) that appears in your data. Map all variations to the canonical name. This is the prerequisite for any cross-system analysis — you cannot track total spend with a vendor if the vendor has three different names across three systems.
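A rough sketch of the mapping step, using Python's standard-library `difflib` for fuzzy matching. The canonical list, suffix rules, and similarity cutoff are illustrative assumptions; in practice, ambiguous names fall through to human review.

```python
import re
import difflib
from typing import Optional

# Hypothetical canonical entity names (the "golden record" list).
CANONICAL = ["Acme Corporation", "Globex Pty Ltd", "Initech Inc"]

_SUFFIXES = re.compile(r"\b(pty|ltd|inc|corporation|corp|limited)\b\.?", re.I)

def normalise(name: str) -> str:
    """Lower-case, strip punctuation and corporate suffixes so that
    'ACME Corp.' and 'Acme Corporation' compare on the same basis."""
    name = _SUFFIXES.sub("", name.lower())
    return re.sub(r"[^a-z0-9 ]", "", name).strip()

def resolve(raw: str, cutoff: float = 0.8) -> Optional[str]:
    """Map a raw entity string to its canonical name via fuzzy match;
    return None so that ambiguous names go to human review."""
    keys = {normalise(c): c for c in CANONICAL}
    match = difflib.get_close_matches(normalise(raw), keys, n=1, cutoff=cutoff)
    return keys[match[0]] if match else None

assert resolve("ACME Corp.") == "Acme Corporation"
assert resolve("Globex Pty. Ltd.") == "Globex Pty Ltd"
```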
**Step 2: Taxonomy Standardisation.** Define standardised taxonomies for matter types, contract types, practice areas, and work categories. Adopt industry standards where they exist ([UTBMS codes](https://utbms.com/) for task-based billing, [SALI standards](https://www.sali.org/) for matter types, [LEDES standards](https://www.ledes.org/) for e-billing data exchange) and supplement with organisation-specific categories where needed. Apply the taxonomy consistently across all systems.
**Step 3: Metadata Enrichment.** For key data assets — particularly contracts — extract and structure the metadata embedded in document text into queryable, analytically useful fields. Party names, effective dates, expiry dates, contract values, governing law, key obligations, and material terms should exist as structured fields in the CLM or repository. AI-assisted extraction tools accelerate this work for legacy documents; new documents should have metadata captured at the point of creation through structured templates.
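As an illustration only, simple pattern-based extraction can structure a few fields from contract text. The field names, regular expressions, and sample text below are toy assumptions; real programmes combine structured templates, AI-assisted extraction, and human review.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContractMetadata:
    """Structured fields extracted from free-text contract language."""
    governing_law: Optional[str] = None
    effective_date: Optional[str] = None
    liability_cap: Optional[str] = None

# Illustrative patterns only; they will miss many real-world phrasings.
PATTERNS = {
    "governing_law": r"governed by the laws of ([A-Z][\w ]+?)[\.,]",
    "effective_date": r"effective as of (\d{1,2} \w+ \d{4})",
    "liability_cap": r"liability .{0,40}?shall not exceed (\$[\d,]+)",
}

def extract(text: str) -> ContractMetadata:
    meta = ContractMetadata()
    for field_name, pattern in PATTERNS.items():
        match = re.search(pattern, text, re.I | re.S)
        if match:
            setattr(meta, field_name, match.group(1))
    return meta

sample = ("This Agreement is effective as of 1 July 2024 and is governed by "
          "the laws of New South Wales. Aggregate liability shall not exceed "
          "$500,000.")
meta = extract(sample)
assert meta.governing_law == "New South Wales"
```

Once extracted, each field lives in the CLM as queryable data rather than prose, which is what makes questions like "what is our total contractual exposure?" answerable.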
**Step 4: Deduplication and Reconciliation.** Identify and resolve duplicate records across systems, creating a single authoritative record for each entity. When the same matter exists in multiple systems with different data, consolidate and reconcile to establish one source of truth. This foundational work enables all downstream analytics and decision support.
**Step 5: Ongoing Governance.** Data normalisation is a continuous capability, not a one-time project. Establish data governance processes that maintain quality on an ongoing basis: defined data owners for each system, automated validation rules that catch quality issues at the point of entry, and periodic audits that verify data integrity.
### Why Normalisation Matters for AI
Data normalisation is the essential foundation for successful AI deployment in legal. AI output quality is bounded by input quality: organisations that normalise their data before implementing AI get consistent, reliable, and actionable results, while those that skip this step get outputs that faithfully reproduce the noise in their repositories.
A machine learning model trained on inconsistent data (where entity names vary, taxonomies are non-standard, and metadata is missing) will learn to recognise the inconsistencies rather than the underlying patterns. When the model encounters new data, it struggles because the patterns it learned are artefacts of poor data quality, not genuine legal patterns. Conversely, an AI system querying clean, normalised data can focus on the substantive legal patterns and produce reliable, defensible insights.
### The Normalisation Investment
Data normalisation requires investment in time and effort, with returns that compound over time across all downstream applications. Frame the investment as an enabling capability that multiplies the value of all subsequent technology investments:
<table header-row="true">
<tr>
<td>Normalisation Step</td>
<td>Typical Effort</td>
<td>Prerequisite For</td>
</tr>
<tr>
<td>Entity resolution</td>
<td>2–4 weeks (one-time), ongoing governance</td>
<td>Cross-system reporting, vendor analytics, conflict checks</td>
</tr>
<tr>
<td>Taxonomy standardisation</td>
<td>4–6 weeks (one-time)</td>
<td>Matter analytics, benchmarking, trend analysis</td>
</tr>
<tr>
<td>Metadata enrichment</td>
<td>2–6 months (for legacy corpus)</td>
<td>CLM analytics, obligation management, AI-powered review</td>
</tr>
<tr>
<td>Deduplication</td>
<td>2–4 weeks per system pair</td>
<td>Reliable dashboards, accurate financial reporting</td>
</tr>
<tr>
<td>Ongoing governance</td>
<td>2–4 hours/week (permanent)</td>
<td>Sustaining all of the above</td>
</tr>
</table>
## The Three-Layer Data Architecture
### From Silos to Flow
Data normalisation makes individual systems reliable. The middleware layer makes them **interoperable**. Middleware — the integration technology that connects systems — enables data to flow between systems automatically, eliminating manual re-entry, reducing errors, and creating a unified data environment.
### The Legal Data Flow Architecture
A mature legal data architecture has three layers:
**Source systems.** The individual tools where data originates: CLM (contract data), matter management (matter data), e-billing (spend data), document management (work product), and enterprise systems (CRM, ERP, HR).
**Integration layer.** The middleware that moves data between source systems, applying transformation rules (normalisation, format conversion, entity mapping) in transit. This layer ensures that when a contract is executed in the CLM, the relevant data flows automatically to the CRM (deal closed), the ERP (payment terms activated), and the matter management system (matter status updated).
**Analytics layer.** The data warehouse or business intelligence platform where data from all source systems is consolidated for reporting and analysis. This is where financial performance is analysed, where spend trends are tracked, where cycle time metrics are computed, where legal value is measured, and where AI models are trained on high-quality, integrated legal information.
### The Single Source of Truth
The “single source of truth” is a data architecture where every data element has one authoritative source, and all other systems that consume that data receive it from that source. Contract terms are authoritative in the CLM. Spend data is authoritative in the e-billing system. Matter status is authoritative in the matter management system. The analytics layer reads from all authoritative sources, and the middleware ensures that authoritative data propagates to dependent systems.
When someone asks “what is our total contract exposure with Vendor X?”, the answer comes from one authoritative source, eliminating the confusion that arises from multiple spreadsheets maintained by different people showing conflicting numbers.
## Building the Single Source of Truth
A single source of truth is achieved through:
**Clear authoritative source designation.** For each data category, explicitly designate which system is authoritative. Document it. Communicate it to all staff. When someone asks “where is the definitive data on matter status?” or “what is the contract’s payment term?”, there is one agreed answer, not three competing spreadsheets.
**Middleware that enforces flow.** Once authoritative sources are designated, middleware ensures data flows from the authoritative source to dependent systems automatically, rather than through manual processes. Contract execution in the CLM triggers automatic updates to the CRM, matter management, and billing systems. No manual re-entry. No divergence.
**Integrated analytics.** The analytics layer pulls data from all authoritative sources and integrates it into a unified repository. This enables reporting and analysis that spans systems — total spend with a vendor across matters and jurisdictions, contract exposure against revenue, client profitability accounting for matter cost and contract margin.
**Reconciliation processes.** Periodic reconciliations identify divergence between systems (why is this contract showing different values in the CLM and the e-billing system?) and trigger investigation. Reconciliation is not a once-yearly audit — in a mature data architecture, reconciliations are continuous and automated, flagging anomalies in real time.
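A continuous reconciliation check reduces to comparing authoritative values across systems and flagging gaps. The contract IDs, values, and tolerance below are hypothetical.

```python
# Hypothetical per-contract values as reported by two systems.
clm_values = {"C-1001": 250_000, "C-1002": 80_000, "C-1003": 40_000}
ebilling_values = {"C-1001": 250_000, "C-1002": 95_000, "C-1004": 12_000}

def reconcile(a: dict, b: dict, tolerance: float = 0.01) -> dict:
    """Flag records missing from either system or diverging by more
    than the tolerance -- each flag triggers an investigation."""
    flags = {}
    for key in a.keys() | b.keys():
        if key not in a or key not in b:
            flags[key] = "missing in one system"
        elif abs(a[key] - b[key]) > tolerance * max(a[key], b[key]):
            flags[key] = f"divergent: {a[key]} vs {b[key]}"
    return flags

flags = reconcile(clm_values, ebilling_values)
assert set(flags) == {"C-1002", "C-1003", "C-1004"}
```

Run on a schedule (or on every sync), this kind of check turns reconciliation from a yearly audit into the continuous, automated anomaly flagging described above.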
## Measuring Information Governance Maturity
Information Governance maturity can be assessed through key metrics that reflect the health of your data, the effectiveness of your processes, and your readiness for advanced applications like AI and analytics.
### Key Information Governance Metrics
**Data coverage percentage.** What proportion of your legal information assets are in structured, queryable systems versus unstructured repositories or offline storage? Target: >85% of active contracts in CLM; >80% of matter documents in document management system.
**Metadata completeness percentage.** Of the documents in your repository, what proportion have complete, accurate metadata (parties, dates, values, classification)? For contracts, what proportion have extraction of key commercial terms? Target: >90% of executed contracts have complete standard metadata; >75% have extracted commercial terms.
**Records classified percentage.** What proportion of documents in your repository have been assigned a retention classification and access classification? Target: >95%.
**Policy compliance rate.** Are documents being retained and disposed of in accordance with your retention schedules? Are legal holds being managed correctly? Do audit logs show that access controls are being enforced? Target: >98% of dispositions compliant with schedules; 100% of legal holds tracked and managed.
**Data quality audit results.** Periodic data quality audits check for consistency (entity names, taxonomy application, metadata accuracy). Sample 100 random records and assess for accuracy. Target: >95% accuracy on entity resolution; >90% accuracy on taxonomy application.
**System interoperability.** What proportion of data flows between source systems are automated via middleware versus manual re-entry? Target: >90% of routine data flows automated.
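Two of these metrics, metadata completeness and the audit sample, can be computed mechanically. The required field list and the records below are illustrative assumptions.

```python
import random

# Hypothetical set of fields a record must carry to count as "complete".
REQUIRED = {"party", "effective_date", "value", "classification"}

def metadata_completeness(records: list) -> float:
    """Share of records carrying a non-empty value for every required field."""
    complete = sum(1 for r in records
                   if REQUIRED <= {k for k, v in r.items() if v})
    return complete / len(records)

def audit_sample(records: list, n: int = 100, seed: int = 42) -> list:
    """Draw a reproducible random sample for the periodic quality audit."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))

records = [
    {"party": "Acme", "effective_date": "2024-01-01",
     "value": 10_000, "classification": "confidential"},
    {"party": "Globex", "effective_date": None,
     "value": 5_000, "classification": "internal"},
]
assert metadata_completeness(records) == 0.5
```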
### Information Governance Maturity Scorecard
<table header-row="true">
<tr>
<td>Capability</td>
<td>Immature (0–25)</td>
<td>Developing (25–50)</td>
<td>Managed (50–75)</td>
<td>Optimised (75–100)</td>
</tr>
<tr>
<td>**Classification**</td>
<td>No formal classification; inconsistent naming</td>
<td>Classification policy defined; partial rollout</td>
<td>Classified >75% of documents; consistent application</td>
<td>>95% of documents classified; automated enforcement</td>
</tr>
<tr>
<td>**Retention**</td>
<td>Retention schedules vague or absent</td>
<td>Retention schedules defined for major categories</td>
<td>Schedules documented and >80% applied</td>
<td>Automated disposition workflows; >98% compliance</td>
</tr>
<tr>
<td>**Legal Hold**</td>
<td>Holds managed informally; inconsistent</td>
<td>Processes documented; holds tracked in spreadsheet</td>
<td>Centralised holds tracking; some automation</td>
<td>Integrated holds management; automated notifications</td>
</tr>
<tr>
<td>**Data Quality**</td>
<td>Multiple entity names for same counterparty; inconsistent metadata</td>
<td>Entity resolution underway; taxonomy project initiated</td>
<td>Entity resolution >80% complete; metadata >75% structured</td>
<td>>95% entity resolution; metadata >90% complete</td>
</tr>
<tr>
<td>**Access Control**</td>
<td>Access based on role only; minimal audit</td>
<td>Classification linked to access; access logs maintained</td>
<td>Access controls enforced by system; audit trail complete</td>
<td>Real-time access monitoring; anomaly detection active</td>
</tr>
<tr>
<td>**Analytics**</td>
<td>No integrated reporting; multiple separate reports</td>
<td>Data warehouse planned; pilot BI tool</td>
<td>Warehouse with major source systems; dashboards for key metrics</td>
<td>Integrated analytics across all systems; AI models operational</td>
</tr>
</table>
Use this scorecard to identify priority improvement areas. If your organisation is at “immature” on Classification, that is a priority — you cannot confidently restrict access to privileged material if documents are not classified. If you are “immature” on Data Quality, that is your blocking issue for AI deployment — clean the data first.
## In the Trenches
**The Data Clean-Up That Unlocked AI**
An Australian financial services firm spent $180,000 on an AI-powered contract analysis tool. The vendor demo was impressive: the tool could extract 47 different data points from a contract, identify risk clauses, and score overall contract risk. The firm’s General Counsel approved the investment based on the compelling demo.
Four weeks into deployment, the results were unusable. The tool was extracting entity names inconsistently (the same counterparty appeared under seven different name variations), misclassifying contract types (because the firm’s naming conventions differed substantially from the tool’s training taxonomy), and misidentifying standard internal clauses as “high risk” due to lack of context about the firm’s approved positions.
The Head of Legal Operations paused the AI deployment and redirected the team to a three-month data normalisation sprint. They resolved entity names across the contract repository (consolidating 2,800 unique entity strings into 940 canonical entities). They standardised contract type classifications. They tagged 200 approved clause variants as “standard” so the AI could distinguish them from genuinely non-standard language.
When the AI tool was reactivated against the normalised data, extraction accuracy jumped from 62% to 94%. Risk scoring became reliable enough to use in production workflows. Contract review was accelerated from an average of 6 hours to 90 minutes per contract. The tool that had been a $180,000 write-off candidate became the foundation of the firm’s contract analytics capability and directly contributed to a 30% improvement in contract cycle time.
The lesson is universal: the AI worked as designed. The investment only delivered value once the data foundation was cleaned and normalised. Establishing information governance and data quality as foundational capabilities, ahead of advanced technologies, unlocks the full potential of every downstream investment.
## Checklist
- **Run an entity resolution check.** Pick your top 10 counterparties by contract volume or spend. Search for each across every system the legal department uses (CLM, matter management, e-billing, DMS, CRM). Count the name variations. If any entity has more than two variations, entity resolution is an immediate priority.
- **Define your authoritative sources.** For each major data category (contracts, matters, spend, work product), designate one system as the authoritative source. Document it. Communicate it. When someone asks “where is the definitive data on X?”, there should be one answer, not three.
- **Audit your retention schedules.** Review your current document retention policy. Does it cover all document types the legal department manages? Are the retention periods documented and justified (compliant with laws, regulatory requirements, business needs)? Are retention classifications actually being applied?
- **Estimate your metadata enrichment backlog.** How many contracts in your repository have structured metadata (party, value, dates, key terms) versus unstructured storage (a PDF with a filename)? The ratio of structured to unstructured is your metadata enrichment gap — and your AI-readiness indicator.
- **Map one end-to-end data flow.** Select a process (e.g., contract execution to revenue recognition) and trace the data from origin to destination. Note every point where data is manually transferred, reformatted, or re-entered. Each manual step is an error risk and a middleware automation candidate.
- **Classify a small sample of documents.** Take 20 documents at random from your repository and assign them a classification (public, internal, confidential, privileged). What fraction of the team would agree with your classifications? If consensus is unclear, your classification criteria need tightening and your team needs training.
## Suggested Reading
- [ARMA - Generally Accepted Recordkeeping Principles](https://www.arma.org/page/principles)
- [ISO 15489-1: Information and Documentation - Records Management](https://www.iso.org/standard/62542.html)
- [NIST Privacy Framework](https://www.nist.gov/privacy-framework)
- [NIST Cybersecurity Framework 2.0](https://www.nist.gov/cyberframework)
- [IAPP - Data Governance Resources](https://iapp.org/resources/topics/data-governance/)