Transforming AI from Generalist Tools to Specialised Teammates: Measuring Accuracy, Cost and Efficiency Gains in Human-in-the-Loop Regulatory Workflows

This whitepaper presents new research comparing generalist and specialist language models in Human-in-the-Loop (HITL) workflows – highlighting measurable gains in accuracy, stability, cost and efficiency.

Learn how domain-specific AI can transform regulatory bottlenecks into scalable, auditable and high-performance systems.


Executive summary

Authors: Max Ashton-Lelliott and Robert Wardrop

Organisations face a fundamental challenge with artificial intelligence adoption. The impressive capabilities of generalist Large Language Models (LLMs) like GPT and Claude have attracted widespread adoption – recent surveys indicate more than 40% of knowledge workers are using AI tools to increase their personal productivity. Yet the same users who have integrated these tools into their personal workflows describe them as unreliable when encountered within enterprise workflows. Unsurprisingly, companies cite concern about the quality of AI model output as a leading cause of GenAI pilot failure. This dichotomy is prevalent in the legal domain, where legal and regulatory professionals reportedly use generalist LLMs for initial drafting and research while remaining concerned about AI-generated hallucinations. Law firms are adopting legal LLMs for tasks encompassing contract analysis, case law interpretation and negotiated language, but many tasks go beyond general legal language processing, requiring the more challenging classification of regulatory and policy text rather than broader legal document processing.

Enterprises are experimenting with Agentic AI systems consisting of Small Language Models (SLMs) performing specialised tasks in sequence, collaborating with humans to create Human-in-the-Loop (HITL) workflows. The viability of Agentic AI systems in these workflows remains highly dependent on model performance, and there is a limit to how far complex legal and regulatory tasks can be automated. Inaccurate or inconsistent output at any step in the sequence cascades through the entire workflow pipeline to create more work, not less, for those in the workflow; for example, a three-step pipeline whose steps are each 90% accurate delivers only about 73% accuracy end to end. For AI to be a viable part of a workflow it must exceed an acceptable threshold of accuracy while also having the speed, stability and resource efficiency necessary for effective collaboration.

Agentic AI systems are promising, but relatively little research has been published comparing the performance of specialised SLMs with generalist LLMs on workflow tasks with low tolerance for inaccuracy. Our research addresses this knowledge gap by comparing the performance of a specialised SLM against a panel of generalist LLMs on a 2-level document classification task commonly undertaken in legal and compliance workflows. We use the output of our tests to estimate the impact of the performance differences on HITL legal and regulatory workflows and consider the strategic implications of specialised SLM deployment for organisations.

Our results demonstrate that our specialised small language model, trained on structured legal and regulatory data, outperforms generalist LLMs across 5 dimensions relevant to the successful deployment of enterprise AI:

1. Performance superiority through specialisation

The RegGenome specialist model outperforms the best-performing provider-hosted LLM (Gemini 2.5 Pro) by 16 percentage points (a 38% relative gain) when classifying Anti-Money Laundering (AML) regulatory documents. The specialisation advantage is amplified in newer, less-established regulatory domains: the RegGenome model achieved a 21 percentage point advantage (a 72% relative gain) over generalist LLMs when classifying cryptocurrency documents, highlighting the significant value of domain-specific model training in rapidly evolving domains where human subject matter expertise may be in short supply.
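The percentage-point and relative gains quoted above are linked by simple arithmetic, and the short sketch below back-calculates the implied generalist baselines. The baseline and specialist scores used here are illustrative assumptions derived from the reported figures, not values taken from the study.

```python
# Minimal sketch: how the percentage-point and relative gains above relate.
# The scores below are back-calculated from the reported figures and are
# illustrative assumptions, not values taken from the white paper.

def relative_gain(specialist: float, generalist: float) -> float:
    """Relative gain of the specialist over the generalist baseline."""
    return (specialist - generalist) / generalist

# AML: a 16 pp lead over a ~0.42 baseline is a ~38% relative gain.
print(f"AML:    {relative_gain(0.58, 0.42):.0%}")
# Crypto: a 21 pp lead over a ~0.29 baseline is a ~72% relative gain.
print(f"Crypto: {relative_gain(0.50, 0.29):.0%}")
```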

2. Addressing the stability crisis in AI systems

The leading provider-hosted LLMs produce different answers to identical questions up to 100 times as often as the RegGenome specialised SLM. This instability represents a fundamental barrier to enterprise adoption in regulatory contexts, where consistency and predictability are non-negotiable requirements. We quantify this instability with a novel measurement framework based on Mean Absolute Difference (MAD) and Cosine Distance metrics, providing organisations with reproducible methods to assess AI reliability.
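To illustrate the kind of measurement involved, the sketch below shows one plausible way to compute MAD and cosine distance across repeated runs on the same document, assuming each run yields a probability vector over classification labels. The white paper's exact framework may differ, and the data here is synthetic.

```python
# Hedged sketch of one way to quantify run-to-run instability, assuming each
# run yields a probability vector over classification labels. The white
# paper's exact framework may differ; only the metric names match the text.
import numpy as np
from itertools import combinations
from scipy.spatial.distance import cosine  # cosine distance = 1 - similarity

def mean_absolute_difference(runs: np.ndarray) -> float:
    """Mean absolute difference between label-probability vectors,
    averaged over all pairs of repeated runs on the same document."""
    pairs = combinations(range(len(runs)), 2)
    return float(np.mean([np.abs(runs[i] - runs[j]).mean() for i, j in pairs]))

def mean_cosine_distance(runs: np.ndarray) -> float:
    """Mean pairwise cosine distance between repeated runs."""
    pairs = combinations(range(len(runs)), 2)
    return float(np.mean([cosine(runs[i], runs[j]) for i, j in pairs]))

# Five repeated runs over the same document, 4 labels (synthetic data).
runs = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.68, 0.22, 0.05, 0.05],
    [0.40, 0.45, 0.10, 0.05],  # an unstable run: the top label flips
    [0.71, 0.19, 0.05, 0.05],
    [0.69, 0.21, 0.05, 0.05],
])
print(f"MAD:             {mean_absolute_difference(runs):.4f}")
print(f"Cosine distance: {mean_cosine_distance(runs):.4f}")
```

A perfectly stable model would score zero on both metrics, so higher values directly flag the run-to-run disagreement described above.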

3. Economic and operational outperformance

The specialist SLM operates at 1/80th the cost of the generalist LLMs while consuming an estimated 1/200th of the energy. These efficiency gains enable rapid, cost-effective review cycles that can transform legal and regulatory workflows from bottlenecks into competitive advantages. A single GPU can process roughly 250,000 pages/day with deterministic outputs and full data sovereignty.
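As a back-of-envelope check on the throughput figure, the per-page latency below is derived purely from the 250,000 pages/day claim rather than measured:

```python
# Back-of-envelope check on the single-GPU throughput figure above.
# The per-page latency is derived, not measured; it simply shows what
# sustained rate 250,000 pages/day implies.
pages_per_day = 250_000
seconds_per_day = 24 * 60 * 60                       # 86,400 s
pages_per_second = pages_per_day / seconds_per_day
print(f"{pages_per_second:.1f} pages/s")             # ~2.9 pages/s
print(f"{1000 / pages_per_second:.0f} ms per page")  # ~346 ms/page
```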

4. Rapid adaptation to domain turbulence

Our research included the creation of a model trained on a broad dataset spanning multiple domain taxonomies to test whether it could act as a ‘meta-learner’, going beyond learning the classification labels of a single task to learning how to classify as a transferable skill. The RegGenome unified model achieved a 0.38 Weighted F1-Score on completely unseen taxonomies, significantly outperforming the raw open-source baseline, Mistral 3 Instruct (0.31). The 7-point uplift between the 2 models suggests trained SLMs can provide immediate utility for creating the structures and annotated data needed to fine-tune high-performing models in new domains, without lengthy retraining cycles.
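For readers unfamiliar with the metric, a minimal sketch of how a Weighted F1-Score is computed with scikit-learn follows; the taxonomy labels and predictions are synthetic placeholders, not data from the study.

```python
# Minimal sketch of the Weighted F1-Score reported above, computed with
# scikit-learn on synthetic labels; the taxonomy and predictions here are
# placeholders, not data from the study.
from sklearn.metrics import f1_score

y_true = ["aml", "aml", "crypto", "privacy", "privacy", "privacy"]
y_pred = ["aml", "crypto", "crypto", "privacy", "privacy", "aml"]

# 'weighted' averages per-class F1 scores, weighting each class by its
# support, so frequent labels count for more.
print(f"Weighted F1: {f1_score(y_true, y_pred, average='weighted'):.2f}")
```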

5. Compounding benefits in multi-agent workflows

In a 3-step case study pipeline – Identify (L1), Extract (LLM), Classify (L2) – small per-step gains compound into large end-to-end improvements. Using the specialist model at the decision points delivered a 64% relative increase in correctly processed AML documents versus pipelines built on the best provider-hosted LLM; in the turbulent cryptocurrency domain, this advantage doubled to 112%. For HITL review, the specialist approach reduced human review time by approximately 40% compared to manual annotation, while API-based approaches delivered only marginal time savings at much longer processing latency.
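To see why per-step gains compound, note that the end-to-end yield of a sequential pipeline is, to a first approximation and assuming independent errors, the product of its step accuracies. The sketch below illustrates this with assumed step accuracies; the figures are illustrative, not the study's measured values.

```python
# Illustrative sketch of why per-step gains compound: the end-to-end yield
# of a sequential pipeline is the product of step accuracies (assuming
# independent errors). The step accuracies below are assumptions for
# illustration, not figures from the study.
from math import prod

def end_to_end_yield(step_accuracies: list[float]) -> float:
    """Fraction of documents that survive every step correctly."""
    return prod(step_accuracies)

generalist = [0.80, 0.90, 0.55]  # Identify (L1), Extract (LLM), Classify (L2)
specialist = [0.92, 0.90, 0.80]  # same extract step, specialist at decisions

g = end_to_end_yield(generalist)
s = end_to_end_yield(specialist)
print(f"generalist pipeline: {g:.0%} correct end-to-end")
print(f"specialist pipeline: {s:.0%} correct end-to-end")
print(f"relative improvement: {(s - g) / g:.0%}")
```

With these assumed figures, modest per-step leads of around 12 and 25 percentage points translate into a roughly two-thirds relative improvement in end-to-end yield, the same compounding dynamic behind the 64% and 112% gains reported above.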

While our experiments use regulatory text, these findings generalise to any high-stakes workflow where tasks decompose into sequential steps, decisions must be reproducible for audit and risk management, and domain language exhibits structure that rewards targeted fine-tuning. Examples span financial crime operations, clinical coding, safety incident triage, environmental compliance and audit analytics. In these settings, combining generalist LLMs for extraction and drafting with specialised SLMs owning classification and decision points creates a powerful collaborative architecture that transforms AI from an unreliable tool into a dependable teammate – one that scales reliably across domains where consistency, accountability and speed are mission critical.

Download the white paper

Discover how specialised AI models are reshaping high-stakes legal and compliance tasks in our white paper: “Transforming AI from Generalist Tools to Specialised Teammates: Measuring Accuracy, Cost and Efficiency Gains in Human-in-the-Loop Regulatory Workflows”
