Don’t blame the model - train the (digital) colleague
AI agents are now in production, but in many cases they’re not ready for the roles we’re giving them. They underperform. They hallucinate. Some invent refunds, cite fake laws, or answer with confidence when they should escalate. The usual fix? Fine-tune the model, rewrite the prompt, upgrade to the next release. Still not enough.
This isn’t a model problem. It’s a design problem. In this piece, we take a step back: what if we stopped treating models as tools and instead applied the thinking we use to recruit, train and develop our human colleagues? Not just smarter systems, but systems with structure. With protocols, responsibilities, and oversight. That’s where the shift begins.
Salesforce’s CRMArena-Pro benchmark recently found that AI agents succeed in just 35% of multi-step CRM tasks, despite rapid model improvements and growing enterprise adoption [1]. But that’s not an outlier. Across the industry, the evidence is clear:
OpenAI’s internal tests show hallucination rates up to 79% in reasoning tasks for their latest models, more than double earlier versions [2, 3].
In real-world deployments, hallucinations account for over one-third of all documented failures in enterprise LLM systems [4].
The Phare benchmark (Pervasive Hallucination Assessment) confirms that top-tier models like GPT-4 and Claude fabricate information across domains, especially when reasoning or tool use is required [5, 6].
In customer-facing settings, hallucinations have led to compliance breaches, legal sanctions, and measurable brand damage [7, 8].
According to Gartner’s 2025 AI TRiSM findings, inaccurate or unwanted outputs from generative agents, i.e., hallucinations, rank among the top-three risks hindering scaled adoption in regulated industries such as finance, telecom, and public services. Gartner further highlights that declining trust in these autonomous systems (often stemming from hallucinatory behavior) is driving investment in AI governance platforms, runtime monitoring tools, and disinformation controls to enable enterprise-scale deployment [9, 10, 11].
Hallucinations are operational liabilities, eroding trust, misrepresenting actions, and exposing companies to reputational, regulatory, and financial risk. But here’s the problem: we keep trying to fix this at the model level with better prompts, better training data, better models. That’s not enough.
"Hallucinations aren’t necessarily model failures, they’re more likely onboarding failures. When we treat foundation models as digital colleagues but skip the structure, context, and escalation protocols we give human hires, we shouldn’t be surprised when they start making things up. A digital colleague without grounding isn’t unreliable, it’s unsupported."
Jens Eriksvik, Algorithma
These agents are being dropped into enterprise roles without the systems that make those roles work. At Algorithma, we don’t treat models as tools. We treat them as digital colleagues, i.e. a digital workforce. And that mindset changes everything.
Hiring a digital colleague isn’t enough. It needs proper onboarding.
Foundation models can look impressive out of the box. They’ve read the literature, understand the terminology, and can follow instructions. In that sense, they’re like a strong junior hire straight out of university: smart, adaptable, and fluent in the domain. But they haven’t been onboarded. They don’t know your workflows, your data structures, or your compliance obligations.
Despite that, many organisations deploy them directly into production settings. They’re expected to handle customer queries, process actions, and generate business-critical outputs, without proper context, integration, or oversight. It’s the equivalent of asking a new hire to approve refunds, draft legal statements or prepare full exec presentations, all on their first day.
What gets labelled as hallucination is often a result of this gap. Without access to reliable systems of record, models guess. They fill in missing context with something plausible. Benchmarks like Salesforce’s LLM agent evaluation [12] have shown that agents routinely fabricate tool actions, stating that a refund was processed or a database was updated when no such backend call ever occurred. We have seen this firsthand at Algorithma in our own work with AI agents.
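To make this failure mode concrete, here is a minimal sketch of one guardrail against fabricated tool actions: record every backend call the agent actually makes, and check the agent’s claimed actions against that record before anything reaches the customer. The names here (ActionLog, AgentResponse, the refund identifier) are illustrative assumptions, not a specific product API.

```python
# Minimal sketch: verify an agent's claimed actions against actual backend calls.
# All names here are illustrative assumptions, not a real API.
from dataclasses import dataclass, field


@dataclass
class ActionLog:
    """Records every backend call the agent actually made."""
    executed: set[str] = field(default_factory=set)

    def record(self, action_id: str) -> None:
        self.executed.add(action_id)


@dataclass
class AgentResponse:
    text: str                  # what the agent wants to tell the customer
    claimed_actions: set[str]  # actions the agent says it performed, e.g. {"refund:order-123"}


def verify_response(response: AgentResponse, log: ActionLog) -> AgentResponse:
    """Block any claim that has no matching backend call."""
    unverified = response.claimed_actions - log.executed
    if unverified:
        # The agent asserted something the system of record never saw:
        # escalate instead of letting the claim reach the customer.
        return AgentResponse(
            text=f"Escalating to a human agent (unverified actions: {sorted(unverified)}).",
            claimed_actions=set(),
        )
    return response


# Usage: the agent claims a refund, but no matching backend call was recorded.
log = ActionLog()
reply = AgentResponse(text="Your refund has been processed.", claimed_actions={"refund:order-123"})
print(verify_response(reply, log).text)  # escalates, because the refund never happened
```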
From a systems perspective, this is a question of responsibility and reliability. When we treat foundation models as digital colleagues, as we’ve outlined previously, they need to be given the same structures as any other team member: clear scope, access to relevant tools, and a way to escalate or defer when unsure. Not doing so leads to failure modes that aren’t technical, they’re operational [13].
The onboarding process for a digital colleague should reflect the organisation it works within. That includes grounding it in internal knowledge, aligning it with workflows, and setting thresholds for when it should ask for help or hand off. It also includes performance monitoring, feedback mechanisms, and revision of its span of responsibility over time [14].
We’ve made the case that this shift, from endpoint to colleague, requires a different mental model. One where agents are part of the team, not just interfaces for pre-trained intelligence. And one where trust is built through alignment, not assumed through fluency [15]. Hiring the model is the easy part. Onboarding it is where the work begins.
Hallucinations aren’t anomalies. They’re design signals.
Between 2022 and 2025, hallucination rates increased in some advanced models, particularly in multi-turn and tool-using workflows. As mentioned, Salesforce found that only 35% of such tasks were completed correctly [16]. And the industry is full of real-world consequences: invented refunds, fabricated legal citations, compliance breaches, and measurable brand damage.
These failures are the result of AI agents operating without real access, grounding, or feedback loops. The hallucination isn’t a model flaw, it’s a system flaw.
Architecting for colleagues, not shortcuts
The real fix is better architecture. In an AI-native enterprise, hallucinations are signals of architectural immaturity - gaps in grounding, coordination, and oversight. The most effective strategies we’ve seen are built around five practical principles:
Grounding over guessing: Use retrieval-augmented generation (RAG) and structured tool integrations to anchor the model’s output in real data. Treat it as an open-book test, not a memory game. [See more in 17]
Reasoning with transparency: Use prompting strategies like ReAct that require the agent to explain its thinking before taking action. Make decisions visible. Traceable logic makes failure recoverable.
Refusal is a feature: Design workflows and reward functions where the agent can say “I don’t know” or escalate when uncertain (see the sketch after this list). Don’t optimize for confidence at the expense of accuracy.
Protocols, not prompts: Agents need rules, shared state, tool access, and escalation logic. Prompts alone can’t coordinate work across systems. Give agents a role, a process, and guardrails. Not just a starting sentence. [See more in 18]
HITL and HOTL (humans in and on the loop): Instrumentation matters. Confidence thresholds, scenario testing, audit trails, and fallback logic should be built in. Oversight isn’t a last resort. It’s part of how responsible systems operate. [See more in 19]
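To show how the first four principles fit together, here is a minimal sketch in Python. The retrieve and call_llm functions are placeholders standing in for a real retrieval layer and model API, and the 0.7 threshold is an assumed, tunable value rather than a recommendation.

```python
# Minimal sketch: grounding, a visible reasoning trace, and refusal/escalation.
# `retrieve` and `call_llm` are placeholders for a real retrieval layer and LLM API.
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # assumed, tunable refusal threshold


@dataclass
class Evidence:
    source: str
    text: str
    score: float  # retrieval confidence, 0..1


def retrieve(question: str) -> list[Evidence]:
    """Placeholder for a RAG lookup against internal systems of record."""
    return []  # nothing retrievable in this toy example


def call_llm(prompt: str) -> str:
    """Placeholder for the model call."""
    return "..."


def answer(question: str) -> dict:
    evidence = retrieve(question)
    best = max((e.score for e in evidence), default=0.0)

    # Refusal is a feature: below the threshold, escalate instead of guessing.
    if best < CONFIDENCE_THRESHOLD:
        return {"action": "escalate", "reason": "insufficient grounding", "trace": []}

    # Grounding over guessing + reasoning with transparency: the prompt carries the
    # retrieved evidence and asks for an explicit rationale that can be logged.
    prompt = (
        "Answer using ONLY the evidence below. Explain your reasoning step by step "
        "before giving the final answer.\n\n"
        + "\n".join(f"[{e.source}] {e.text}" for e in evidence)
        + f"\n\nQuestion: {question}"
    )
    return {"action": "answer", "output": call_llm(prompt), "trace": [e.source for e in evidence]}


print(answer("Has order 123 been refunded?"))  # escalates: nothing was retrieved
```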
These principles are about designing for accountability. Once an agent is working inside your business, the real question isn’t whether it’s capable, it’s whether the system around it knows when to trust it, when to question it, and when to intervene.
“Hallucinations signal architectural failure. When you drop a foundation model into production without grounding, state management, or escalation logic, it will guess. That’s not a model bug, it’s a system design flaw. If you’re serious about digital colleagues, build like it: RAG pipelines, action logging, refusal thresholds, and human-in/on-the-loop controls. Otherwise, you’re not deploying intelligence, you’re shipping guesswork.”
Alex Ekdahl, Algorithma
Underperformance is not a failure of intelligence. It’s a failure of workplace design.
When an agent hallucinates a refund, the issue is that the system around it hasn’t been built to support the task. The model needs access to the refund system. It needs context from the ticket history. It needs logging and escalation protocols that catch low-confidence actions before they reach production. And it needs feedback when it gets something wrong, not just to improve future behaviour, but to help define the limits of what it should be doing at all.
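As a minimal sketch of that kind of guardrail, assuming a hypothetical issue_refund backend call and an illustrative threshold: low-confidence actions are logged and routed to human review, and success is only reported when the underlying system call actually went through.

```python
# Minimal sketch: a confidence-gated refund action with an audit trail.
# `issue_refund` and the threshold value are illustrative assumptions.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
AUTO_APPROVE_THRESHOLD = 0.9  # assumed: below this, a human reviews the action


def issue_refund(order_id: str, amount: float) -> bool:
    """Placeholder for the real refund system call."""
    return True


def handle_refund_request(order_id: str, amount: float, confidence: float) -> str:
    # Audit trail: every proposed action is logged before anything happens.
    logging.info(json.dumps({
        "ts": time.time(), "action": "refund", "order": order_id,
        "amount": amount, "confidence": confidence,
    }))

    # Low-confidence actions are caught before they reach production.
    if confidence < AUTO_APPROVE_THRESHOLD:
        return f"Refund for {order_id} queued for human review."

    # High-confidence path: only claim success if the backend call succeeded.
    if issue_refund(order_id, amount):
        return f"Refund of {amount} for {order_id} processed."
    return f"Refund for {order_id} failed; escalating to a human agent."


print(handle_refund_request("order-123", 49.0, confidence=0.62))  # queued for review
```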
We’ve seen this in practice. In customer service automation, hallucinated actions are almost always linked to a lack of grounding or oversight: claims of cancelled orders or issued refunds that never reached the underlying system. Not because the model is malicious or broken, but because the workflow is disconnected.
As we wrote in Designing the AI-Native Enterprise, the real shift isn’t from manual to automated. It’s from fragmented systems to coordinated operations, where agents are assigned roles, not just asked questions. In that environment, performance improves not because the models are perfect, but because the architecture is designed for reliability. [20]
That also means reframing how we think about deployment. It’s about digital colleagues and responsibility in context. It’s not only about what a model can do on paper, but how that capability is coordinated within a system that knows when to trust it, when to supervise it, and when to hand over.
Before agents can go autonomous, they need supervision.
Autonomy is earned. In enterprise settings, that means agents must first operate under structure, where expectations are clear, actions are tracked, and escalation paths exist. Not because the model can’t reason, but because it’s still learning where its responsibility ends.
Structured training matters. Not just fine-tuning on more examples, but role-specific onboarding that reflects the organisation’s systems, policies, and constraints. Evaluation metrics help, but they’re not enough. Agents need performance reviews, aligned with actual workflows and monitored over time, as described in Performance Reviews for Digital Colleagues [21]. Reward functions offer incentives, but without feedback loops, there’s no mechanism for correction. And blind trust in outputs, especially in edge cases, only increases risk.
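As a sketch of what such a review loop might look like in practice, the snippet below aggregates monitored outcomes per task and proposes scope adjustments. The task names, thresholds, and outcome records are assumptions for illustration, not a prescribed metric.

```python
# Minimal sketch: a recurring performance review for a digital colleague.
# Task names, thresholds, and outcome records are illustrative assumptions.
from collections import defaultdict

EXPAND_AT = 0.95    # reliably handled: candidate for a wider span of responsibility
RESTRICT_AT = 0.80  # below this, pull the task back behind human review

# Outcome log gathered from monitoring: (task, succeeded?) over the review period.
outcomes = [
    ("answer_billing_question", True), ("answer_billing_question", True),
    ("issue_refund", True), ("issue_refund", False), ("issue_refund", False),
]


def review(records: list[tuple[str, bool]]) -> dict[str, str]:
    per_task: dict[str, list[bool]] = defaultdict(list)
    for task, ok in records:
        per_task[task].append(ok)

    decisions = {}
    for task, results in per_task.items():
        rate = sum(results) / len(results)
        if rate >= EXPAND_AT:
            decisions[task] = "keep / consider expanding scope"
        elif rate < RESTRICT_AT:
            decisions[task] = "restrict: route through human review"
        else:
            decisions[task] = "keep under supervision"
    return decisions


print(review(outcomes))
# billing questions stay in scope; refunds go back behind human review
```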
“Autonomy in enterprise AI isn’t granted, it’s earned. A digital colleague needs structure before it can operate independently: scoped responsibilities, escalation paths, and feedback loops. Without that, you don’t have autonomy, you have unchecked output. Maturity isn’t about model size, it’s about how much responsibility your system is designed to support.”
Kristofer Kaltea, Algorithma
Escalation protocols need to be part of the system. So does legal framing. If we expect agents to take on meaningful tasks, we need to address accountability explicitly, not as an afterthought, but as a design input. That includes knowing who’s responsible when things go wrong, and what mechanisms are in place to intervene.
As we’ve outlined in They’re Employees, Not Endpoints and When the Agent Takes Over, the goal isn’t full autonomy at any cost. It’s appropriate autonomy, earned through reliability, and bounded by governance [22, 23]. The real measure of enterprise AI maturity isn’t how big the model is. It’s how much responsibility the system around it is prepared to assign.
The next phase of enterprise AI won’t be model-first. It will be architecture-first.
If you want agents that perform, the work doesn’t start with the model. It starts with the system around it. Here’s how to move forward:
Define the role before you deploy the agent: Clarify the task. What system is it working in? What tools does it need access to? What are the boundaries of its responsibility? (A minimal sketch of such a role definition follows this list.)
Ground the agent in your real data and workflows: Don’t rely on memory. Connect the model to trusted sources, via retrieval, API integrations, or internal documentation pipelines.
Design escalation paths and refusal logic from day one: Give the agent structured ways to say “I don’t know” or escalate cases. No system that can’t defer should be in production.
Instrument for oversight: Add logging, confidence thresholds, audit trails, and human review options. Track decisions and flag edge cases. Oversight isn’t optional.
Establish feedback and iteration loops: Monitor performance in context. Set up regular review cycles, just like you would with a new hire. Update scope based on what the agent can reliably handle.
Frame governance explicitly: Assign accountability. Make it clear who owns outcomes when agents act, and what happens when things go wrong.
Scale span of responsibility, not just capability: Start narrow. Expand only as the agent earns trust. Maturity isn’t how much it can say, it’s how much it can own, safely.
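As referenced in the first step above, here is a minimal sketch of what a role definition for a digital colleague could look like when the checklist is captured as configuration. Every field and value is a hypothetical example; the point is that scope, escalation, thresholds, review cadence, and accountability are written down before the agent is deployed, and widened only as trust is earned.

```python
# Minimal sketch: the checklist above captured as a role definition.
# All fields and values are hypothetical examples.
from dataclasses import dataclass, field


@dataclass
class DigitalColleagueRole:
    name: str
    systems: list[str]            # systems of record it is grounded in
    tools: list[str]              # actions it may take on its own
    escalation_path: str          # where low-confidence or refused cases go
    confidence_threshold: float   # below this, defer instead of acting
    review_cadence_days: int      # how often its scope is re-evaluated
    accountable_owner: str        # the human who owns the outcomes
    out_of_scope: list[str] = field(default_factory=list)


support_agent = DigitalColleagueRole(
    name="tier-1-support",
    systems=["ticketing", "order-history", "policy-docs"],
    tools=["draft_reply", "lookup_order"],         # note: no refund authority yet
    escalation_path="human-support-queue",
    confidence_threshold=0.8,
    review_cadence_days=30,
    accountable_owner="head-of-customer-service",
    out_of_scope=["refunds", "legal statements"],  # earned later, not assumed
)

print(support_agent)
```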
Train the colleague. Build the system. And design the enterprise that’s ready to work with them.