
Agentic Workflows To Balance Autonomy with Human Oversight

Agentic workflows are changing how knowledge workers interact with technology.

5 min read · May 29, 2025


Agentic workflows use autonomous AI Agents to perform complex tasks with minimal human intervention, but not without oversight.

While AI Agents introduce a new level of autonomy, they still require strong human supervision to ensure accuracy and reliability.

What Are Agentic Workflows?

Agentic workflows refer to processes where AI Agents — powered by advanced large language models (LLMs) — execute multi-step tasks autonomously.

Unlike traditional LLMs that generate text in response to prompts, AI Agents can

  1. plan,
  2. reason, and
  3. interact with external tools, such as APIs, databases, or web browsers, to achieve specific goals.

For example, an AI Agent might book a flight, draft an email, or scrape data from a website by breaking down the task into actionable steps.

Agentic workflows take a more modular approach, built on AI Agent frameworks that integrate LLMs with toolkits for decision-making and task orchestration.
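The plan/reason/act-with-tools loop described above can be sketched as a simple control loop. Everything here is an illustrative assumption rather than any specific framework's API: `call_llm` stands in for a real model call, `search_web` for a real tool, and the `"tool: input"` action format is a made-up convention.

```python
# A minimal sketch of an agentic loop: the agent asks the LLM for the
# next action, runs the chosen tool, feeds the observation back, and
# repeats until the goal is met or a step budget runs out.

def search_web(query: str) -> str:
    """Hypothetical tool: return search results for a query."""
    return f"results for: {query}"

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call: returns either 'tool_name: input'
    or 'DONE: final answer'. Stubbed out here."""
    return "DONE: task complete"

def run_agent(goal: str, max_steps: int = 5) -> str:
    tools = {"search_web": search_web}
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = call_llm("\n".join(history))
        if action.startswith("DONE:"):
            return action.removeprefix("DONE:").strip()
        name, _, arg = action.partition(":")
        observation = tools[name.strip()](arg.strip())
        history.append(f"Action: {action}\nObservation: {observation}")
    return "step budget exhausted"
```

Real frameworks add structured tool schemas, error handling, and memory on top of this skeleton, but the core plan-act-observe cycle is the same.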


The Role of Human Supervision

While AI Agents can operate independently, the study highlights that their autonomy is most effective when paired with human supervision.

Humans

  1. provide context,
  2. set objectives, and
  3. intervene

when AI Agents misinterpret tasks or produce inaccurate results.

For instance, an AI Agent tasked with browsing the web for market research might misinterpret ambiguous queries or fetch outdated data.

Human supervisors can clarify instructions, verify sources, or adjust the agent’s approach to ensure alignment with the intended outcome.

Challenges in Accuracy & Insights from Benchmarks

The study proposes a taxonomy of approximately 60 benchmarks developed between 2019 and 2025 to evaluate AI Agents across domains like general knowledge reasoning, mathematical problem-solving, code generation and multimodal tasks.

In my article The Hard Truth About AI Agents & Accuracy I look at the limitations of AI Agents in terms of accuracy and reliability, despite the hype surrounding their capabilities.

Key points include:

Speech Recognition Benchmark

I note that accurate speech recognition, now embedded in apps like Grok and ChatGPT, has reached human-level performance (95%+ accuracy), with Google’s voice recognition surpassing this threshold as early as 2017.

However, this success is not yet mirrored in general AI Agent performance.

AI Agent Accuracy

Current AI Agents struggle with low success rates.

For instance, a top-performing agent like S2 achieves only a 34.5% success rate with 50 steps, which means it fails more than 65% of the time.

Even Claude, the leader in computer-use tasks, scores only 14.9% on the OSWorld benchmark, far below human-level performance (70–75%).

OpenAI’s Operator achieves 30–50% accuracy in web and computer tasks, still lagging behind human capability.

Overfitting and Benchmarks

I am quite sure researchers often overfit AI models to score high on benchmarks, which may not reflect real-world performance.

This inflates perceived capabilities and masks true limitations.

Complexity and Cost

AI Agents are becoming more complex, integrating functions like computer vision, task decomposition and web search.

However, this increases computational costs, as AI Agents often call the underlying language model multiple times per task.
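The cost impact of those repeated model calls is easy to see with back-of-the-envelope arithmetic. All prices and token counts below are illustrative assumptions, not real vendor pricing.

```python
# Rough per-task cost when an agent loops through its underlying model
# several times, compared with a single plain LLM prompt.

def task_cost(calls: int, tokens_per_call: int, usd_per_1k_tokens: float) -> float:
    """Total cost of one task that makes `calls` model calls."""
    return calls * tokens_per_call / 1000 * usd_per_1k_tokens

# Assumed numbers: 2,000 tokens per call at $0.01 per 1k tokens.
single_prompt = task_cost(calls=1, tokens_per_call=2_000, usd_per_1k_tokens=0.01)
agent_task = task_cost(calls=12, tokens_per_call=2_000, usd_per_1k_tokens=0.01)

print(single_prompt, agent_task)  # 0.02 vs 0.24: ~12x the cost per task
```

An agent that loops a dozen times pays roughly a dozen times the per-prompt cost, and agent loops also tend to carry growing context, so real costs scale worse than this linear sketch.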

The article suggests a shift toward granular frameworks like LangChain and LlamaIndex for better efficiency.

Human Supervision

AI Agents can operate with weak human supervision, allowing minor corrections along the way, but their high cost and limited accuracy make them less viable for widespread enterprise use compared to more reliable agentic workflows.

While AI Agents are advancing, their current accuracy and cost inefficiencies make them unsuitable for many production environments.

Enhancing Accuracy with Human-AI Collaboration

To address these accuracy challenges, the study advocates for stronger human-AI collaboration protocols.

By combining the scalability of AI Agents with human judgment, agentic workflows can achieve higher reliability.

For example, frameworks that allow humans to set clear parameters or review intermediate outputs can catch errors early.
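A review checkpoint like this can be expressed as a small gate between agent steps. This is a sketch under assumptions: the `reviewer` callback and its `("approve" | "edit" | "reject", text)` return convention are hypothetical, standing in for whatever review UI or queue a real workflow would use.

```python
# A human-in-the-loop gate: an agent's intermediate output is shown to
# a reviewer, who approves it, replaces it with a corrected version,
# or rejects it before the workflow continues.

from typing import Callable, Tuple

def with_review(draft: str, reviewer: Callable[[str], Tuple[str, str]]) -> str:
    """Pass an intermediate output through a human review gate."""
    verdict, text = reviewer(draft)
    if verdict == "approve":
        return draft
    if verdict == "edit":
        return text  # human-corrected version replaces the draft
    raise ValueError(f"draft rejected: {text}")

# Usage: a reviewer that catches outdated data fetched by the agent.
def reviewer(draft: str) -> Tuple[str, str]:
    if "2023" in draft:  # e.g. the agent pulled stale figures
        return ("edit", draft.replace("2023", "2025"))
    return ("approve", draft)

final = with_review("Market size as of 2023: ...", reviewer)
```

Placing gates like this after each major step is what lets the error be caught early, before downstream steps build on a flawed intermediate result.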

Additionally, iterative improvements in benchmark design — focusing on real-world scenarios — can help developers fine-tune agents for better performance.

Conclusion

Agentic workflows represent a leap toward autonomous AI systems capable of tackling complex tasks with minimal human input.

However, as the study highlights, their autonomy is most effective when paired with robust human supervision.

Benchmarks reveal that while AI Agents show promise in computer use and web browsing, accuracy issues persist, particularly in dynamic or ambiguous scenarios.

Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.


Written by Cobus Greyling

I’m passionate about exploring the intersection of AI & language. www.cobusgreyling.com
