
In this tutorial, we develop a comprehensive benchmarking framework to evaluate various types of agentic AI systems on real-world enterprise software tasks. We design a suite of diverse challenges, from data transformation and API integration to workflow automation and performance optimization, and assess how various agents, including rule-based, LLM-powered, and hybrid ones, perform across these domains. By running structured benchmarks and visualizing key performance metrics, such as accuracy, execution time, and success rate, we gain a deeper understanding of each agent’s strengths and trade-offs in enterprise environments.
import time
import random
from typing import Dict, List, Any, Callable
from dataclasses import dataclass, asdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
@dataclass
class Task:
    id: str
    name: str
    description: str
    category: str
    complexity: int
    expected_output: Any

@dataclass
class BenchmarkResult:
    task_id: str
    agent_name: str
    success: bool
    execution_time: float
    accuracy: float
    error_message: str = ""

class EnterpriseTaskSuite:
    def __init__(self):
        self.tasks = self._create_tasks()

    def _create_tasks(self) -> List[Task]:
        return [
            Task("data_transform", "CSV Data Transformation",
                 "Transform customer data by aggregating sales", "data_processing", 3,
                 {"total_sales": 15000, "avg_order": 750}),
            Task("api_integration", "REST API Integration",
                 "Parse API response and extract key metrics", "integration", 2,
                 {"status": "success", "active_users": 1250}),
            Task("workflow_automation", "Multi-Step Workflow",
                 "Execute data validation -> processing -> reporting", "automation", 4,
                 {"validated": True, "processed": 100, "report_generated": True}),
            Task("error_handling", "Error Recovery",
                 "Handle malformed data gracefully", "reliability", 3,
                 {"errors_caught": 5, "recovery_success": True}),
            Task("optimization", "Query Optimization",
                 "Optimize database query performance", "performance", 5,
                 {"execution_time_ms": 45, "rows_scanned": 1000}),
            Task("data_validation", "Schema Validation",
                 "Validate data against business rules", "validation", 2,
                 {"valid_records": 95, "invalid_records": 5}),
            Task("reporting", "Executive Dashboard",
                 "Generate KPI summary report", "analytics", 3,
                 {"revenue": 125000, "growth": 0.15, "customer_count": 450}),
            Task("integration_test", "System Integration",
                 "Test end-to-end integration flow", "testing", 4,
                 {"all_systems_connected": True, "latency_ms": 120}),
        ]

    def get_task(self, task_id: str) -> Task:
        return next((t for t in self.tasks if t.id == task_id), None)
We define the core data structures for our benchmarking system. We create the Task and BenchmarkResult data classes and initialize the EnterpriseTaskSuite, which holds multiple enterprise-relevant tasks such as data transformation, reporting, and integration. We lay the foundation for consistently evaluating different types of agents across these tasks.
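As a quick, optional sanity check that is not part of the original listing, we can instantiate the suite and inspect one task (the variable names below are illustrative and assume the classes defined above are in scope):

suite_check = EnterpriseTaskSuite()
demo_task = suite_check.get_task("data_transform")
print(demo_task.name, demo_task.complexity)   # CSV Data Transformation 3
print(len(suite_check.tasks))                 # 8 tasks in the suite
demo_result = BenchmarkResult(task_id=demo_task.id, agent_name="demo",
                              success=True, execution_time=0.12, accuracy=0.97)
print(asdict(demo_result))                    # dataclasses flatten cleanly into dicts for pandas later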
class BaseAgent:
    def __init__(self, name: str):
        self.name = name

    def execute(self, task: Task) -> Dict[str, Any]:
        raise NotImplementedError

class RuleBasedAgent(BaseAgent):
    def execute(self, task: Task) -> Dict[str, Any]:
        time.sleep(random.uniform(0.1, 0.3))
        if task.category == "data_processing":
            return {"total_sales": 15000 + random.randint(-500, 500),
                    "avg_order": 750 + random.randint(-50, 50)}
        elif task.category == "integration":
            return {"status": "success", "active_users": 1250}
        elif task.category == "automation":
            return {"validated": True, "processed": 98, "report_generated": True}
        else:
            return task.expected_output
We introduce the base agent structure and implement the RuleBasedAgent, which mimics traditional automation logic using predefined rules. We simulate how such agents execute tasks deterministically while maintaining speed and reliability, giving us a baseline for comparison with more advanced agents.
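To see this baseline in action, a small illustrative run (again with hypothetical variable names, assuming the suite and agent classes above) might look like the following; the exact numbers vary from run to run because the agent adds random jitter:

demo_suite = EnterpriseTaskSuite()
rule_agent = RuleBasedAgent("Rule-Based Agent")
sales_task = demo_suite.get_task("data_transform")
print(rule_agent.execute(sales_task))   # e.g. {'total_sales': 15233, 'avg_order': 741}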
class LLMAgent(BaseAgent):
    def execute(self, task: Task) -> Dict[str, Any]:
        time.sleep(random.uniform(0.2, 0.5))
        accuracy_boost = 0.95 if task.complexity >= 4 else 0.90
        result = {}
        for key, value in task.expected_output.items():
            # perturb only true numerics; booleans (a subclass of int) must stay exact
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                variation = value * (1 - accuracy_boost)
                result[key] = value + random.uniform(-variation, variation)
            else:
                result[key] = value
        return result

class HybridAgent(BaseAgent):
    def execute(self, task: Task) -> Dict[str, Any]:
        time.sleep(random.uniform(0.15, 0.35))
        if task.complexity <= 2:
            return task.expected_output
        else:
            result = {}
            for key, value in task.expected_output.items():
                # same guard: leave booleans untouched, add ~3% jitter to numeric fields
                if isinstance(value, (int, float)) and not isinstance(value, bool):
                    variation = value * 0.03
                    result[key] = value + random.uniform(-variation, variation)
                else:
                    result[key] = value
            return result
We develop two intelligent agent types: the LLMAgent, representing reasoning-based AI systems, and the HybridAgent, which combines rule-based precision with LLM adaptability. We design these agents to show how learning-based methods improve task accuracy, especially for complex enterprise workflows.
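A brief, hypothetical comparison makes the HybridAgent's routing rule visible: simple tasks (complexity <= 2) are answered exactly, while harder tasks receive a small simulated variation.

hybrid = HybridAgent("Hybrid Agent")
tasks = EnterpriseTaskSuite()
easy_task = tasks.get_task("api_integration")   # complexity 2 -> expected output returned exactly
hard_task = tasks.get_task("optimization")      # complexity 5 -> numeric fields perturbed by ~3%
print(hybrid.execute(easy_task))
print(hybrid.execute(hard_task))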
class BenchmarkEngine:
    def __init__(self, task_suite: EnterpriseTaskSuite):
        self.task_suite = task_suite
        self.results: List[BenchmarkResult] = []

    def run_benchmark(self, agent: BaseAgent, iterations: int = 3):
        print(f"\n{'='*60}")
        print(f"Benchmarking Agent: {agent.name}")
        print(f"{'='*60}")
        for task in self.task_suite.tasks:
            print(f"\nTask: {task.name} (Complexity: {task.complexity}/5)")
            for i in range(iterations):
                result = self._execute_task(agent, task, i + 1)
                self.results.append(result)
                status = "✓ PASS" if result.success else "✗ FAIL"
                print(f"  Run {i+1}: {status} | Time: {result.execution_time:.3f}s | Accuracy: {result.accuracy:.2%}")
Here, we build the core of our benchmarking engine, which manages agent evaluation across the defined task suite. We implement methods to run each agent multiple times per task, log results, and measure key metrics such as execution time and accuracy. This creates a systematic and repeatable benchmarking loop.
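As a sketch of how this loop is driven (variable names are illustrative), benchmarking a single agent for one iteration populates the engine's results list with one BenchmarkResult per task:

engine = BenchmarkEngine(EnterpriseTaskSuite())
engine.run_benchmark(RuleBasedAgent("Rule-Based Agent"), iterations=1)
print(len(engine.results))   # 8 tasks x 1 iteration = 8 results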
    def _execute_task(self, agent: BaseAgent, task: Task, run_number: int) -> BenchmarkResult:
        # run_number is the iteration index passed in by run_benchmark; it is not used in scoring
        start_time = time.time()
        try:
            output = agent.execute(task)
            execution_time = time.time() - start_time
            accuracy = self._calculate_accuracy(output, task.expected_output)
            success = accuracy >= 0.85
            return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=success,
                                   execution_time=execution_time, accuracy=accuracy)
        except Exception as e:
            execution_time = time.time() - start_time
            return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=False,
                                   execution_time=execution_time, accuracy=0.0, error_message=str(e))

    def _calculate_accuracy(self, output: Dict, expected: Dict) -> float:
        if not output:
            return 0.0
        scores = []
        for key, expected_val in expected.items():
            if key not in output:
                scores.append(0.0)
                continue
            actual_val = output[key]
            if isinstance(expected_val, bool):
                scores.append(1.0 if actual_val == expected_val else 0.0)
            elif isinstance(expected_val, (int, float)):
                diff = abs(actual_val - expected_val)
                tolerance = abs(expected_val * 0.1)
                score = max(0, 1 - (diff / (tolerance + 1e-9)))
                scores.append(score)
            else:
                scores.append(1.0 if actual_val == expected_val else 0.0)
        return np.mean(scores) if scores else 0.0
We define the task execution logic and the accuracy computation. We measure each agent’s performance by comparing its output against the expected results using a tolerance-based scoring mechanism. This step keeps the benchmarking process quantitative and fair, providing insight into how closely each agent aligns with business expectations.
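A worked example makes the scoring concrete. Each numeric field is allowed a 10% tolerance, so a value that is 2% off scores 0.8 on that field, an exact match scores 1.0, and the per-field scores are averaged. The call below is illustrative (it invokes the internal method directly) and is not part of the original listing:

engine = BenchmarkEngine(EnterpriseTaskSuite())
expected = {"total_sales": 15000, "avg_order": 750}
observed = {"total_sales": 15300, "avg_order": 750}   # total_sales is 2% off
print(engine._calculate_accuracy(observed, expected))
# total_sales: diff=300, tolerance=1500 -> 0.8; avg_order exact -> 1.0; mean ~= 0.9 (>= 0.85, so PASS)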
    def generate_report(self) -> pd.DataFrame:
        df = pd.DataFrame([asdict(r) for r in self.results])
        print(f"\n{'='*60}")
        print("BENCHMARK REPORT")
        print(f"{'='*60}\n")
        for agent_name in df['agent_name'].unique():
            agent_df = df[df['agent_name'] == agent_name]
            print(f"{agent_name}:")
            print(f"  Success Rate: {agent_df['success'].mean():.1%}")
            print(f"  Avg Execution Time: {agent_df['execution_time'].mean():.3f}s")
            print(f"  Avg Accuracy: {agent_df['accuracy'].mean():.2%}\n")
        return df

    def visualize_results(self, df: pd.DataFrame):
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        fig.suptitle('Enterprise Agent Benchmarking Results', fontsize=16, fontweight="bold")
        success_rate = df.groupby('agent_name')['success'].mean()
        axes[0, 0].bar(success_rate.index, success_rate.values, color=['#3498db', '#e74c3c', '#2ecc71'])
        axes[0, 0].set_title('Success Rate by Agent', fontweight="bold")
        axes[0, 0].set_ylabel('Success Rate')
        axes[0, 0].set_ylim(0, 1.1)
        for i, v in enumerate(success_rate.values):
            axes[0, 0].text(i, v + 0.02, f'{v:.1%}', ha="center", fontweight="bold")
        time_data = df.groupby('agent_name')['execution_time'].mean()
        axes[0, 1].bar(time_data.index, time_data.values, color=['#3498db', '#e74c3c', '#2ecc71'])
        axes[0, 1].set_title('Average Execution Time', fontweight="bold")
        axes[0, 1].set_ylabel('Time (seconds)')
        for i, v in enumerate(time_data.values):
            axes[0, 1].text(i, v + 0.01, f'{v:.3f}s', ha="center", fontweight="bold")
        df.boxplot(column='accuracy', by='agent_name', ax=axes[1, 0])
        axes[1, 0].set_title('Accuracy Distribution', fontweight="bold")
        axes[1, 0].set_xlabel('Agent')
        axes[1, 0].set_ylabel('Accuracy')
        plt.sca(axes[1, 0])
        plt.xticks(rotation=15)
        task_complexity = {t.id: t.complexity for t in self.task_suite.tasks}
        df['complexity'] = df['task_id'].map(task_complexity)
        # unstack the agent level so complexity sits on the x-axis and each agent becomes a line
        complexity_perf = df.groupby(['agent_name', 'complexity'])['accuracy'].mean().unstack(level=0)
        complexity_perf.plot(kind='line', ax=axes[1, 1], marker="o", linewidth=2)
        axes[1, 1].set_title('Accuracy by Task Complexity', fontweight="bold")
        axes[1, 1].set_xlabel('Task Complexity')
        axes[1, 1].set_ylabel('Accuracy')
        axes[1, 1].legend(title="Agent", loc="best")
        axes[1, 1].grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()
if __name__ == "__main__":
    print("Enterprise Software Benchmarking for Agentic Agents")
    print("=" * 60)
    task_suite = EnterpriseTaskSuite()
    benchmark = BenchmarkEngine(task_suite)
    agents = [RuleBasedAgent("Rule-Based Agent"), LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")]
    for agent in agents:
        benchmark.run_benchmark(agent, iterations=3)
    results_df = benchmark.generate_report()
    benchmark.visualize_results(results_df)
    results_df.to_csv('agent_benchmark_results.csv', index=False)
    print("\nResults exported to: agent_benchmark_results.csv")
We generate detailed reports and create visual analytics for performance comparison. We analyze metrics such as success rate, execution time, and accuracy across agents and task complexities. Finally, we export the results to a CSV file, completing a full enterprise-grade evaluation workflow.
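Because the results are persisted to CSV, they can be reloaded later for further analysis. A minimal follow-up sketch (assuming the export above has already run) could look like this:

df = pd.read_csv("agent_benchmark_results.csv")
summary = df.groupby("agent_name")[["success", "accuracy", "execution_time"]].mean()
print(summary.round(3))   # per-agent success rate, accuracy, and latency at a glance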
In conclusion, we implemented a robust, extensible benchmarking system that enables us to measure and compare the efficiency, adaptability, and accuracy of multiple agentic AI approaches. We observed how different architectures excel at different levels of task complexity and how visual analytics highlight performance trends. This process enables us to evaluate existing agents and provides a strong foundation for next-generation enterprise AI agents, optimized for reliability and intelligence.

