Microsoft Research Introduces CORPGEN To Manage Multi Horizon Tasks For Autonomous AI Agents Using Hierarchical Planning and Memory


Microsoft researchers have introduced CORPGEN, an architecture-agnostic framework designed to manage the complexities of realistic organizational work through autonomous digital employees. While existing benchmarks evaluate AI agents on isolated, single tasks, real-world corporate environments require managing dozens of concurrent, interleaved tasks with complex dependencies. The research team identifies this distinct problem class as Multi-Horizon Task Environments (MHTEs).

The Performance Gap in MHTEs

Empirical testing reveals that baseline computer-using agents (CUAs) suffer significant performance degradation when moved from single-task scenarios to MHTEs. Across three independent CUA implementations, completion rates dropped from 16.7% at 25% task load to 8.7% at 100% load.

The research team identified four fundamental failure modes causing this decline:

Context Saturation: Context requirements grow O(N) with task count rather than O(1), rapidly exceeding the token window capacity.

Memory Interference: Information from one task often contaminates reasoning about another when multiple tasks share a single context window.

Dependency Graph Complexity: Corporate tasks form Directed Acyclic Graphs (DAGs) rather than linear chains, requiring complex topological reasoning.

Reprioritization Overhead: Decision complexity increases to O(N) per cycle because agents must constantly re-evaluate priorities across all active tasks.
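The dependency-graph failure mode above can be made concrete with a small sketch. The task names here are hypothetical, but the underlying requirement is exactly what the article describes: corporate tasks form a DAG, so a valid execution order must be a topological ordering of it. Python's standard-library `graphlib` handles this directly:

```python
from graphlib import TopologicalSorter

# Hypothetical corporate task DAG: each task maps to the set of tasks
# it depends on (its predecessors).
task_deps = {
    "send_report": {"draft_report", "collect_figures"},
    "draft_report": {"gather_data"},
    "collect_figures": {"gather_data"},
    "gather_data": set(),
}

# A valid execution order must respect every dependency edge:
# gather_data first, send_report last.
order = list(TopologicalSorter(task_deps).static_order())
print(order)
```

Note that unlike a linear chain, the middle of the ordering is not unique (`draft_report` and `collect_figures` can run in either order), which is part of why reasoning over DAGs is harder for an agent than following a checklist.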

https://arxiv.org/pdf/2602.14229

The CORPGEN Architecture

To address these failures, CORPGEN implements Multi-Objective Multi-Horizon Agent (MOMA) capabilities through four primary architectural mechanisms.

(a) Hierarchical Planning

Strategic coherence is maintained through goal decomposition across three temporal scales:

Strategic Objectives (Monthly): High-level goals and milestones based on agent identity and role.

Tactical Plans (Daily): Actionable tasks for specific applications with priority rankings.

Operational Actions (Per-Cycle): Individual tool calls selected based on current state and retrieved memory.
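The three planning horizons above can be sketched as a simple nested data model. This is an illustrative assumption about the structure, not the paper's actual API; all class and field names are invented for clarity:

```python
from dataclasses import dataclass, field

@dataclass
class OperationalAction:
    """Per-cycle: a single tool call chosen from current state and memory."""
    tool: str
    args: dict

@dataclass
class TacticalPlan:
    """Daily: ranked actionable tasks targeting a specific application."""
    application: str
    priority: int
    actions: list[OperationalAction] = field(default_factory=list)

@dataclass
class StrategicObjective:
    """Monthly: a high-level goal derived from the agent's identity and role."""
    goal: str
    plans: list[TacticalPlan] = field(default_factory=list)

# Decomposition flows top-down: objective -> daily plan -> tool calls.
obj = StrategicObjective(goal="Close Q3 reporting")
obj.plans.append(TacticalPlan(application="Excel", priority=1))
obj.plans[0].actions.append(
    OperationalAction(tool="open_workbook", args={"path": "q3.xlsx"})
)
```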

(b) Sub-Agent Isolation

Complex operations, such as GUI automation or research, are isolated into modular sub-agents. These autonomous agents operate in their own context scopes and return only structured results to the host agent, preventing cross-task memory contamination.
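A minimal sketch of this isolation pattern (function and field names are illustrative, not from the paper): the sub-agent keeps its verbose intermediate work in a private scope and returns only a compact structured result, so none of its reasoning leaks into the host agent's context.

```python
def research_subagent(query: str) -> dict:
    """Runs in its own context scope; intermediate work is discarded on return."""
    local_context = []                             # private to the sub-agent
    local_context.append(f"searching for: {query}")
    local_context.append("reading and summarizing pages ...")  # verbose work
    # Only this structured result crosses the boundary back to the host.
    return {"query": query, "answer": "summary of findings", "sources": 12}

host_context = []
result = research_subagent("Q3 revenue figures")
host_context.append(result)   # host sees one small record, not the full trace
```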

(c) Tiered Memory Architecture

The system utilizes a three-layer memory structure to manage state:

Working Memory: Intended for immediate reasoning, this layer resets each cycle.

Structured Long-Term Memory (LTM): Stores typed artifacts such as plans, summaries, and reflections.

Semantic Memory: Uses Mem0 to support similarity-based retrieval over unstructured past context using embeddings.
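The three tiers above can be sketched as follows. The class and method names are assumptions made for illustration (the actual system uses Mem0 for the semantic layer, which a real implementation would call instead of the plain list used here):

```python
class TieredMemory:
    """Illustrative three-layer memory: working, structured LTM, semantic."""

    def __init__(self):
        self.working = []     # immediate reasoning; reset every cycle
        self.ltm = {}         # typed artifacts: plans, summaries, reflections
        self.semantic = []    # (embedding, text) pairs for similarity retrieval

    def store_artifact(self, kind: str, artifact) -> None:
        # Structured LTM keys artifacts by type so they can be retrieved by kind.
        self.ltm.setdefault(kind, []).append(artifact)

    def end_cycle(self) -> None:
        # Working memory is ephemeral: it is simply cleared between cycles.
        self.working.clear()

mem = TieredMemory()
mem.working.append("current reasoning step")
mem.store_artifact("plan", {"day": 1, "tasks": ["email", "report"]})
mem.end_cycle()
```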

(d) Adaptive Summarization

To bound context growth, CORPGEN employs rule-based compression. When context length exceeds 4,000 tokens, ‘critical content’ (such as tool calls and state changes) is preserved verbatim, while ‘routine content’ (intermediate reasoning) is compressed into structured summaries.
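The rule described above can be sketched as a simple filter. The 4,000-token threshold comes from the article; the entry format and whitespace-based token counting are simplified assumptions standing in for a real tokenizer:

```python
MAX_TOKENS = 4000

def compress_context(entries, count_tokens=lambda s: len(s.split())):
    """Keep critical entries verbatim; fold routine reasoning into a summary."""
    total = sum(count_tokens(e["text"]) for e in entries)
    if total <= MAX_TOKENS:
        return entries                          # under budget: keep everything
    # Critical content (tool calls, state changes) is preserved verbatim.
    kept = [e for e in entries if e["kind"] in ("tool_call", "state_change")]
    # Routine content (intermediate reasoning) is collapsed into a summary.
    routine = [e for e in entries if e["kind"] == "reasoning"]
    summary = {"kind": "summary",
               "text": f"compressed {len(routine)} reasoning entries"}
    return kept + [summary]

entries = [
    {"kind": "tool_call", "text": "open browser"},
    {"kind": "reasoning", "text": "step " * 5000},   # well over budget
]
out = compress_context(entries)
```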

Experimental Results and Learning

Across three CUA backends (UFO2, OpenAI CUA, and a hierarchical variant), CORPGEN achieved up to a 3.5x improvement over the baselines, reaching a 15.2% completion rate at 100% load compared to 4.3% for standalone UFO2.

Ablation studies indicate that experiential learning provides the largest performance gain. This mechanism distills successful task executions into canonical trajectories, which are then indexed in a FAISS database. At execution time, similar trajectories are retrieved as few-shot examples to bias action selection toward validated patterns.
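The retrieval step can be sketched as follows. The paper indexes trajectories in FAISS with learned embeddings; here a plain cosine-similarity scan stands in for the FAISS index, and `embed` is a toy bag-of-letters stand-in for a real embedding model, so only the overall retrieve-as-few-shot pattern is faithful:

```python
import math

def embed(text: str) -> list[float]:
    # Toy bag-of-letters embedding; a real system would use a learned model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Canonical trajectories distilled from past successful executions
# (hypothetical examples).
trajectories = {
    "send weekly status email": ["open_mail", "draft", "send"],
    "update budget spreadsheet": ["open_excel", "edit_cells", "save"],
}

def retrieve(task: str, k: int = 1) -> list:
    """Return the k most similar stored trajectories as few-shot examples."""
    q = embed(task)
    ranked = sorted(trajectories, key=lambda t: cosine(q, embed(t)), reverse=True)
    return [trajectories[t] for t in ranked[:k]]

few_shot = retrieve("send a status email this week")
```

At execution time the retrieved trajectories would be injected into the prompt as few-shot examples, biasing the agent toward action sequences that have already been validated.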

The research team observed a significant discrepancy between evaluation methods. Artifact-based judgment (inspecting generated files and outputs) achieved a 90% agreement rate with human labels. In contrast, trace-based LLM judgment (relying on screenshots and execution logs) achieved only 40% agreement. This suggests that current benchmarks may systematically underestimate agent performance by relying on limited visual traces rather than the actual artifacts produced.

Key Takeaways

Identification of Multi-Horizon Task Environments (MHTEs): The research team defines a new class of problems called MHTEs, where agents must manage dozens of interleaved, long-horizon tasks (45+ tasks, 500-1500+ steps) within a single persistent context. This differs from traditional benchmarks that evaluate single tasks in isolation.

Discovery of Catastrophic Performance Degradation: Standard computer-using agents (CUAs) experience a ‘catastrophic’ drop in performance when task load increases, with completion rates falling from 16.7% at 25% load to 8.7% at 100% load.

Four Fundamental Failure Modes: The researchers identified why current agents fail under load: context saturation (O(N) growth), memory interference (task conflation), dependency complexity (managing Directed Acyclic Graphs), and reprioritization overhead (O(N) decision complexity).

Architectural Mitigation via CORPGEN: The CORPGEN framework addresses these failures through four core mechanisms: hierarchical planning for goal alignment, sub-agent isolation to prevent memory contamination, tiered memory (working, structured, and semantic), and adaptive summarization to manage token limits.

Significant Performance Gains through Experiential Learning: Evaluation across multiple backends showed that CORPGEN can improve performance by up to 3.5x over baselines. Ablation studies revealed that experiential learning—reusing verified successful trajectories—provides the largest performance boost among all architectural components.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.


