Most agent frameworks still run a predefined Reason-Act-Observe loop, so the agent can only use the tools that are injected into the prompt. This works for small tasks, but it breaks down when the toolset is large, the task is long, or the agent must change strategy mid-reasoning. A team from Renmin University of China and Xiaohongshu proposes DeepAgent, an end-to-end deep reasoning agent that keeps all of this inside one coherent reasoning process.


Unified Reasoning With On-Demand Tool Discovery
DeepAgent lets the model output four action types directly in text: internal thought, tool search, tool call, and memory fold. When the agent decides to search, it queries a dense index built over tool descriptions from large registries, for example 16,000+ RapidAPI tools and 3,912 ToolHop tools, and receives only the top-ranked tools back in context. This makes tool access dynamic: the model does not depend on a front-loaded tool list, and it stays aligned with real environments where tools change.
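To make the mechanism concrete, here is a minimal sketch of such a unified loop in Python. The action tags, object interfaces, and function names are illustrative assumptions, not the paper's exact format.

```python
# Minimal sketch of a unified reasoning loop with on-demand tool discovery.
# The action tags (<search>, <call>, <fold>, <answer>) and all interfaces
# are hypothetical stand-ins for DeepAgent's actual implementation.
import re

def run_agent(llm, retriever, tool_executor, fold_fn, task, max_steps=50):
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm.generate(context)  # free-form reasoning, optionally ending in an action tag
        context += step
        if m := re.search(r"<search>(.*?)</search>", step, re.S):
            # Dense retrieval over the full registry; only top-ranked tools enter context.
            hits = retriever.top_k(m.group(1), k=5)
            context += "\nRetrieved tools:\n" + "\n".join(hits)
        elif m := re.search(r"<call>(.*?)</call>", step, re.S):
            result = tool_executor(m.group(1))  # execute the structured tool call
            context += f"\nTool result: {result}\n"
        elif "<fold>" in step:
            context = fold_fn(context)  # compress history; see the memory-folding sketch below
        elif m := re.search(r"<answer>(.*?)</answer>", step, re.S):
            return m.group(1)
    return None
```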
Autonomous Memory Folding for Long-Horizon Tasks
Long sequences of tool calls, web results, and code responses will overflow the context. DeepAgent solves this with an autonomous memory folding step. When the model emits the fold token, an auxiliary LLM compresses the full history into three memories: Episodic Memory, which records task events; Working Memory, which records the current sub-goal and recent issues; and Tool Memory, which records tool names, arguments, and outcomes. These memories are fed back as structured text, so the agent continues from a compact but information-rich state.
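A minimal sketch of the folding step follows, assuming an auxiliary LLM and a plain prompt template; the three memory names follow the paper, while the prompt wording and function signature are assumptions.

```python
# Sketch of autonomous memory folding: an auxiliary LLM compresses the raw
# interaction history into three structured memories. The prompt text and
# the aux_llm interface are illustrative assumptions.
FOLD_PROMPT = """Summarize the interaction history below into three sections:
Episodic Memory: key task events so far.
Working Memory: the current sub-goal and recent issues.
Tool Memory: tool names, arguments, and outcomes.

History:
{history}
"""

def fold_memory(history: str, aux_llm) -> str:
    """Replace a long raw history with a compact, information-rich state."""
    folded = aux_llm.generate(FOLD_PROMPT.format(history=history))
    return f"[Folded memory]\n{folded}\n"
```

In the loop sketch above, `fold_fn` could simply wrap this function, for example `fold_fn = lambda ctx: fold_memory(ctx, aux_llm)`.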
ToolPO: Reinforcement Learning for Tool Use
Supervised traces do not teach robust tool use, because correct tool calls are only a few tokens inside a long generation. The research team introduces Tool Policy Optimization (ToolPO) to fix this. ToolPO runs rollouts against LLM-simulated APIs, which keeps training stable and cheap, attributes reward to the exact tool-call tokens (tool-call advantage attribution), and optimizes a clipped, PPO-style objective. This is how the agent learns not only to call tools, but also to decide when to search and when to fold memory.
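The core of the objective can be sketched as a clipped PPO surrogate with a token-level mask over tool-call tokens. The tensor shapes, reward shaping, and coefficients below are assumptions for illustration, not the paper's exact loss.

```python
# Sketch of a clipped, PPO-style objective with tool-call advantage attribution:
# tool-call tokens receive a local advantage for call correctness, and every
# token shares the advantage from the final task reward.
import torch

def toolpo_loss(logp_new, logp_old, call_adv, tool_call_mask,
                final_adv, clip_eps=0.2):
    """
    logp_new, logp_old : [T] per-token log-probs under the new / old policy
    call_adv           : [T] advantages for tool-call correctness
    tool_call_mask     : [T] 1.0 on tokens that form a tool call, else 0.0
    final_adv          : scalar advantage from the final task outcome
    """
    adv = tool_call_mask * call_adv + final_adv
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```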


Benchmarks: Labeled Tools vs Open-Set Tools
The research team evaluates on five general tool-use benchmarks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and four downstream tasks (ALFWorld, WebShop, GAIA, HLE). In the labeled-tool setting, where every method is given the exact tools it needs, DeepAgent-32B-RL with a QwQ-32B backbone reports 69.0 on ToolBench, 75.3 on API-Bank, 89.0 on TMDB, 75.4 on Spotify, and 51.3 on ToolHop, the strongest 32B-level result across all five datasets. Workflow baselines such as ReAct and CodeAct can match individual datasets, for example ReAct with strong backbones scores high on TMDB and Spotify, but none of them stay high on all five, so the fair summary is that DeepAgent is more uniform, not that the others are always low.
In the open-set retrieval setting, which is the realistic one, DeepAgent must first find tools and then call them. Here DeepAgent-32B-RL reaches 64.0 on ToolBench and 40.6 on ToolHop, while the strongest workflow baselines reach 55.0 on ToolBench and 36.2 on ToolHop, so the end-to-end agent still holds the lead. The research team also shows that autonomous tool retrieval by itself lifts workflow agents, but DeepAgent gains more, which confirms that the architecture and the training are matched to large toolsets.


Downstream Environments
On ALFWorld, WebShop, GAIA, and HLE, all with a 32B reasoning model, DeepAgent reports 91.8 percent success on ALFWorld, 34.4 percent success and a 56.3 score on WebShop, 53.3 on GAIA, and a higher score than workflow agents on HLE. These tasks are longer and noisier, so the combination of memory folding and ToolPO is the likely source of the gap.
Key Takeaways
DeepAgent keeps the whole agent loop inside one reasoning stream: the model can think, search for tools, call them, and continue, so it is not limited to a fixed ReAct-style workflow.
It uses dense retrieval over large tool registries, 16,000+ RapidAPI tools and about 3,900 ToolHop tools, so tools do not have to be pre-listed in the prompt; they are discovered on demand (see the retrieval sketch after these takeaways).
The autonomous memory folding module compresses long interaction histories into episodic, working, and tool memories, which prevents context overflow and keeps long-horizon reasoning stable.
Tool Policy Optimization (ToolPO) trains tool use end-to-end with simulated APIs and token-level advantage attribution, so the agent learns to issue correct tool calls, not only to reach the final answer.
Across five tool benchmarks and four downstream tasks, DeepAgent at 32B scale is more consistent than workflow baselines in both labeled-tool and open-set settings, especially on ToolBench and ToolHop, where tool discovery matters most.
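As referenced in the retrieval takeaway above, here is a minimal sketch of dense retrieval over tool descriptions, using an off-the-shelf sentence-embedding model as a stand-in; the paper's actual encoder and index are not reproduced here.

```python
# Minimal dense retriever over tool descriptions; the encoder choice and
# cosine-similarity scoring are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

class ToolRetriever:
    def __init__(self, tool_descriptions, model_name="all-MiniLM-L6-v2"):
        self.tools = tool_descriptions
        self.encoder = SentenceTransformer(model_name)
        # Pre-encode the full registry once; queries are encoded at search time.
        self.index = self.encoder.encode(tool_descriptions, normalize_embeddings=True)

    def top_k(self, query: str, k: int = 5):
        q = self.encoder.encode([query], normalize_embeddings=True)[0]
        scores = self.index @ q  # cosine similarity on normalized embeddings
        return [self.tools[i] for i in np.argsort(-scores)[:k]]
```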


DeepAgent is a practical step toward agent architectures that do not depend on fixed tool prompts, because it unifies autonomous thinking, dense tool retrieval over 16,000+ RapidAPI tools and 3,900+ ToolHop tools, structured tool calling, and memory folding in a single loop. Using LLM-simulated APIs in ToolPO is an engineering choice, but it solves the latency and instability problems that hurt prior tool agents. The evaluation shows consistent 32B-level gains in both labeled-tool and open-set settings, not isolated peaks. This release makes large tool spaces genuinely usable for LLM agents. Overall, DeepAgent confirms that end-to-end tool agents with memory and RL are emerging as the default pattern.
Check out the Paper and the GitHub repo for more details.

