zbowling 12 hours ago

I hacked together a new MCP server this weekend that can significantly cut down the overhead of direct tool calling with LLMs inside different agents, especially when making multiple tool calls in a more complex workflow. Inspired by the recent Cloudflare blog post on their Code Mode MCP server and the original Apple white paper, I built something that improves on the Cloudflare server in several ways: it doesn't rely on their backends to isolate execution of the tool calls, it has broader support for MCP features generally, and it does significantly better interface generation and LLM tool hinting to save on context window tokens. This implementation also scales to many more child servers much more cleanly.

Most LLMs are naturally better at code generation than at tool calling: code understanding is more foundational to their knowledge, while tool calling is pounded into models in later stages during fine tuning. Passing data between tools through the LLM in these agent orchestrators can also burn an excessive number of tokens. But if you move the tool calling into code that the LLM generates, rather than having the LLM invoke tools directly, you can get significantly better results for complex cases and reduce the overhead of shuttling data between tool calls.
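
To make that concrete, here's a rough sketch of the kind of script the model ends up writing in the code-execution style (the tool names and shapes here are made up for illustration, not this project's actual API; assume the runtime injects typed wrappers for two child-server tools):

    // Hypothetical injected tool wrappers (declared so the sketch type-checks):
    declare function webSearch(args: { query: string }): Promise<{ url: string; title: string }[]>;
    declare function fetchPage(args: { url: string }): Promise<{ text: string }>;

    // The model emits one script instead of N round-trips through its context window.
    async function main() {
      const hits = await webSearch({ query: "MCP code execution" });
      // Fetch the top results in parallel; the raw pages never pass back through the LLM.
      const pages = await Promise.all(hits.slice(0, 3).map(h => fetchPage({ url: h.url })));
      // Only a small digest is returned to the model.
      return pages.map((p, i) => ({ title: hits[i].title, excerpt: p.text.slice(0, 200) }));
    }

The point is that only the final digest re-enters the context window, instead of every intermediate tool result.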

This implementation basically works as an MCP server proxy. It presents itself as an MCP server while also acting as an MCP client to your child servers. In the middle it hosts a Node VM that executes code generated by the LLM to make tool calls indirectly. By introspecting the child MCP servers and converting their tool interfaces into small, condensed TypeScript API declarations, your LLM can generate code that invokes those tools inside the provided Node VM instead of calling them directly, and handle the response processing and errors in code. This is really powerful when doing multiple tool calls in parallel or when there is logic around the processing. And since it's a Node VM, the generated code also has access to standard Node modules and the built-in standard libraries.
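
To give a sense of what that could look like (a hypothetical sketch, not the server's actual output), a child server's tools might be surfaced to the model as something like:

    // Condensed declarations generated by introspecting a child MCP server.
    // Each tool's JSON Schema input is collapsed into a typed async signature,
    // which costs far fewer context tokens than dumping the full schemas.
    declare namespace github {
      /** maps to the child server's "search_repositories" tool */
      function searchRepos(args: { query: string; limit?: number }): Promise<{ fullName: string; stars: number }[]>;
      /** maps to the child server's "list_issues" tool */
      function listIssues(args: { repo: string; state?: "open" | "closed" }): Promise<{ title: string; number: number }[]>;
    }

Behind the scenes, each of these functions would presumably be bridged back to an ordinary tools/call request against the child server, with the proxy doing the marshalling inside the VM.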

One issue is that if your tool calls are actually simple, like a basic web search or a single tool call, this adds a bit of unnecessary overhead. But the more complex the prompt, the more this approach can significantly improve the quality of the output and lower your inference costs.