Structured Output

Progress in LLMs

January 3, 2026

A second major improvement over this same period was the rise of LLM-based agents. Earlier LLM-based chatbots could generate text, but because they were trained on text data, they tended to respond in natural language. This severely limited their ability to interact with external software.

Turning these chatbots into agents that could use external tools required that they consistently respond with correctly structured, machine-readable output. Back in 2021, people had already experimented with LLMs that could control calculators, search tools, or other small computer scripts. Developers would ask models like GPT-3 to structure their natural-language output into a format like JSON, so that the answers could be used by computer code to determine which tools and input parameters to use in the next steps.

But this initially came with many difficulties. The JSON responses would contain hallucinated commands that weren't available, elements that were not properly nested, or missing closing braces, or the model would drift back into plain text mid-response. Early workflow-automation prototypes relied heavily on regexes to clean up the model outputs and make them processable by computer programs.

An Example

To explain this in more detail, let's say we want our chatbot to use a calculator whenever it needs to make a calculation. To work with the calculator, the chatbot has to be able to do two things: (1) call the calculator, and (2) specify which numbers it wants to add, subtract, multiply, or divide. For this to work, we need the chatbot to bundle this information into a well-structured object. A JSON object is one way to organise such a structured response, and it can for example look like this:

{
  "function_name": "calculator",
  "input": {
    "number_1": 15,
    "number_2": 3,
    "operation": "multiply"
  }
}

We can then write a piece of software that, whenever the chatbot replies with a JSON object, checks the "function_name" field, passes the "number_1", "number_2", and "operation" fields to the calculator, and shows the calculator's output back to the user.
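A minimal sketch of such a dispatcher is shown below. The field names follow the JSON example above; the function names and overall structure are purely illustrative:

```python
import json

def calculator(number_1, number_2, operation):
    """Perform a basic arithmetic operation on two numbers."""
    ops = {
        "add": lambda a, b: a + b,
        "subtract": lambda a, b: a - b,
        "multiply": lambda a, b: a * b,
        "divide": lambda a, b: a / b,
    }
    return ops[operation](number_1, number_2)

def dispatch(llm_reply: str):
    """Parse the model's JSON reply and route it to the right function."""
    call = json.loads(llm_reply)
    if call["function_name"] == "calculator":
        args = call["input"]
        return calculator(args["number_1"], args["number_2"], args["operation"])
    raise ValueError(f"Unknown function: {call['function_name']}")

reply = '{"function_name": "calculator", "input": {"number_1": 15, "number_2": 3, "operation": "multiply"}}'
print(dispatch(reply))  # 45
```

Everything hinges on `json.loads` succeeding and the expected fields being present, which is exactly where early systems ran into trouble.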

But the real problem is: how do we make sure the LLM always responds in exactly this JSON format whenever the answer to a user's question involves a calculation? One naive way is to include the above example in a system prompt that might look something like this: "... When you think a calculation needs to be made, return a JSON object like the above example, where you specify two numbers as input, together with the operation that should be performed on these two numbers." But this might yield output like:

{
  function: "calculator'',
  'input': [
    "number_1``: 15,
    'number_2": 3
    `operation`: "multiply 15 * 3
  ]
)

In practice, model outputs frequently contained missing or mismatched brackets, mismatched quotes, trailing commas, or drifted back into natural language mid-response.

These errors in the returned JSON structure made it nearly impossible for software to rely on an LLM's output. Even worse, models were unaware of which tools actually existed, and would routinely hallucinate nonexistent tools ("download_pdf_from_url"), dream up non-existing parameters, or call functions with the wrong structure. Early automation tools resorted to running the model again and again, with additional layers of error correction, until it produced JSON that could be parsed into a correct format. Having to repeatedly call the model until it responded with a correct function call made these agentic systems slow and expensive to run.
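The retry-until-valid pattern described above can be sketched as follows. Here `call_llm` stands in for a real model API call, and the `function_name` field matches the calculator example; the rest is illustrative:

```python
import json

KNOWN_FUNCTIONS = {"calculator"}

def get_valid_call(call_llm, prompt: str, max_retries: int = 3) -> dict:
    """Re-ask the model until it returns parseable JSON naming a known tool.

    `call_llm` is any callable that takes a prompt string and returns the
    model's raw text reply (in practice, a real LLM API call).
    """
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: ask again
        if call.get("function_name") in KNOWN_FUNCTIONS:
            return call
        # parseable JSON, but it names a tool that doesn't exist: ask again
    raise RuntimeError("model never produced a valid function call")
```

Each failed attempt costs another full model call, which is why this approach was both slow and expensive.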

Function Calling and Structured Outputs

In June 2023, OpenAI fine-tuned the GPT-4 model so that when a user defined a list of available functions with their required inputs, the model would be more likely to respond with an existing function name and valid corresponding inputs. This significantly reduced the problem of the LLM coming up with non-existent tool names or making syntax errors in the JSON structure. It was not perfect, but it allowed developers to more reliably define a set of "tools" or "functions" that their application supported, and let the LLM respond with either a call to one of these functions or a natural-language answer.

Though this new 'native function calling' approach was a first step in the right direction, it was still imperfect. It improved the chances that a model would respond with a well-defined function call, but models still occasionally hallucinated nonexistent functions or used the wrong parameter structure when calling them.
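With native function calling, the developer declares the available functions up front and sends them alongside the prompt. A sketch of what such a declaration could look like for our calculator is below; the shape is loosely modeled on the JSON-Schema-based definitions these APIs used, and the exact wire format varies by provider:

```python
# An illustrative tool declaration: the name, a description the model can
# read, and a JSON Schema constraining the valid inputs.
calculator_tool = {
    "name": "calculator",
    "description": "Perform basic arithmetic on two numbers.",
    "parameters": {
        "type": "object",
        "properties": {
            "number_1": {"type": "number"},
            "number_2": {"type": "number"},
            "operation": {
                "type": "string",
                "enum": ["add", "subtract", "multiply", "divide"],
            },
        },
        "required": ["number_1", "number_2", "operation"],
    },
}
```

Because the model is fine-tuned to read such declarations, it is much more likely to emit a call that names `calculator` and fills in exactly these three fields.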

As a next step, in late 2023, OpenAI introduced 'JSON mode', a setting that forced the model to respond exclusively with valid JSON (without any natural-language output). The idea of 'tool use' by LLMs began to spread across the industry, and other labs worked on similar implementations that would allow LLMs to choose when a tool should be used instead of simply returning plain text.

By early–mid 2024, multiple labs introduced similar 'structured output' features, allowing developers to define a full JSON schema that the model's response would be required to obey. Planning itself became one of the tools a model could call: Gemini, for example, could output a structured "plan object" before performing a task. And many labs started to work on multi-turn, multi-tool agent loops.
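In structured-output mode the schema is enforced on the provider's side, but the idea can be illustrated with a tiny validator that checks a reply against a flat schema. This toy version handles only `type` and `required`; real JSON Schema validation covers far more:

```python
import json

# Map JSON Schema type names to the Python types they correspond to.
TYPE_MAP = {"number": (int, float), "string": str, "object": dict}

def matches_schema(reply: str, schema: dict) -> bool:
    """Check a model reply against a flat JSON Schema (types + required keys)."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    if any(key not in data for key in schema.get("required", [])):
        return False
    return all(
        isinstance(data[key], TYPE_MAP[prop["type"]])
        for key, prop in schema["properties"].items()
        if key in data
    )

schema = {
    "type": "object",
    "properties": {
        "number_1": {"type": "number"},
        "number_2": {"type": "number"},
        "operation": {"type": "string"},
    },
    "required": ["number_1", "number_2", "operation"],
}
```

With provider-side enforcement, a reply that fails a check like this simply cannot be produced, which removed the need for retry loops.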

This shift finally made LLMs usable as reliable data-extraction and data-generation engines, not just chat systems producing free-form text. Instead of prose, models could now return machine-readable representations of almost anything. Agents became capable of working across files, running tools, checking results, and iterating until a task was complete.

Model Context Protocol

But a deeper problem remained: every lab had come up with its own way to define the JSON schemas that had to be given to the LLM when running in 'structured output' mode.

This meant that if you built an application that used an LLM to interact with a tool such as our calculator, you had to rewrite how the JSON schema was given to the model every time you switched to a different LLM, depending on whether you used OpenAI, Anthropic, Google, or an open-source model. This made it difficult to swap out the LLM that a workflow was built on. Companies had to write and maintain a separate adapter (defining each tool their application could use) for each model, and every time a new LLM was released, developers potentially had to rewrite the definitions of every individual tool their workflow could use.
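The adapter problem can be made concrete: the same calculator tool had to be re-wrapped into each provider's expected shape. The two target formats below are loosely modeled on the OpenAI- and Anthropic-style function-calling payloads of the time, and are meant as an illustration rather than an exact spec:

```python
def to_openai_format(tool: dict) -> dict:
    """Wrap a generic tool definition in an OpenAI-style envelope."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["schema"],
        },
    }

def to_anthropic_format(tool: dict) -> dict:
    """Wrap the same definition in an Anthropic-style envelope."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["schema"],
    }

# One generic, provider-neutral definition of the calculator tool.
calculator = {
    "name": "calculator",
    "description": "Perform basic arithmetic on two numbers.",
    "schema": {"type": "object", "properties": {"number_1": {"type": "number"}}},
}
```

Maintaining one such converter per provider, per tool, is exactly the duplication MCP set out to eliminate.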

In late 2024, Anthropic introduced the Model Context Protocol (MCP). Other labs were quick to adopt this format, and MCP became the universal standard for defining how LLMs can interact with external applications, files, databases, etc. MCP made tools portable across models: the same calculator tool definition could now be used unchanged by any of the latest models that support MCP.

Where function calling solved JSON correctness and tool invocation, MCP provided a universal standard that made tool definitions portable across models. Developers could now build complex LLM-based systems that call external tools without having to rewrite every tool definition whenever a new LLM came out; they could simply reuse their existing tool definitions. It also allowed existing software (like databases, and ...) to expose an MCP interface so that any LLM-based system could potentially use it.
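Under MCP, a server advertises its tools in one standard shape that any MCP-aware model can consume. A simplified sketch of what a server's tool listing can look like is below; the field names follow the MCP specification's tools/list response, but details are elided:

```json
{
  "tools": [
    {
      "name": "calculator",
      "description": "Perform basic arithmetic on two numbers.",
      "inputSchema": {
        "type": "object",
        "properties": {
          "number_1": { "type": "number" },
          "number_2": { "type": "number" },
          "operation": { "type": "string" }
        },
        "required": ["number_1", "number_2", "operation"]
      }
    }
  ]
}
```

Because every provider reads this same shape, the definition is written once by the tool's author instead of once per model.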

This enabled a boom in agent ecosystems. LLM-based agents could use ever more tools to browse files, run commands on PCs, query databases, and edit entire projects.

Continue reading: Tool Use