
Tool Calling and Agent Capabilities for LLMs

This document provides a comprehensive overview of tool calling and agent capabilities for Large Language Models (LLMs), covering basic approaches, research foundations, advanced techniques, and practical implementations.

Introduction to LLM Agents

LLM Agents are systems that combine the reasoning capabilities of large language models with the ability to interact with external tools and environments. This combination enables LLMs to go beyond text generation and perform actions in the real world or digital environments.

An LLM agent typically consists of:

  1. A large language model: Provides reasoning, planning, and natural language understanding
  2. Tool interfaces: Allow the LLM to interact with external systems
  3. Orchestration layer: Manages the flow between the LLM and tools
  4. Memory systems: Store context, history, and intermediate results
  5. Planning mechanisms: Enable multi-step reasoning and task decomposition
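
The components above fit together in an orchestration loop. The following is a minimal, framework-free sketch of that loop (the call_llm stub and the TOOLS registry are hypothetical stand-ins, not any particular provider's API):

from typing import Callable, Dict

# Hypothetical registry of tools the agent may call
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"(search results for '{q}')",
    "calculator": lambda expr: str(eval(expr)),  # illustrative only
}

def call_llm(prompt: str) -> dict:
    """Stand-in for a real LLM call; returns either a tool request or a final answer."""
    # A real implementation would send `prompt` to an LLM and parse its reply.
    return {"type": "final", "content": "stub answer"}

def run_agent(task: str, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]            # memory: context and intermediate results
    for _ in range(max_steps):             # orchestration layer
        decision = call_llm("\n".join(history))
        if decision["type"] == "final":    # the model decided it is done
            return decision["content"]
        tool = TOOLS[decision["name"]]     # tool interface
        observation = tool(decision["arguments"])
        history.append(f"Observation: {observation}")
    return "Stopped after max_steps without a final answer."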

Foundations of Tool Calling

Research Papers

  1. "Language Models as Zero-Shot Planners" (2022)
  2. Paper Link
  3. Introduced the concept of using LLMs for planning tasks without specific training
  4. Demonstrated that LLMs can break down complex tasks into steps

  5. "ReAct: Synergizing Reasoning and Acting in Language Models" (2023)

  6. Paper Link
  7. Combined reasoning traces with actions in a synergistic framework
  8. Showed improved performance on tasks requiring both reasoning and tool use

  9. "ToolFormer: Language Models Can Teach Themselves to Use Tools" (2023)

  10. Paper Link
  11. Demonstrated self-supervised learning of tool use by LLMs
  12. Introduced a method for LLMs to learn when and how to call external APIs

  13. "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face" (2023)

  14. Paper Link
  15. Proposed a framework for LLMs to orchestrate specialized AI models
  16. Demonstrated task planning, model selection, and execution coordination

  17. "Gorilla: Large Language Model Connected with Massive APIs" (2023)

  18. Paper Link
  19. Focused on teaching LLMs to use APIs accurately
  20. Introduced techniques for improving API call precision

Basic Approaches

Function Calling

Reference Links: - OpenAI Function Calling Documentation - Anthropic Tool Use Documentation

Motivation: Enable LLMs to interact with external systems in a structured way.

Implementation: Function calling allows LLMs to generate structured JSON outputs that conform to predefined function schemas. The basic workflow is:

  1. Define functions with JSON Schema
  2. Send the function definitions to the LLM along with a prompt
  3. The LLM decides whether to call a function and generates the appropriate arguments
  4. The application executes the function with the provided arguments
  5. Function results are sent back to the LLM for further processing

Example:

from openai import OpenAI
import json

client = OpenAI()

# Define a weather function
weather_function = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g., San Francisco, CA"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit"
                }
            },
            "required": ["location"]
        }
    }
}

# Call the model with the function definition
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather like in Boston?"}],
    tools=[weather_function],
    tool_choice="auto"
)

# Extract and execute the function call
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    # Execute the function
    function_name = tool_calls[0].function.name
    function_args = json.loads(tool_calls[0].function.arguments)

    # Call your actual weather API here
    weather_data = get_weather_data(function_args["location"], function_args.get("unit", "celsius"))

    # Send the results back to the model
    messages = [
        {"role": "user", "content": "What's the weather like in Boston?"},
        response.choices[0].message,
        {
            "role": "tool",
            "tool_call_id": tool_calls[0].id,
            "name": function_name,
            "content": json.dumps(weather_data)
        }
    ]

    final_response = client.chat.completions.create(
        model="gpt-4",
        messages=messages
    )

    print(final_response.choices[0].message.content)

Popularity: Very high. Function calling is supported by most major LLM providers and frameworks.

Drawbacks: - Limited to predefined function schemas - Requires careful schema design to ensure proper use - May struggle with complex, multi-step reasoning

ReAct: Reasoning and Acting

Reference Links: - ReAct Paper - LangChain ReAct Implementation

Motivation: Combine reasoning traces with actions to improve performance on tasks requiring both thinking and doing.

Implementation: ReAct prompts the LLM to generate both reasoning traces and actions in an interleaved manner:

  1. Thought: The LLM reasons about the current state and what to do next
  2. Action: The LLM selects a tool and provides arguments
  3. Observation: The environment returns the result of the action
  4. This cycle repeats until the task is complete
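
A single pass through this cycle produces a transcript like the following (illustrative; the exact wording depends on the model and prompt):

Question: What is the population of France divided by the square root of 2?
Thought: I need the current population of France first.
Action: Search
Action Input: population of France
Observation: France has a population of about 68 million.
Thought: Now I can compute 68,000,000 / sqrt(2).
Action: Calculator
Action Input: 68000000 / (2 ** 0.5)
Observation: 48083261.28
Thought: I have the answer.
Final Answer: Approximately 48.1 million.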

Example:

from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import Tool
from langchain_openai import ChatOpenAI

# Define tools
tools = [
    Tool(
        name="Search",
        func=lambda query: search_engine(query),
        description="Search the web for information"
    ),
    Tool(
        name="Calculator",
        func=lambda expression: eval(expression),
        description="Evaluate mathematical expressions"
    )
]

# Create the agent
llm = ChatOpenAI(model="gpt-4")
prompt = hub.pull("hwchase17/react")  # standard ReAct prompt template
agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Run the agent
result = agent_executor.invoke({"input": "What is the population of France divided by the square root of 2?"})

Popularity: High. ReAct is widely implemented in agent frameworks and has become a standard approach.

Drawbacks: - Can be verbose and token-intensive - May struggle with very complex reasoning chains - Requires careful prompt engineering

ReAct vs Function Calling: A Comparison

| Feature | ReAct | Function Calling |
|---------|-------|------------------|
| Format | Generates reasoning traces and actions in natural language | Produces structured JSON outputs conforming to predefined schemas |
| Reasoning Visibility | Explicit reasoning is visible in the output | Reasoning happens internally and isn't visible |
| Structure | Less structured, more flexible | Highly structured, less flexible |
| Token Usage | Higher (due to reasoning traces) | Lower (only essential function parameters) |
| Error Handling | Can self-correct through reasoning | Requires explicit error handling in the application |
| Tool Discovery | Can discover tools through exploration | Limited to predefined function schemas |
| Implementation Complexity | Requires more prompt engineering | Requires careful schema design |
| Best For | Complex reasoning tasks, exploration | Structured API interactions, precise tool use |

Tool-Augmented LLMs

Reference Links: - ToolFormer Paper - Gorilla Paper

Motivation: Train LLMs to use tools more effectively through specialized fine-tuning.

Implementation: Tool-augmented LLMs are specifically trained or fine-tuned to use external tools:

  1. Create a dataset of tool usage examples
  2. Fine-tune the LLM on this dataset
  3. The resulting model learns when and how to use tools appropriately
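
Training data for such models pairs natural-language requests with target tool calls. A simplified, hypothetical record is shown below (real datasets such as Gorilla's use their own provider-specific formats):

import json

training_example = {
    "messages": [
        {"role": "user", "content": "Convert 'Hello world' to speech with a female voice"},
        # Target output: the model learns to emit a well-formed API call
        {"role": "assistant", "content": 'text_to_speech(text="Hello world", voice="female", speed=1.0)'},
    ],
    # Some datasets also attach the API documentation used to ground the call
    "api_doc": "text_to_speech(text: str, voice: str, speed: float) -> audio",
}

print(json.dumps(training_example, indent=2))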

Example:

Gorilla's approach to API calling (illustrative pseudocode; released Gorilla checkpoints are usually queried through an OpenAI-compatible endpoint rather than a dedicated client library):

from gorilla import GorillaChatCompletion  # hypothetical client, for illustration only

# Define the API you want to use
api_schema = {
    "name": "text_to_speech",
    "description": "Convert text to speech audio",
    "parameters": {
        "text": "The text to convert to speech",
        "voice": "The voice to use (male, female)",
        "speed": "The speed of the speech (0.5-2.0)"
    }
}

# Call Gorilla with the API schema
response = GorillaChatCompletion.create(
    model="gorilla-mpt-7b",
    messages=[{"role": "user", "content": "Convert 'Hello world' to speech using a female voice"}],
    apis=[api_schema]
)

# The response will contain a properly formatted API call
api_call = response.choices[0].message.content
print(api_call)
# Output: text_to_speech(text="Hello world", voice="female", speed=1.0)

Popularity: Medium. Tool-augmented LLMs are growing in popularity but require specialized models.

Drawbacks: - Requires specific fine-tuned models - Less flexible than general-purpose approaches - May not generalize well to new tools

Advanced Approaches

LangGraph: A Graph-Based Agent Framework

Reference Links: - LangGraph Documentation - LangGraph GitHub Repository

Motivation: Enable the creation of stateful, multi-step agent workflows with explicit control flow and state management.

Implementation: LangGraph extends LangChain's agent capabilities with a graph-based approach:

  1. State Management: Explicit state objects that persist across steps
  2. Graph-Based Workflows: Define agent behavior as a directed graph of nodes and edges
  3. Conditional Branching: Dynamic decision-making based on agent outputs
  4. Cyclical Processing: Support for loops and recursive reasoning
  5. Human-in-the-Loop: Seamless integration of human feedback

Example:

from typing import List, Optional, TypedDict, Union

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage

# Define the state schema
class AgentState(TypedDict):
    messages: List[Union[HumanMessage, AIMessage]]
    next_step: Optional[str]

# Create a graph
graph = StateGraph(AgentState)

# Define nodes
def generate_response(state):
    messages = state["messages"]
    llm = ChatOpenAI(model="gpt-4")
    response = llm.invoke(messages)
    return {"messages": messages + [response]}

def decide_next_step(state):
    messages = state["messages"]
    llm = ChatOpenAI(model="gpt-4")
    response = llm.invoke(
        messages + [
            HumanMessage(content="What should be the next step? Options: [research, calculate, finish]")
        ]
    )
    decision = response.content.strip().lower()
    return {"next_step": decision}

def research(state):
    # Implement research functionality
    return {"messages": state["messages"] + [AIMessage(content="Research completed.")]}

def calculate(state):
    # Implement calculation functionality
    return {"messages": state["messages"] + [AIMessage(content="Calculation completed.")]}

# Add nodes to the graph
graph.add_node("generate_response", generate_response)
graph.add_node("decide_next_step", decide_next_step)
graph.add_node("research", research)
graph.add_node("calculate", calculate)

# Define edges (the graph starts at generate_response)
graph.set_entry_point("generate_response")
graph.add_edge("generate_response", "decide_next_step")
graph.add_conditional_edges(
    "decide_next_step",
    lambda state: state["next_step"],
    {
        "research": "research",
        "calculate": "calculate",
        "finish": END
    }
)
graph.add_edge("research", "generate_response")
graph.add_edge("calculate", "generate_response")

# Compile the graph
agent_executor = graph.compile()

# Run the agent
result = agent_executor.invoke({"messages": [HumanMessage(content="Analyze the impact of AI on healthcare.")]})

Key Differences from Traditional Agents:

  1. Explicit vs. Implicit Control Flow: LangGraph makes the agent's decision-making process explicit through graph structure, while traditional agents rely on the LLM to manage control flow implicitly.

  2. State Management: LangGraph provides robust state management, allowing complex state to persist across steps, whereas traditional agents often have limited state persistence.

  3. Composability: LangGraph enables easy composition of multiple agents and tools into complex workflows, making it more suitable for enterprise applications.

  4. Debugging and Visualization: The graph structure makes it easier to debug and visualize agent behavior compared to traditional black-box agents.

  5. Deterministic Routing: LangGraph allows for deterministic routing between steps based on explicit conditions, reducing the unpredictability of LLM-based control flow.

Popularity: Medium but rapidly growing. LangGraph is becoming the preferred approach for complex agent workflows in the LangChain ecosystem.

Drawbacks: - Higher complexity compared to simpler agent frameworks - Steeper learning curve - Requires more boilerplate code - Still evolving with frequent API changes

Model Context Protocol (MCP)

Reference Links: - Model Context Protocol (MCP)

Motivation: Standardize the way context, tools, and memory are injected into LLM prompts.

Implementation: MCP provides a structured JSON-based protocol for context injection:

  1. Define a context bundle with various components (memory, tools, etc.)
  2. Send the bundle to an MCP server
  3. The server processes the bundle and constructs an optimized prompt
  4. The prompt is sent to the LLM for processing

Example:

# Send a request to the MCP server
import requests

context_bundle = {
    "user_input": "What's the weather like in Paris?",
    "memory": {
        "enable": True,
        "k": 5,  # Number of memories to retrieve
        "filter": {"type": "conversation"}
    },
    "tools": [
        {
            "name": "get_weather",
            "description": "Get weather information for a location",
            "parameters": {
                "location": "The city name",
                "unit": "Temperature unit (celsius/fahrenheit)"
            }
        }
    ]
}

response = requests.post("http://localhost:8000/mcp/context", json=context_bundle)
enhanced_prompt = response.json()["prompt"]

# Send the enhanced prompt to an LLM
# ...

Popularity: Low to Medium. MCP is a newer approach but gaining traction for standardizing context injection.

Drawbacks: - Requires additional server infrastructure - Less standardized than other approaches - May add latency to the request pipeline

Agentic Workflows

Reference Links: - LangChain Agents - BabyAGI - AutoGPT

Motivation: Enable LLMs to perform complex, multi-step tasks through autonomous planning and execution.

Implementation: Agentic workflows combine planning, tool use, and memory:

  1. The LLM creates a plan for solving a complex task
  2. It breaks the plan into subtasks
  3. For each subtask, it selects and uses appropriate tools
  4. Results are stored in memory and used to inform subsequent steps
  5. The process continues until the task is complete

Example:

from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory

# Define tools
tools = [
    Tool(
        name="Search",
        func=lambda query: search_engine(query),
        description="Search the web for information"
    ),
    Tool(
        name="Calculator",
        func=lambda expression: eval(expression),
        description="Evaluate mathematical expressions"
    ),
    Tool(
        name="WeatherAPI",
        func=lambda location: get_weather(location),
        description="Get weather information for a location"
    )
]

# Set up memory
memory = ConversationBufferMemory(memory_key="chat_history")

# Create the agent
llm = ChatOpenAI(model="gpt-4")
agent = initialize_agent(
    tools, 
    llm, 
    agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
    memory=memory,
    verbose=True
)
# CHAT_CONVERSATIONAL_REACT_DESCRIPTION is an extended version of ReAct that adds
# conversation support and memory persistence, making it suitable for more complex
# agentic workflows. It still follows the Thought-Action-Observation cycle.

# Run the agent on a complex task
result = agent.run(
    "Plan a day trip to Paris. I need to know the weather, top 3 attractions, "
    "and calculate a budget of 200 euros divided among these activities."
)

Popularity: High. Agentic workflows are widely used for complex task automation.

Drawbacks: - Can be computationally expensive - May struggle with very long-horizon planning - Requires careful tool design and error handling

Implementation Links: - LangChain Thought-Action-Observation Implementation - ReAct Agent Loop in LangChain - Agent Executor Implementation

Agentic Workflows vs ReAct: A Comparison

| Feature | ReAct | Agentic Workflows |
|---------|-------|-------------------|
| Scope | Focused on single-task reasoning and execution | Designed for complex, multi-step tasks with planning |
| Planning | Limited planning, focuses on immediate next steps | Explicit planning phase to break down complex tasks |
| Memory | Typically stateless or with simple memory | Integrated memory to track progress across subtasks |
| Autonomy | Semi-autonomous with human oversight | Higher autonomy for extended task sequences |
| Complexity | Better for focused, well-defined tasks | Better for open-ended, complex problem-solving |
| Structure | Rigid Thought-Action-Observation cycle | Flexible workflow with planning, execution, and reflection phases |
| Task Decomposition | Limited task decomposition | Explicit task decomposition into subtasks |
| Resource Usage | Moderate token usage | Higher token usage due to planning overhead |
| Best For | Single queries requiring reasoning and tool use | Complex tasks requiring multiple steps and planning |

Multi-Agent Systems

Reference Links: - AutoGen - CrewAI - Multi-Agent Collaboration Paper

Motivation: Distribute complex tasks among specialized agents for more effective problem-solving.

Implementation: Multi-agent systems involve multiple LLM agents with different roles:

  1. Define specialized agents with different roles and capabilities
  2. Create a communication protocol between agents
  3. Implement a coordination mechanism (e.g., a manager agent)
  4. Allow agents to collaborate on complex tasks

Example:

from autogen import AssistantAgent, UserProxyAgent, config_list_from_json

# Configure agents
config_list = config_list_from_json("OAI_CONFIG_LIST")

# Create a research agent
researcher = AssistantAgent(
    name="Researcher",
    llm_config={"config_list": config_list},
    system_message="You are a research expert. Find and analyze information on topics."
)

# Create a coding agent
coder = AssistantAgent(
    name="Coder",
    llm_config={"config_list": config_list},
    system_message="You are a Python expert. Write code to solve problems."
)

# Create a user proxy agent
user_proxy = UserProxyAgent(
    name="User",
    human_input_mode="TERMINATE",
    max_consecutive_auto_reply=10,
    code_execution_config={"work_dir": "coding"}
)

# Start a group chat
user_proxy.initiate_chat(
    researcher,
    message="Research the latest machine learning techniques for time series forecasting "
            "and then have the coder implement a simple example."
)

Popularity: Medium to High. Multi-agent systems are gaining popularity for complex tasks.

Drawbacks: - Complex to set up and manage - Can be expensive due to multiple LLM calls - May suffer from coordination issues - Potential for agents to get stuck in loops

Tool Learning

Reference Links: - ToolFormer Paper - TALM Paper

Motivation: Enable LLMs to learn when and how to use tools through self-supervised learning.

Implementation: Tool learning involves training LLMs to recognize when tools are needed:

  1. Create a dataset of problems and their solutions using tools
  2. Fine-tune the LLM on this dataset
  3. The model learns to identify situations where tools are helpful
  4. It also learns the correct syntax and parameters for tool calls

Example:

ToolFormer's approach:

# Example of a ToolFormer-generated response with tool calls

# Input: "What is the capital of France and what's the current temperature there?"

# ToolFormer output:
"The capital of France is Paris. [TOOL:Weather(location="Paris, France")] The current temperature in Paris is 18°C."

# This output includes a tool call that would be parsed and executed by the system
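
The filtering step that makes this self-supervision work can be sketched as follows; the lm_loss stub is a placeholder for the model's actual token-level loss, and the threshold is illustrative:

def lm_loss(prefix: str, continuation: str) -> float:
    """Placeholder for the language model's loss on `continuation` given `prefix`."""
    return float(len(continuation))  # dummy value; a real implementation scores with the LM

def keep_api_call(text: str, position: int, api_call: str, api_result: str,
                  threshold: float = 1.0) -> bool:
    """Toolformer-style filter: keep a sampled API call only if conditioning on its
    result lowers the loss on the text that follows the insertion point."""
    prefix, suffix = text[:position], text[position:]
    loss_plain = lm_loss(prefix, suffix)
    loss_with_call = lm_loss(prefix + f" [{api_call} -> {api_result}] ", suffix)
    return (loss_plain - loss_with_call) >= threshold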

Popularity: Medium. Tool learning is an active research area but not yet widely deployed.

Drawbacks: - Requires specialized training data - May not generalize well to new tools - Less flexible than runtime tool definition approaches

Framework Implementations

OpenAI

Reference Links: - OpenAI Function Calling - OpenAI Assistants API - OpenAI Responses API

Key Features: - Native function calling in chat completions API - Assistants API with built-in tool use - Responses API combining strengths of both previous APIs - Support for code interpreter, retrieval, and function calling - Parallel function calling in newer models - Server-side state management in Responses and Assistants APIs

Example:

from openai import OpenAI
import json

client = OpenAI()

# Define functions
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g., San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Call the model
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather like in Boston and Tokyo?"}],
    tools=tools,
    tool_choice="auto"
)

# Process tool calls
message = response.choices[0].message
tool_calls = message.tool_calls

if tool_calls:
    # Process each tool call
    tool_call_messages = [message]

    for tool_call in tool_calls:
        function_name = tool_call.function.name
        function_args = json.loads(tool_call.function.arguments)

        # Call your actual function here
        function_response = get_weather(function_args["location"], function_args.get("unit", "celsius"))

        tool_call_messages.append({
            "tool_call_id": tool_call.id,
            "role": "tool",
            "name": function_name,
            "content": json.dumps(function_response)
        })

    # Get the final response
    second_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What's the weather like in Boston and Tokyo?"}] + tool_call_messages
    )

    print(second_response.choices[0].message.content)

Popularity: Very high. OpenAI's implementation is widely used and well-documented.

Drawbacks: - Requires OpenAI API access - Can be expensive for complex agent workflows - Limited to predefined function schemas

OpenAI Responses API vs. Chat Completions vs. Assistants

| Feature | Chat Completions API | Assistants API | Responses API |
|---------|----------------------|----------------|---------------|
| State Management | Client-side (must send full conversation history) | Server-side (threads) | Server-side (simpler than Assistants) |
| Function/Tool Calling | Basic support | Advanced support | Advanced support with simplified workflow |
| Built-in Tools | Limited | Code interpreter, retrieval, function calling | Web search, file search, function calling |
| Conversation Flow | Manual orchestration | Complex (threads, messages, runs) | Simplified with previous_response_id |
| Implementation Complexity | Higher for complex workflows | Highest | Lowest |
| Longevity | Indefinite support promised | Being sunset (2026) | Current focus |
| Best For | Simple interactions, custom workflows | Complex agents (legacy) | Modern agent development |

Responses API Example:

from openai import OpenAI
import json

client = OpenAI()

# Define functions
tools = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather in a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g., San Francisco, CA"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit"
                }
            },
            "required": ["location"]
        }
    }
]

# Initial request with function definition
response = client.responses.create(
    model="gpt-4o",
    input="What's the weather like in Boston and Tokyo?",
    tools=tools,
    store=True  # Enable server-side state management
)

# Process tool calls: function calls appear as items in response.output
tool_outputs = []
for item in response.output:
    if item.type == "function_call" and item.name == "get_weather":
        args = json.loads(item.arguments)
        location = args["location"]
        unit = args.get("unit", "celsius")

        # Call your actual function here
        weather_data = get_weather(location, unit)

        tool_outputs.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": json.dumps(weather_data)
        })

# Submit the tool outputs in a follow-up request chained to the previous response
final_response = client.responses.create(
    model="gpt-4o",
    previous_response_id=response.id,
    input=tool_outputs,
    tools=tools
)
print(final_response.output_text)

# Continue the conversation using previous_response_id
follow_up = client.responses.create(
    model="gpt-4o",
    input="How does that compare to Miami?",
    previous_response_id=final_response.id  # reference the latest response in the chain
)

LangChain

Reference Links: - LangChain Agents - LangChain Tools

Key Features: - Multiple agent types (ReAct, Plan-and-Execute, etc.) - Extensive tool library - Memory integration - Support for various LLM providers - Agent executors for managing agent-tool interaction

Example:

from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType
from langchain_openai import ChatOpenAI

# Load tools
tools = load_tools(["serpapi", "llm-math"], llm=ChatOpenAI(temperature=0))

# Initialize agent
agent = initialize_agent(
    tools, 
    ChatOpenAI(temperature=0), 
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Run the agent
agent.run("Who is the current US president? What is their age raised to the 0.43 power?")

Popularity: Very high. LangChain is one of the most popular frameworks for building LLM agents.

Drawbacks: - Can be complex to set up for advanced use cases - Documentation can be challenging to navigate - Frequent API changes

LlamaIndex

Reference Links: - LlamaIndex Agents - LlamaIndex Tools

Key Features: - Integration with retrieval-augmented generation (RAG) - Query engines as tools - OpenAI Assistants API integration - Function calling support - Agent executors similar to LangChain

Example:

from llama_index.core.tools import FunctionTool, QueryEngineTool
from llama_index.agent.openai import OpenAIAgent
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Define a simple tool
def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the result."""
    return a * b

multiply_tool = FunctionTool.from_defaults(fn=multiply)

# Create a RAG query engine and wrap it as a tool
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
query_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="document_search",
    description="Answer questions about the documents in ./data"
)

# Create an agent with tools
agent = OpenAIAgent.from_tools(
    [multiply_tool, query_tool],
    verbose=True
)

# Run the agent
response = agent.chat("What information is in my documents? Also, what is 123 * 456?")
print(response)

Popularity: High. LlamaIndex is popular especially for RAG-based agents.

Drawbacks: - More focused on retrieval than general agent capabilities - Less extensive tool library than LangChain - Documentation can be sparse for advanced use cases

Semantic Kernel

Reference Links: - Semantic Kernel - SK Function Calling

Key Features: - Plugin architecture for tools - Native .NET and Python support - Semantic functions and native functions - Planning capabilities - Memory integration

Example (illustrative; exact class and method names vary across Semantic Kernel Python SDK versions):

import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion

# Create a kernel
kernel = sk.Kernel()

# Add OpenAI service
kernel.add_chat_service("chat-gpt", OpenAIChatCompletion("gpt-4"))

# Define a native function
@sk.kernel_function
def get_weather(location: str) -> str:
    """Get the weather for a location."""
    # In a real scenario, call a weather API here
    return f"It's sunny in {location} with a temperature of 72°F."

# Register the function
kernel.add_function(get_weather)

# Create a semantic function
prompt = """{{$input}}\n\nAnswer the user's question. If you need to know the weather, use the get_weather function."""
function = kernel.create_semantic_function(prompt, max_tokens=2000, temperature=0.7)

# Run the function
result = function.invoke("What's the weather like in Seattle?")
print(result)

Popularity: Medium. Semantic Kernel is growing in popularity, especially in Microsoft ecosystem.

Drawbacks: - Less mature than LangChain or OpenAI's solutions - Smaller community and fewer examples - Documentation can be technical and dense

AutoGen

Reference Links: - AutoGen - AutoGen Multi-Agent Collaboration

Key Features: - Multi-agent conversation framework - Customizable agent roles and capabilities - Code generation and execution - Human-in-the-loop interactions - Conversational memory

Example:

from autogen import AssistantAgent, UserProxyAgent, config_list_from_json

# Load LLM configuration
config_list = config_list_from_json("OAI_CONFIG_LIST")

# Create an assistant agent
assistant = AssistantAgent(
    name="Assistant",
    llm_config={"config_list": config_list},
    system_message="You are a helpful AI assistant."
)

# Create a user proxy agent with code execution capability
user_proxy = UserProxyAgent(
    name="User",
    human_input_mode="TERMINATE",
    code_execution_config={"work_dir": "coding", "use_docker": False}
)

# Start a conversation
user_proxy.initiate_chat(
    assistant,
    message="Create a Python function to calculate the Fibonacci sequence up to n terms."
)

Popularity: Medium and growing. AutoGen is gaining traction for multi-agent systems.

Drawbacks: - Steeper learning curve than some alternatives - More complex to set up - Less extensive documentation and examples

CrewAI

Reference Links: - CrewAI - CrewAI Documentation

Key Features: - Role-based agent framework - Process-oriented workflows - Task delegation and management - Agent collaboration patterns - Human-in-the-loop capabilities

Example:

from crewai import Agent, Task, Crew
from crewai_tools import SerperDevTool

# Create a search tool
search_tool = SerperDevTool()

# Create agents with specific roles
researcher = Agent(
    role="Senior Research Analyst",
    goal="Uncover cutting-edge developments in AI",
    backstory="You are an expert in analyzing AI research papers and trends",
    tools=[search_tool],
    verbose=True
)

writer = Agent(
    role="Technical Writer",
    goal="Create engaging content about AI developments",
    backstory="You transform complex technical concepts into accessible content",
    verbose=True
)

# Define tasks for each agent
research_task = Task(
    description="Research the latest developments in large language models",
    agent=researcher,
    expected_output="A comprehensive report on recent LLM advancements"
)

writing_task = Task(
    description="Write a blog post about the latest LLM developments",
    agent=writer,
    expected_output="A 500-word blog post about LLM advancements",
    context=[research_task]
)

# Create a crew with the agents and tasks
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    verbose=True
)

# Execute the crew's tasks
result = crew.kickoff()
print(result)

Popularity: Medium but rapidly growing. CrewAI is newer but gaining popularity for role-based agents.

Drawbacks: - Newer framework with less community support - Limited tool integrations compared to more established frameworks - Documentation is still evolving

Technical Deep Dive

Function Calling Implementation

Function calling in LLMs involves several key technical components:

  1. JSON Schema Definition: Functions are defined using JSON Schema, which provides a structured way to describe the function's parameters and return values.

  2. Prompt Engineering: The LLM needs to be prompted in a way that encourages it to use the provided functions when appropriate. This often involves system prompts that instruct the model to output JSON when calling tools. Implementation examples:
     - OpenAI Function Calling System Prompt Example
     - Anthropic Tool Use System Prompt
     - LangChain Tool Calling Templates

Example system prompt for JSON tool calling:

You are a helpful assistant with access to tools. When you need to use a tool, respond in the following JSON format:
{"tool": "tool_name", "parameters": {"param1": "value1", "param2": "value2"}}

If you don't need to use a tool, respond normally. Always use proper JSON with double quotes for both keys and string values.

  3. Output Parsing: The LLM's output needs to be parsed to extract function calls and their arguments.

  4. Function Execution: The extracted function calls need to be executed in the application environment.

  5. Result Integration: The results of the function execution need to be integrated back into the conversation.
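
A minimal sketch of the parsing step, assuming the JSON convention from the system prompt above (a production parser would add schema validation and retry logic):

import json
import re
from typing import Optional, Tuple

def parse_tool_call(llm_output: str) -> Optional[Tuple[str, dict]]:
    """Return (tool_name, parameters) if the output contains a JSON tool call, else None."""
    match = re.search(r'\{.*\}', llm_output, re.DOTALL)  # find the JSON object in the reply
    if not match:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if "tool" in payload and "parameters" in payload:
        return payload["tool"], payload["parameters"]
    return None

# Example
print(parse_tool_call('{"tool": "get_weather", "parameters": {"location": "Boston"}}'))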

Here's a detailed look at how function calling is implemented in the OpenAI API:

# 1. Define the function schema
function_schema = {
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the current stock price for a company",
        "parameters": {
            "type": "object",
            "properties": {
                "symbol": {
                    "type": "string",
                    "description": "The stock symbol, e.g., AAPL for Apple"
                }
            },
            "required": ["symbol"]
        }
    }
}

# 2. Send the request to the API with the function definition
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the current stock price of Apple?"}],
    tools=[function_schema],
    tool_choice="auto"
)

# 3. Parse the response to extract function calls
message = response.choices[0].message
tool_calls = message.tool_calls

if tool_calls:
    # 4. Execute the function
    function_call = tool_calls[0].function
    function_name = function_call.name
    function_args = json.loads(function_call.arguments)

    # Call the actual function
    if function_name == "get_stock_price":
        stock_price = get_real_stock_price(function_args["symbol"])

    # 5. Send the function result back to the API
    second_response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": "What's the current stock price of Apple?"},
            message,
            {
                "role": "tool",
                "tool_call_id": tool_calls[0].id,
                "name": function_name,
                "content": json.dumps({"price": stock_price, "currency": "USD"})
            }
        ]
    )

    # Final response with the information
    final_response = second_response.choices[0].message.content
    print(final_response)

Under the hood, the LLM has been trained to:

  1. Recognize when a function would be useful for answering a query
  2. Generate a properly formatted function call with appropriate arguments
  3. Incorporate the function results into its response

This is typically implemented through fine-tuning on function calling examples or through few-shot learning in the prompt.

ReAct Implementation

ReAct (Reasoning and Acting) is a powerful paradigm that combines reasoning traces with actions. Here's a detailed look at how ReAct is implemented in LangChain:

from langchain.agents import AgentType, initialize_agent, load_tools
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory

# Initialize the language model
llm = ChatOpenAI(temperature=0)

# Load tools
tools = load_tools(["serpapi", "llm-math"], llm=llm)

# Set up memory
memory = ConversationBufferMemory(memory_key="chat_history")

# Create the ReAct agent
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,  # ReAct-style agent with conversation memory
    verbose=True,
    memory=memory
)

# Run the agent
response = agent.run(
    "What was the high temperature in SF yesterday? What is that number raised to the .023 power?"
)

Under the hood, LangChain's ReAct implementation works through these key components:

  1. Prompt Template: A specialized prompt that instructs the LLM to follow the Thought-Action-Observation pattern

  2. Output Parser: Parses the LLM's output to extract the thought, action, and action input

  3. Tool Execution: Executes the specified action with the provided input

  4. Agent Loop: Continues the cycle until a final answer is reached
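
A simplified sketch of the loop and output parsing (the regex patterns approximate LangChain's ReAct output parser; this is not the library's actual code):

import re

ACTION_RE = re.compile(r"Action:\s*(.*?)\nAction Input:\s*(.*)", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)

def react_loop(llm, tools: dict, question: str, max_steps: int = 8) -> str:
    """Run the Thought-Action-Observation cycle until a final answer appears."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        output = llm(scratchpad)               # model emits Thought/Action or Final Answer
        final = FINAL_RE.search(output)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(output)
        if not action:
            scratchpad += output + "\nObservation: Could not parse an action.\n"
            continue
        name, tool_input = action.group(1).strip(), action.group(2).strip()
        observation = tools[name](tool_input)  # execute the chosen tool
        scratchpad += output + f"\nObservation: {observation}\n"
    return "Agent stopped: step limit reached."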

Implementation Links: - LangChain ReAct Agent Source Code - ReAct Prompt Templates - Agent Executor Implementation

The ReAct implementation demonstrates how structured reasoning can be combined with tool use to create more effective agents.

MCP Implementation

Motivation: The Model Context Protocol (MCP) was developed to address several key challenges in LLM applications:

  1. Standardization: Different LLM providers and frameworks use different formats for context injection, making it difficult to switch between them.

  2. Optimization: Naively injecting context can lead to token wastage and reduced performance.

  3. Modularity: Applications often need to combine multiple types of context (memory, tools, etc.) in a flexible way.

  4. Scalability: As applications grow more complex, managing context becomes increasingly challenging.

How It Works: MCP provides a standardized way to inject context, tools, and memory into LLM prompts. Here's a technical overview of how MCP works:

  1. Context Bundle: The client creates a context bundle containing the user input, memory configuration, tools, and other context.

  2. MCP Server: The bundle is sent to an MCP server, which processes it and constructs an optimized prompt.

  3. Prompt Construction: The server uses templates and plugins to construct a prompt that includes the relevant context and tools.

  4. LLM Processing: The constructed prompt is sent to the LLM for processing.

  5. Response Parsing: The LLM's response is parsed to extract tool calls and other structured information. This often relies on system prompts that instruct the model to output in specific JSON formats when using tools. See MCP JSON Response Format Example for implementation details.

Internal Implementation: The MCP architecture consists of several key components:

  1. Protocol Definition: Standardized schemas for context bundles, tools, memory, and other components. These schemas define the structure of data exchanged between clients and the MCP server, including message formats, parameter types, and response structures, ensuring consistency and interoperability across implementations.
     - Semantic Kernel Protocol Implementation
     - LangChain Protocol Implementation

  2. Server Implementation: A FastAPI server that processes context bundles and constructs prompts. The server receives bundles from clients, selects relevant context, builds prompts from templates, and manages communication with LLM providers, handling authentication, rate limiting, and caching along the way.
     - Semantic Kernel Server Implementation

  3. Plugin System: Extensible plugins for different types of context (memory, tools, etc.). Each plugin handles one aspect of context processing, such as retrieving relevant memories, defining available tools, or incorporating domain-specific knowledge, so the server can be extended without modifying its core code.
     - Semantic Kernel Plugin System

  4. Client Libraries: Libraries for different programming languages that create context bundles, send them to MCP servers, and process the responses, handling serialization, error handling, and retries on the client side.
     - Semantic Kernel Python Client

Framework Adoption:

  1. Semantic Kernel: Microsoft's Semantic Kernel has fully embraced MCP as its core architecture.
     - Status: Production-ready, actively maintained
     - Semantic Kernel MCP Documentation

  2. LangChain: LangChain has implemented some MCP concepts but with its own variations.
     - Status: Partial adoption, evolving
     - LangChain Schema Documentation

  3. LlamaIndex: LlamaIndex has begun adopting MCP-like concepts for context management.
     - Status: Early adoption, experimental
     - LlamaIndex Context Management

  4. Custom Implementations: Many organizations are implementing custom MCP-like systems.
     - Status: Varied, from experimental to production

Future Directions: MCP is evolving in several key directions:

  1. Standardization: Efforts to create a cross-framework standard for context injection
  2. Optimization: More sophisticated context selection and prompt construction algorithms
  3. Multimodal Support: Extending MCP to handle images, audio, and other modalities
  4. Distributed Architecture: Scaling MCP to handle large-scale applications

Here's a simplified implementation of an MCP server:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Dict, Any, Optional
import json

app = FastAPI()

class MemoryConfig(BaseModel):
    enable: bool = True
    k: int = 5
    filter: Optional[Dict[str, Any]] = None

class Tool(BaseModel):
    name: str
    description: str
    parameters: Dict[str, Any]

class ContextBundle(BaseModel):
    user_input: str
    memory: Optional[MemoryConfig] = None
    tools: Optional[List[Tool]] = None
    additional_context: Optional[Dict[str, Any]] = None

class PromptResponse(BaseModel):
    prompt: str
    context_used: Dict[str, Any]

@app.post("/mcp/context", response_model=PromptResponse)
async def process_context(bundle: ContextBundle):
    # Initialize the prompt components
    prompt_parts = []
    context_used = {}

    # Add system instructions
    prompt_parts.append("You are a helpful AI assistant.")

    # Add memory if enabled
    if bundle.memory and bundle.memory.enable:
        # In a real implementation, this would retrieve relevant memories
        memories = retrieve_memories(bundle.user_input, bundle.memory.k, bundle.memory.filter)
        if memories:
            prompt_parts.append("\nRelevant context from memory:")
            for memory in memories:
                prompt_parts.append(f"- {memory}")
            context_used["memories"] = memories

    # Add tools if provided
    if bundle.tools:
        prompt_parts.append("\nYou have access to the following tools:")
        for tool in bundle.tools:
            prompt_parts.append(f"\n{tool.name}: {tool.description}")
            prompt_parts.append(f"Parameters: {json.dumps(tool.parameters, indent=2)}")
        context_used["tools"] = [t.name for t in bundle.tools]

        # Add instructions for tool usage
        prompt_parts.append("\nTo use a tool, respond with:")
        prompt_parts.append('{"tool": "tool_name", "parameters": {"param1": "value1"}}\n')

    # Add additional context if provided
    if bundle.additional_context:
        for key, value in bundle.additional_context.items():
            prompt_parts.append(f"\n{key}: {value}")
        context_used["additional_context"] = list(bundle.additional_context.keys())

    # Add the user input
    prompt_parts.append(f"\nUser: {bundle.user_input}")
    prompt_parts.append("\nAssistant:")

    # Combine all parts into the final prompt
    final_prompt = "\n".join(prompt_parts)

    return PromptResponse(prompt=final_prompt, context_used=context_used)

def retrieve_memories(query: str, k: int, filter_config: Optional[Dict[str, Any]]):
    # In a real implementation, this would query a vector database
    # For this example, we'll return dummy memories
    return ["This is a relevant memory", "This is another relevant memory"]

This implementation demonstrates the core concepts of MCP:

  1. Standardized context bundle format
  2. Modular prompt construction
  3. Memory integration
  4. Tool definition and usage instructions
  5. Additional context injection

The actual implementation would include more sophisticated memory retrieval, tool handling, and prompt optimization.

Evaluation and Benchmarks

Evaluating LLM agents is challenging due to the complexity and diversity of tasks they can perform. Several benchmarks and evaluation frameworks have emerged:

AgentBench

Reference Link: AgentBench Paper

AgentBench evaluates agents on eight diverse tasks:

  1. Operating System Interaction
  2. Database Querying
  3. Knowledge Graph Querying
  4. Web Browsing
  5. Digital Card Game Playing
  6. Embodied Household Tasks
  7. Open-Domain Question Answering
  8. Web Shopping

Results show that even advanced models like GPT-4 achieve only 54.2% success rate, highlighting the challenges in building effective agents.

ToolBench

Reference Link: ToolBench Paper

ToolBench focuses specifically on tool use capabilities:

  1. Tool Selection: Choosing the right tool for a task
  2. Parameter Filling: Providing correct parameters
  3. Tool Composition: Using multiple tools together
  4. Error Recovery: Handling errors in tool execution

The benchmark includes 16,464 tasks involving 248 real-world APIs.

ReAct Benchmark

Reference Link: ReAct Paper

The ReAct benchmark evaluates agents on:

  1. HotpotQA: Multi-hop question answering
  2. FEVER: Fact verification
  3. WebShop: Web shopping simulation
  4. ALFWorld: Household tasks in a text environment

Results show that ReAct outperforms standard prompting and chain-of-thought approaches.

Key Metrics

When evaluating LLM agents, several key metrics are important:

  1. Task Completion Rate: Percentage of tasks successfully completed
  2. Efficiency: Number of steps or API calls needed to complete a task
  3. Accuracy: Correctness of the final result
  4. Robustness: Performance under different conditions or with unexpected inputs
  5. Cost: Computational and financial cost of running the agent
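
These metrics are typically aggregated from per-task run logs. A small sketch with hypothetical record fields:

from statistics import mean

runs = [
    {"completed": True,  "correct": True,  "steps": 4, "cost_usd": 0.12},
    {"completed": True,  "correct": False, "steps": 6, "cost_usd": 0.21},
    {"completed": False, "correct": False, "steps": 8, "cost_usd": 0.30},
]

task_completion_rate = mean(r["completed"] for r in runs)   # fraction of tasks finished
accuracy = mean(r["correct"] for r in runs)                 # fraction with a correct result
avg_steps = mean(r["steps"] for r in runs)                  # efficiency proxy
avg_cost = mean(r["cost_usd"] for r in runs)                # financial cost per task

print(f"completion={task_completion_rate:.2f} accuracy={accuracy:.2f} "
      f"steps={avg_steps:.1f} cost=${avg_cost:.2f}")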

Future Directions

Multimodal Agents

Future agents will increasingly incorporate multimodal capabilities:

  • Vision for understanding images and videos
  • Audio for speech recognition and generation
  • Tactile feedback for robotic applications

This will enable more natural and comprehensive interactions with the physical world.

Agentic Memory

Advanced memory systems will enhance agent capabilities:

  • Episodic memory for remembering past interactions
  • Procedural memory for learning and improving skills
  • Semantic memory for storing knowledge
  • Working memory for handling complex reasoning tasks

Autonomous Learning

Agents will become more capable of learning from experience:

  • Self-improvement through reflection
  • Learning new tools and APIs
  • Adapting to user preferences
  • Discovering new strategies for problem-solving

Multi-Agent Ecosystems

Complex systems of specialized agents will emerge:

  • Hierarchical organization with manager and worker agents
  • Collaborative problem-solving
  • Market-based allocation of tasks
  • Emergent behaviors from agent interactions

Alignment and Safety

Ensuring agents act in accordance with human values will be crucial:

  • Constitutional AI approaches
  • Human feedback mechanisms
  • Sandboxed execution environments
  • Monitoring and intervention systems

References

  1. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.

  2. Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761.

  3. Shen, Y., Song, K., Tan, X., Li, D., Lu, W., & Zhuang, Y. (2023). HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv:2303.17580.

  4. Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv:2305.15334.

  5. Huang, W., Abbeel, P., Pathak, D., & Mordatch, I. (2022). Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. arXiv:2201.07207.

  6. Qin, Y., Liang, W., Ye, H., Zhong, V., Zhuang, Y., Li, X., Cui, Y., Gu, N., Liu, X., & Jiang, N. (2023). ToolBench: Towards Evaluating and Enhancing Tool Manipulation Capabilities of Large Language Models. arXiv:2307.16789.

  7. Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., et al. (2023). AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688.

  8. Parisi, A., Zhao, Y., & Fiedel, N. (2022). TALM: Tool Augmented Language Models. arXiv:2205.12255.

  9. Qian, C., Cong, X., Yang, C., Chen, W., Su, Y., Xu, J., Liu, Z., & Sun, M. (2023). Communicative Agents for Software Development. arXiv:2307.07924.

  10. Sumers, T. R., Yao, S., Narasimhan, K., & Griffiths, T. L. (2023). Cognitive Architectures for Language Agents. arXiv:2309.02427.