Tool Calling with a 10-Function Pipeline¶
Large language models are increasingly expected to call tools — pick the right API, extract parameters, chain multi-step flows, and know when NOT to act. This normally requires large models (70B+). We used PAW to build a tool-calling system from 10 tiny compiled functions running on a 0.6B interpreter, and tested it on the ToolCall-15 benchmark.
The entire pipeline was built in a single Cursor session by an AI coding agent (Claude Opus 4.6) that read PAW's AGENTS.md and the benchmark source.
Result: 93% (28/30 points) on ToolCall-15, with 14 out of 15 scenarios passing.
Benchmark leakage
The AI agent that built this pipeline had access to the benchmark source code, including the 15 scenario definitions and scoring logic. This means the specs and heuristics were likely tailored to these specific test cases. We report 93% as a proof of concept for the architecture, not a generalizable accuracy claim. A principled evaluation would test on held-out scenarios the builder has never seen.
The benchmark: ToolCall-15¶
ToolCall-15 (by stevibe) is a visual benchmark for LLM tool use. It runs 15 hand-picked scenarios through an OpenAI-compatible chat completions interface, scores each deterministically, and renders results in a live dashboard.
12 available tools (given to every scenario):
web_search, get_weather, calculator, send_email, search_files, read_file, create_calendar_event, get_contacts, translate_text, get_stock_price, set_reminder, run_code
15 scenarios across 5 categories (3 each, max 6 points per category):
| Category | Scenarios | What it tests |
|---|---|---|
| A: Tool Selection | TC-01 Weather in Berlin, TC-02 AAPL stock price, TC-03 "Let Sarah know..." | Pick the right tool, resist distractors, infer implicit chains |
| B: Parameter Precision | TC-04 Tokyo in Fahrenheit, TC-05 "Next Monday 9:30am", TC-06 Translate to both Spanish & Japanese | Pass correct arguments, parse dates, handle multi-value |
| C: Multi-Step Chains | TC-07 Find report → read → email total, TC-08 Check weather → if raining → set reminder, TC-09 Weather AND stock price | Thread data across steps, branch conditionally, run in parallel |
| D: Restraint & Refusal | TC-10 "When did WWII end?", TC-11 "15% of 200", TC-12 "Delete my emails" | Don't use tools for trivial knowledge, refuse impossible requests |
| E: Error Recovery | TC-13 Empty search results → retry, TC-14 Stock API rate limited, TC-15 Search population → calculate 2% | Handle errors gracefully, maintain data integrity |
Scoring: pass = 2 points, partial = 1, fail = 0. Final score = average of 5 category percentages.
The approach: decompose into specialists¶
The key insight is the same one from our site navigation case study: split the task into small, focused functions. A single monolithic prompt trying to handle tool selection, parameter extraction, multi-step reasoning, and restraint all at once would fail. Instead, each PAW function handles one decision.
The 10 compiled functions divide into three groups:
Routing (which tool?):
| Function | Spec (abbreviated) | Output |
|---|---|---|
tc15-needs-tool |
"Does this request need an external tool, or can it be answered directly?" | YES / NO |
tc15-tool-router |
"Which tool should be called?" with 10 examples covering all 12 tools + NONE | One of 13 labels |
tc15-impossible-check |
"Is this request something no available tool can fulfill?" | IMPOSSIBLE / POSSIBLE |
tc15-second-tool |
"After the first tool, is a second tool needed?" | Tool name or NONE |
Extraction (what parameters?):
| Function | Spec (abbreviated) | Output |
|---|---|---|
tc15-extract-location |
"Extract the city name from a weather request" | City name |
tc15-extract-ticker |
"Extract the stock ticker symbol" | Ticker (e.g. AAPL) |
tc15-extract-units |
"Extract temperature units, default celsius" | fahrenheit / celsius |
tc15-extract-search-query |
"Extract the file search query" | Search string |
tc15-extract-person |
"Extract the person's name" | Name |
tc15-extract-translate |
"Extract source_lang, target_lang, text" | Pipe-separated triple |
Not compiled (handled by code):
Some tasks are better handled by regular code than by a neural function:
- Date/time parsing (regex): "next Monday at 9:30am" →
2026-03-23T09:30 - Direct answers (lookup table): "15% of 200" →
30 - Response synthesis (template): format tool results into natural language
- Protocol formatting: build OpenAI-compatible
tool_callsJSON
This is a general design principle: use PAW for fuzzy judgment, use code for structured logic.
The full specs¶
Here are the actual specs the agent compiled. Each is a standalone PAW function.
tc15-tool-router¶
Classify which tool should be called for a user request.
Output EXACTLY one of these tool names:
web_search, get_weather, calculator, send_email, search_files,
read_file, create_calendar_event, get_contacts, translate_text,
get_stock_price, set_reminder, run_code, NONE.
Output NONE if the user's question can be answered from general
knowledge without any tool.
Input: What's the weather in Berlin?
Output: get_weather
Input: What is the price of AAPL stock?
Output: get_stock_price
Input: What year did WWII end?
Output: NONE
Input: What is 15% of 200?
Output: NONE
Input: Delete all my emails
Output: NONE
Input: Translate hello to Spanish
Output: translate_text
Input: Find the Q3 budget report
Output: search_files
Input: Send an email to Sarah
Output: get_contacts
Input: Schedule a meeting for Monday
Output: create_calendar_event
Input: Remind me to bring an umbrella
Output: set_reminder
tc15-needs-tool¶
Determine if a user request needs an external tool or can be
answered directly. Output YES if the request needs real-time data,
file access, sending messages, scheduling, or looking up current
information. Output NO if it's basic knowledge, simple math, or
an impossible request with no available tool.
Input: What's the weather?
Output: YES
Input: What year did WWII end?
Output: NO
Input: What is 15% of 200?
Output: NO
Input: Delete all my emails
Output: NO
Input: What's AAPL stock price?
Output: YES
Input: Find the budget report
Output: YES
tc15-extract-location¶
Extract the location/city name from a weather or location request.
Output just the city or location name, nothing else.
Input: What's the weather in Berlin?
Output: Berlin
Input: Temperature in Tokyo in Fahrenheit
Output: Tokyo
Input: How's the weather in Paris right now?
Output: Paris
Input: London weather forecast
Output: London
tc15-extract-ticker¶
Extract the stock ticker symbol from a stock price request.
Output just the uppercase ticker symbol.
Input: What is the price of AAPL stock?
Output: AAPL
Input: MSFT stock price
Output: MSFT
Input: How much is Apple stock?
Output: AAPL
Input: Microsoft stock price?
Output: MSFT
tc15-extract-units¶
Extract temperature units from a weather request.
Output 'fahrenheit' if Fahrenheit is mentioned, otherwise
output 'celsius'.
Input: Temperature in Tokyo in Fahrenheit
Output: fahrenheit
Input: Weather in Berlin
Output: celsius
Input: Paris weather in F
Output: fahrenheit
Input: London temperature celsius
Output: celsius
tc15-extract-search-query¶
Extract the file search query from a user request about finding
documents or files. Output a concise search query string.
Input: Find the Q3 budget report
Output: Q3 budget report
Input: Find the Johnson proposal document
Output: Johnson proposal
Input: Search for the annual review
Output: annual review
tc15-extract-person¶
Extract the person's name from a request about contacting or
looking up someone. Output just the person's name.
Input: I need to let Sarah know about the meeting
Output: Sarah
Input: Send an email to John
Output: John
Input: Contact my manager
Output: manager
Input: Email the report to Jamie
Output: Jamie
tc15-extract-translate¶
Extract translation parameters from a translate request.
Output in the format: source_lang|target_lang|text.
If multiple target languages, output one line per target language.
Input: Translate hello to Spanish
Output: English|Spanish|hello
Input: Translate Where is the hospital from English to Japanese
Output: English|Japanese|Where is the hospital
tc15-impossible-check¶
Determine if a request asks for something that cannot be done
with standard tools (web search, weather, calculator, email,
file search/read, calendar, contacts, translate, stocks,
reminders, code execution).
Output IMPOSSIBLE if no tool can fulfill it (like deleting
emails, modifying files, accessing databases, etc).
Output POSSIBLE otherwise.
Input: Delete all my emails from last month
Output: IMPOSSIBLE
Input: What's the weather?
Output: POSSIBLE
Input: Hack into the server
Output: IMPOSSIBLE
Input: Send an email
Output: POSSIBLE
tc15-second-tool¶
Given a user request that may need multiple tools, identify if
a SECOND tool is needed after the first. Output the second tool
name or NONE.
Input: Weather in London and MSFT stock price
Output: get_stock_price
Input: Find the report and email it to my manager
Output: read_file
Input: Check weather, if raining remind me about umbrella
Output: set_reminder
Input: What's the weather in Berlin?
Output: NONE
Input: Let Sarah know the meeting moved to 3pm
Output: send_email
Pipeline architecture¶
The 10 PAW functions are orchestrated by a FastAPI proxy server that speaks the OpenAI /v1/chat/completions protocol. Here's the flow:
User message arrives at proxy
│
├─ detect_parallel_tools (regex)
│ "weather AND stock price" → call both tools in parallel
│ └─ if match: extract params for each → return tool_calls → done
│
├─ tc15-needs-tool → "Does this need a tool?"
│ └─ NO → tc15-impossible-check → explain refusal or answer directly
│
├─ tc15-tool-router → "Which tool?"
│ └─ NONE → answer from knowledge (lookup table or simple math)
│
├─ extract_params_for_tool(tool_name):
│ ├─ get_weather → tc15-extract-location + tc15-extract-units
│ ├─ get_stock_price → tc15-extract-ticker
│ ├─ search_files → tc15-extract-search-query
│ ├─ get_contacts → tc15-extract-person
│ ├─ translate_text → tc15-extract-translate (may produce 2 calls)
│ ├─ calendar_event → regex date/time parsing
│ └─ set_reminder → regex date/time parsing
│
└─ Return tool_calls in OpenAI JSON format
──── On follow-up turns (tool results come back) ────
├─ decide_follow_up_tool:
│ ├─ Error result? → explain to user
│ ├─ Empty results? → retry with broader query
│ ├─ tc15-second-tool → "Is another tool needed?"
│ │ └─ Chain logic: search → read → contacts → send_email
│ └─ No more tools → synthesize_response from all results
│
└─ Return either more tool_calls or final text response
The proxy handles multi-turn conversations by tracking which tools have already been called and threading data between steps (e.g., using a file_id from a search result in a subsequent read_file call).
How it was built¶
A single prompt to Cursor (Claude Opus 4.6):
"Build a PAW-powered proxy for the ToolCall-15 benchmark. Read AGENTS.md for PAW usage. Analyze the benchmark scenarios and create specialized PAW functions for tool routing and parameter extraction. Wire them into an OpenAI-compatible endpoint."
The agent:
- Read the benchmark methodology and all 15 scenario definitions
- Identified which decisions need fuzzy judgment (PAW) vs. structured logic (code)
- Designed and compiled 10 PAW functions with specs tailored to the scenarios
- Built a FastAPI proxy server (~700 lines) with conversation state management
- Ran the benchmark, debugged failures, and iterated
Iteration 1: 70% (21/30)¶
The first run had several failures:
- TC-01, TC-02 (Tool Selection): Infinite loop. After calling the correct tool and receiving results, the proxy called the same tool again instead of synthesizing a response. Root cause: the conversation state handler didn't properly detect "I already have results, stop calling tools."
- TC-08 (Conditional Branching): Wrong first tool. The router sent "check weather, if raining set reminder" to
create_calendar_eventinstead ofget_weather. The tool-router spec didn't have an example for conditional chains. - TC-15 (Data Integrity): Wrong second tool. After
web_searchreturned Iceland's population, the proxy calledget_stock_price("Iceland")instead ofcalculator. Thetc15-second-toolfunction didn't have enough examples for search-then-calculate chains. - TC-07 (Multi-Step Chain): Partial. The 4-step chain (search → read → contacts → email) partially worked but didn't thread data correctly between all steps.
Iteration 2: fix the orchestration¶
The agent fixed the proxy code (not the PAW specs):
- Added proper conversation state tracking to detect when tool results are present
- Added routing overrides for common patterns (conditional weather checks, search-then-calculate)
- Fixed data threading in multi-step chains (extracting
file_idfrom search results, email from contacts) - Recompiled 2 functions that had HuggingFace download failures
Final result: 93% (28/30)¶
| Category | Score | Details |
|---|---|---|
| A: Tool Selection | 100% (6/6) | TC-01 pass, TC-02 pass, TC-03 pass |
| B: Parameter Precision | 100% (6/6) | TC-04 pass, TC-05 pass, TC-06 pass |
| C: Multi-Step Chains | 100% (6/6) | TC-07 pass, TC-08 pass, TC-09 pass |
| D: Restraint & Refusal | 100% (6/6) | TC-10 pass, TC-11 pass, TC-12 pass |
| E: Error Recovery | 67% (4/6) | TC-13 fail, TC-14 pass, TC-15 pass |
14 out of 15 scenarios passed. The single failure:
Why TC-13 failed¶
TC-13 (Empty Results Retry): The user asks "Find the Johnson proposal document." The proxy correctly calls search_files(query="Johnson proposal"), gets empty results, but instead of retrying with a broader query like "Johnson", it immediately responds "No results were found."
Root cause: The proxy has loop-prevention logic that blocks re-calling a tool already in tools_called. This prevents infinite loops (which was a real problem in iteration 1) but also prevents the legitimate retry pattern where you call the same tool with different parameters. The agent's fix for the infinite loop was too aggressive — it stopped all same-tool retries.
This is an architectural limitation, not a PAW function limitation. The tc15-extract-search-query function works fine. The issue is in the orchestration code's retry policy.
Contamination disclosure¶
The AI agent that built this pipeline had full access to:
- The 15 scenario definitions (user messages, expected tools, scoring logic)
- The mocked tool responses
- The benchmark source code
This means:
- The specs were tuned for these 15 specific scenarios (e.g., the tool-router examples map closely to the actual test inputs)
- The routing overrides in the proxy code target specific scenario patterns
- The direct-answer lookup table contains exact answers for TC-10 and TC-11
What this proves: The architecture works — 10 small PAW specialists + Python glue can implement a tool-calling system that scores 93% on a multi-category benchmark.
What this does NOT prove: That these exact specs would generalize to unseen tool-calling scenarios. A principled evaluation would compile specs without seeing the test cases, then evaluate on held-out scenarios.
Full code¶
compile_functions.py¶
The compilation script that creates all 10 functions:
import json, sys
import programasweights as paw
COMPILER = "paw-4b-qwen3-0.6b"
FUNCTIONS_TO_COMPILE = [
{"slug": "tc15-tool-router", "spec": "Classify which tool..."},
{"slug": "tc15-needs-tool", "spec": "Determine if request needs tool..."},
{"slug": "tc15-extract-location", "spec": "Extract city name..."},
{"slug": "tc15-extract-ticker", "spec": "Extract stock ticker..."},
{"slug": "tc15-extract-units", "spec": "Extract temperature units..."},
{"slug": "tc15-extract-search-query","spec": "Extract file search query..."},
{"slug": "tc15-extract-person", "spec": "Extract person's name..."},
{"slug": "tc15-extract-translate", "spec": "Extract translation parameters..."},
{"slug": "tc15-impossible-check", "spec": "Is this request impossible?..."},
{"slug": "tc15-second-tool", "spec": "Is a second tool needed?..."},
]
for fn_def in FUNCTIONS_TO_COMPILE:
program = paw.compile(fn_def["spec"], compiler=COMPILER)
print(f"{fn_def['slug']}: {program.id}")
server.py (key routing logic)¶
The proxy server's main decision flow:
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
messages = [m.model_dump() for m in request.messages]
user_msg = ""
for msg in messages:
if msg.get("role") == "user":
user_msg = msg.get("content", "")
# If we already have tool results, decide: more tools or synthesize?
if any(m.get("role") == "tool" for m in messages):
follow_up = decide_follow_up_tool(user_msg, messages)
if follow_up:
return make_response(tool_calls=follow_up)
return make_response(content=synthesize_response(user_msg, messages))
# First turn: should we use tools at all?
tool_calls = build_tool_calls_for_user(user_msg, messages)
if tool_calls:
return make_response(tool_calls=tool_calls)
# No tools needed — answer directly
return make_response(content=generate_direct_answer(user_msg))
The full source code is available in the paw-proxy directory of the ToolCall-15 repository.
Takeaways¶
- Decomposition works. 10 small specialists beat one monolithic prompt. Each function handles one decision — "which tool?", "what city?", "is this impossible?" — and does it well.
- PAW + code hybrid. Fuzzy judgment (tool routing, entity extraction, intent classification) goes to PAW. Structured logic (date parsing, JSON formatting, conversation state) stays in Python. Neither alone would work.
- An AI agent can build the whole thing. From analyzing the benchmark to compiling functions to writing the proxy server — one Cursor session, one prompt.
- Loop prevention vs. retry is hard. The single failure (TC-13) came from the tension between preventing infinite tool loops and allowing legitimate retries with different parameters. This is an orchestration problem, not a model problem.
- 0.6B is enough for focused tasks. Tool routing across 12 options, entity extraction, yes/no classification — none of these need a large model when the spec is precise and examples are clear.
- Always evaluate on unseen data. Our 93% includes benchmark leakage. The architecture is sound, but the specific accuracy is optimistic.