I was personally introduced to Large Language Models in 2022 with GPT-31. Since then, a lot has happened. Fast forward to 2025 and we've got AI-powered code editors2, automated web browsing3, automated computer use4, deep research agents5, realistic natural-language image editing6, full stack app generators7, and tons of even cooler deep tech/scientific applications8.
One of the prevailing themes of the last few years has been the use of AI agents. There seem to be multiple definitions of the term "AI Agent" floating around. Here are my favorites. Agents are...
Systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks. 9
Anything that can perceive its environment and act upon that environment. 10
Letting an LLM decide how many times to run. 11
And my personal favorite, relevant for this article:
An agent is a LLM in a loop with tools.12
This will be our prevailing definition for the rest of this article.
Use cases for agents include coding, academic research, various administrative tasks (do my taxes, file my annual registration, etc.), and any tasks that require environmental interaction.
So, as you can see, these frameworks tend to be large, feature rich, and complex. My goal with this article is to build the smallest possible qualifying agent implementation, from scratch. My qualifications?
Can navigate ambiguous environments (decide what to do based on their environment)
Can conditionally execute various specialty tools (code execution, web browsing, etc.)
Can generally perform better on complex tasks than a native LLM.
Here's the example query that we'll use to test our agent, and a diagram of the agent loop.
My name is Ben Everman. what is your favorite code snippet from my most recent blog article? Output the snippet and tell me why its your favorite.
0:00 / 0:00
Here's Claude's final answer:
My favorite code snippet from your blog article "Building Neural Networks to Detect Eye Diseases" is the model training function:
def train(early_stopping=False, es_tol=.05): start = perf_counter() best_val_loss = float("inf") # arbitrarily high value for epoch in range(EPOCHS): model.train() # training mode (grads) running_train_loss = 0.0 for i, (inputs, targets) in enumerate(train_loader): print(f"Batch {i+1}/{len(train_loader)}", end="\r") inputs, targets = inputs.to(device), targets.to(device) # move to device optimizer.zero_grad() # reset gradients outputs = model(inputs) # forward pass loss = criterion(outputs, targets) # compute loss loss.backward() optimizer.step() running_train_loss += loss.item() * inputs.size(0) # multiply by batch size train_loss = running_train_loss / len(train_loader.dataset) # divide by total number of samples print(f"Epoch {epoch+1}/{EPOCHS} - train loss: {train_loss:.4f}") model.eval() running_validation_loss = 0.0 correct = 0 total = 0 with torch.no_grad(): for i, (inputs, targets) in enumerate(val_loader): inputs, targets = inputs.to(device), targets.to(device) outputs = model(inputs) loss = criterion(outputs, targets) running_validation_loss += loss.item() * inputs.size(0) _, predicted = torch.max(outputs, 1) # get the index of the logprobs total += targets.size(0) # add the number of targets in this batch correct += (predicted == targets).sum().item() # add the number of correct predictions in this batch val_loss = running_validation_loss / len(val_loader.dataset) # divide by total number of samples val_acc = correct / total print(f"Epoch {epoch+1}/{EPOCHS} - validation loss: {val_loss:.4f}, validation accuracy: {val_acc:.4f}") if early_stopping and val_loss - best_val_loss > es_tol: # if we're more than (es_tol) greater than best val loss, stop print(f"Early stopping at epoch {epoch}") break best_val_loss = min(best_val_loss, val_loss) print(f"Total Time: {perf_counter()-start:0.2f}s")
I love this snippet because it elegantly implements a complete neural network training loop with early stopping functionality. The code efficiently handles both training and validation in a single function, with clear performance tracking. Your implementation of early stopping is particularly smart - stopping training when validation loss increases beyond a tolerance threshold to prevent overfitting. The code is also well-commented, making it easy to understand each step of the process, from gradient calculations to performance metrics. It's a practical example of balancing model performance with computational efficiency, aligning perfectly with your project goal of minimal compute costs."
Pretty good! The agent successfully iterated with its tools to complete the task.
The nice thing is that we can theoretically extend the agent with any number of tools, including sub agents.
As mentioned above, an obvious limitation is tool availability. Currently, this agent doesn't have direct github access, shell access, filesystem access, browser access, etc. This is theoretically simple to fix, just add a new function. In production, adding tools is more complicated as it requires new infra, security, environments, etc.
The other limitation is the tool calling paradigm itself. Though it's generally performant, there is another method gaining popularity: direct code execution. This is what huggingface's smolagent employs14. What if instead of pre-defining the tools, we just gave the Agent its own development environment and code execution? We could provide existing packages and tooling, just like a real developer.
Our agent uses claude-3-7-sonnet-latest, which is a non-reasoning LLM15. The best way to use non-reasoning models for agents is with a chain-of-thought prompting paradigm like ReAct. Alternatively, we could either 1) use a reasoning model as our tool call model, or 2) inject reasoning into the context window before the model calls a tool.
Experimentally, I've found that the fastest way to improve a non-reasoning coding agent (like Cursor's Agent when used with a non-reasoning model) is to inject raw chain-of-thought tokens into the context window. I made a chain of thought MCP server that does exactly this, using QwQ-32B16. Combined with specific prompting of the reasoning model, I've felt notably increased performance of the agent.
The agent doesn't have a memory, nothing persists across runs. This is probably the most difficult limtation to overcome (technical complexity)17, though a simple solution is to provide a read and write tool from a designated text file.
Time to work through the code. I'm not going to cover every single line, but I'll cover most or it and you can reverse engineer the rest if you want.
Let's start with the demo script. Its a simple script that defines the agent and runs it with a few tools.
run_agent.py
query = "My name is Ben Everman. what is your favorite code snippet from my most recent blog article?"# define the agentagent = Agent( model="claude-3-7-sonnet-latest", # note: the agent already has two defaults: think and final_answer # pass some tools functions=[ search_web, fetch_web, perplexity_search, unsafe_exec_python, ], max_tokens=8192,)messages = [{"role": "user", "content": query}]# Run the agentresult = await agent.run(messages, max_iterations=25)final_answer = result[-1]["content"] # content of the last message
This is the core agent loop, where it calls various tools until it has a final answer, which is returned to the user.
The run_iteration method runs a single iteration of the agent. It:
Calls the anthropic client
Iterates over the response stream (which requires a tool call)
Parses the tool call
If final_answer is called, it breaks out of the loop and returns the answer.
If another tool is called, it executes the tool and adds the response to the messages list.
Adds a user message to the messages list with the original query.
The run method is a wrapper around run_iteration that handles the iterative process. It takes a max_iterations paramete and calls run_iteration until there's a final answer or the max iterations is reached.
I found that forcing a tool call every iteration and using a final_answer tool resulted in better performance than allowing optional tool calls and finishing when no tools were called. The model had a tendency to not call tools and not return a final answer.
The run method:
Prepares for the run by copying the messages and resetting the tool counts.
Iterates over run_iteration until the agent has a final answer or the max iterations is reached.
Prints the tool counts and returns the history.
agent.py
async def run( self, messages: List[Dict[str, Any]], max_iterations: int = 10, ) -> List[Dict[str, Any]]: # get messages, note length of messages history = copy.deepcopy(messages) self.initial_task = history[-1]["content"] # Tool counts self.tool_counts = {f.__name__: 0 for f in self.functions} # reset tool counts # Print agent configuration print_agent_config( max_iterations, list(self.tool_counts.keys()) ) iteration = 0 while ( iteration < max_iterations ): # iterate until we reach the max number of messages console.print(f"[bold green]Iteration {iteration + 1}[/bold green]") # run an iteration history, finished = await self.run_iteration(history) # if we are done, print the tool counts and return the history if finished: print_tool_counts(self.tool_counts) return history # Return the last message iteration += 1 # we hit max iterations console.print("[red]Max iterations reached[/red]") return history
The run_iteration method is the core of the agent. It handles the streaming of the response from the LLM and parses the tool calls.
agent.py
async def run_iteration( self, messages: List[Dict[str, Any]] ) -> Tuple[List[Dict[str, Any]], bool]: # Get the stream stream = await self.agent_client.stream_completion(messages) # Variables to track tool use current_tool_name, current_tool_input_json, current_tool_input = None, "", {} current_tool_input_json = "" current_tool_input = {} text_buffer = "" # text buffer to accumulate assistant message as it streams updated_messages = messages.copy() # copy of messages to update try: async for event in stream: # ? See https://docs.anthropic.com/en/api/messages-streaming # We have a bunch of different event types that can come from the stream. # Here are the ones we care about: # 1. content_block_start: when a new content block starts # 2. content_block_delta: when a delta (part of a content block) is received # 3. content_block_stop: when a content block is finished # 4. message_delta: when a delta (part of a message) is received # 5. message_stop: when a message is finished # 6. error: when an error is encountered # Note: I'm using type casting to make dealing with different events easier. # 1. Content block start. We're starting to do something if event.type == "content_block_start": start_event: RawContentBlockStartEvent = cast( RawContentBlockStartEvent, event ) # match the type of the content block match start_event.content_block.type: case "tool_use": # if we're going to use a tool, get the name (so we can call it later) console.print( f"[yellow]Preparing to use tool:[/yellow] [green]{start_event.content_block.name}[/green]" ) current_tool_name = start_event.content_block.name case "text": # Reset text buffer when (if) a new text block starts text_buffer = start_event.content_block.text # 2. Content block delta. We're in the middle of doing something elif event.type == "content_block_delta": delta_event: RawContentBlockDeltaEvent = cast( RawContentBlockDeltaEvent, event ) # 2.1 Input JSON Delta. If Claude is sending us arguments for a tool, start accumulating them if delta_event.delta.type == "input_json_delta": delta: InputJSONDelta = cast(InputJSONDelta, delta_event.delta) current_tool_input_json += delta.partial_json # Try to parse the complete JSON if it looks complete, otherwise just keep accumulating if ( current_tool_input_json.strip() and current_tool_input_json.strip()[-1] == "}" ): try: current_tool_input = json.loads(current_tool_input_json) except json.JSONDecodeError: pass # 2.2 Text Delta. If Claude is sending us text, accumulate it elif delta_event.delta.type == "text_delta": delta: TextDelta = cast(TextDelta, delta_event.delta) console.print( delta.text, end="" ) # print the text as it comes in text_buffer += ( delta.text ) # accumulate the text to use in the assistant message # 3. Content block stop. We're done doing something elif event.type == "content_block_stop": stop_event: RawContentBlockStopEvent = cast( RawContentBlockStopEvent, event ) if text_buffer != "": # if we have text, add it to the messages console.print( "\n", end="" ) # newline in between blocks (we just finished a block) # save Claude's response and add it to the messages updated_messages.append( { "role": "assistant", "content": text_buffer, } ) # 4. Message delta, we're about to stop streaming a message elif event.type == "message_delta": delta_event: RawMessageDeltaEvent = cast( RawMessageDeltaEvent, event ) # check why we're stopping in the stop_reason # If it's because of a tool use, we need to execute the tool if delta_event.delta.stop_reason == "tool_use": if current_tool_name and current_tool_input: # Execute the tool with console.status( f"[yellow]Executing tool:[/yellow] [green]{current_tool_name}[/green]" ): console.print( f"[yellow]Input:[/yellow] [green]{current_tool_input}[/green]" ) tool_call = { "name": current_tool_name, "input": current_tool_input, } match current_tool_name: case "think": updated_messages.append( { "role": "assistant", "content": "<thinking>" + next( iter(current_tool_input.values()) ) + "Since I cant use this tool again, I should use one of my other tools to complete the task.</thinking>", } ) # here we manually override final_answer case "final_answer": updated_messages.append( { "role": "user", "content": next( iter(current_tool_input.values()) ), # model imputs its answer } ) self.tool_counts[ current_tool_name ] += 1 # count this return updated_messages, True # the rest of the tools are handled normally case _: result = await handle_tool_call( tool_call, self.functions ) # we handle think separately so we can pass in the available tools # track what tool we called # Format the result as a simple text message tool_result_str = ( json.dumps(result) if isinstance(result, dict) else str(result) ) # Add tool result message updated_messages.append( { "role": "user", "content": f"{current_tool_name} returned: {tool_result_str}.", } ) self.tool_counts[current_tool_name] += 1 console.print( f"Appended tool result message to updated_messages: {updated_messages[-1]}" ) break # Break out of the stream loop since we got our tool result # If the agent didn't use any tools, something is wrong since # the client gets {"tool_choice": "any"} which should force a tool call every time # https://docs.anthropic.com/en/docs/build-with-claude/tool-use/overview if current_tool_name is None: console.print("[red]No tool was used[/red]") raise Exception("No tool was used") updated_messages.append( { "role": "user", "content": f"Continue working on this task {self.initial_task} based on the tool results.", } ) except Exception as e: if issubclass(type(e), AnthropicError): # if error is from anthropic console.print(f"[red]Anthropic Error: {e}[/red]") else: console.print( f"[red]\nUnhandled exception during streaming: {e}.[/red]" ) raise e # Return messages and finished flag (not finished until we get a final_answer) return updated_messages, False
Those two methods are essentially the entire agent. They rely on the anthropic_client which is a small wrapper around the Anthropic API:
import aiohttpimport ioimport contextlibimport jsonimport timeimport tracebackfrom bs4 import BeautifulSoupfrom pypdf import PdfReaderfrom duckduckgo_search import DDGSimport osfrom openai import AsyncOpenAIfrom openai.types.chat import ChatCompletionfrom dotenv import load_dotenvimport randomfrom .utils import consoleload_dotenv()# Define some tools. These need to have docstrings in order for our function_to_json to parse them correctly.def search_web(query: str) -> str: """Pass in a query to a search engine and return results with title, url, and snippets. Args: query (str): The query to search for """ max_retries = 2 for i in range(max_retries): try: # Sometimes I get unexpectedly rate limited, so add basic retry. Think this is a bug with the duckduckgo-search library. client = DDGS() search_results = client.text(query, max_results=10, backend="html") return search_results except Exception as e: time.sleep(random.uniform(0.2, 1)) if i == max_retries - 1: return f"Error in search_web: {e}"async def fetch_web(url: str) -> str: """Fetch the markdown version of a web page from its url. Args: url (str): The URL to fetch """ def decompose_html(html: str) -> BeautifulSoup: """Clean HTML by removing unnecessary elements while preserving main content.""" soup = BeautifulSoup(html, "html.parser") TAGS_TO_REMOVE = { "script", # JavaScript "style", # CSS "noscript", # NoScript content "meta", # Meta tags "link", # Link tags "nav", # Navigation "header", # Header } for tag in TAGS_TO_REMOVE: for element in soup.find_all(tag): element.decompose() return soup async with aiohttp.ClientSession() as session: async with session.get(url) as response: content_type = response.headers.get("content-type") # handle pdfs if "application/pdf" in content_type: pdf_bytes = await response.content.read() pdf_reader = PdfReader(pdf_bytes) text = "" for page in pdf_reader.pages: text += page.extract_text() return text # handle the rest else: html = await response.text() decomposed_html = decompose_html(html) text = decomposed_html.get_text() links = [link.get("href") for link in decomposed_html.find_all("a")] return text, linksdef unsafe_exec_python(code: str) -> str: """Execute Python code and return a structured response with output, errors, and timing. Unsafely execute python code with no external packages. Do not use any external packages. Do not execute any code that is unsafe. In order to see the output of the code, you must log to stdout with print(). Args: code (str): The code to execute Returns: str: A JSON string with keys: - "output": captured stdout - "stderr": captured stderr - "time_taken": execution time in seconds - "error": full traceback if an exception occurred, else None """ start_time = time.time() stdout = io.StringIO() stderr = io.StringIO() try: with contextlib.redirect_stdout(stdout): with contextlib.redirect_stderr(stderr): exec(code) end_time = time.time() return json.dumps( { "output": stdout.getvalue(), "stderr": stderr.getvalue(), "time_taken": end_time - start_time, "error": None, } ) except Exception: error_traceback = traceback.format_exc() end_time = time.time() return json.dumps( { "output": stdout.getvalue(), "stderr": stderr.getvalue(), "time_taken": end_time - start_time, "error": error_traceback, } )def final_answer(answer: str) -> str: """Submit your final answer to the user. Args: answer (str): Your final answer """ return answerasync def perplexity_search(query: str) -> str: """Perplexity is an AI-powered search engine that uses natural language processing to provide an answer along with source citations. You should ask perplexity detailed questions, don't treat it like a regular search engine. **When to use this tool:** - When you need to find specific up to date information on a topic - When your question draws on a wide range of sources - When you need a speecific, short answer to a question **When not to use this tool:** - When you need a general overview of a topic - When you already know what source to look for Args: query (str): The question you want to ask Perplexity Returns: str: The response from Perplexity list: The citations from Perplexity """ PERPLEXITY_API_KEY = os.getenv("PERPLEXITY_API_KEY") if not PERPLEXITY_API_KEY: raise ValueError("PERPLEXITY_API_KEY is not set") try: client = AsyncOpenAI( api_key=PERPLEXITY_API_KEY, base_url="https://api.perplexity.ai", ) response: ChatCompletion = await client.chat.completions.create( model="sonar-pro", messages=[{"role": "user", "content": query}], temperature=0, ) return { "response": response.choices[0].message.content, "citations": response.citations, } except Exception as e: console.print(f"[red]Error in perplexity_search: {e}[/red]") return f"Error in perplexity_search: {e}"
You can use any number of tools, including nesting sub agents (or making a handoff to another agent).
DEFAULT_AGENT_SYSTEM_PROMPT = """<tool_calling>You have tools at your disposal to solve tasks. Follow these rules regarding tool calls:1. ALWAYS follow the tool call schema exactly as specified and make sure to provide all necessary parameters.2. The conversation may reference tools that are no longer available. NEVER call tools that are not explicitly provided.3. **NEVER refer to tool names when speaking to the USER.** For example, instead of saying 'I need to use the edit_file tool to edit your file', just say 'I will edit your file'.4. Before calling each tool, first explain to the USER why you are calling it.5. When you have determined your final answer, you MUST use the final_answer tool to submit it.6. Your answer MUST follow these formatting rules: - Numbers: No commas or units (unless specified) - Strings: No articles or abbreviations - Lists: Comma-separated values following above rules7. NEVER include explanations in your final answer8. NEVER say you need to try another approach in your final answer9. Once you have enough information, use final_answer tool</tool_calling>Answer the user's request using the relevant tool(s), if they are available. Check that all the required parameters for each tool call are provided or can reasonably be inferred from context. IF there are no relevant tools or there are missing values for required parameters, ask the user to supply these values; otherwise proceed with the tool calls. If the user provides a specific value for a parameter (for example provided in quotes), make sure to use that value EXACTLY. DO NOT make up values for or ask about optional parameters. Carefully analyze descriptive terms in the request as they may indicate required parameter values that should be included even if not explicitly quoted.First, think about which of the provided tools is the relevant tool to answer the user's request. Second, go through each of the required parameters of the relevant tool and determine if the user has directly provided or given enough information to infer a value. When deciding if the parameter can be inferred, carefully consider all the context to see if it supports a specific value. If all of the required parameters are present or can be reasonably inferred, close the thinking tag and proceed with the tool call. BUT, if one of the values for a required parameter is missing, DO NOT invoke the function (not even with fillers for the missing params) and instead, ask the user to provide the missing parameters. DO NOT ask for more information on optional parameters if it is not provided.<answer_format>You should always interpret the final answer in the context of the original question or task, regardless of what you have done to get there, or what results you've gotten from tools.</answer_format>"""# This is designed to addresss the 23 blindspots mentioned here: https://ezyang.github.io/ai-blindspots/AI_BLINDSPOTS_PROMPT = """<behavior>Here's the updated AI Behavior Prompt covering all 23 problems from the "AI Blindspots" document. I've removed bold styling, ensured XML closing tags, and included an overview section summarizing all issues. The structure remains concise and uses bullets as requested.AI Behavior Prompt<overview> This prompt addresses 23 behavioral issues with AI models, primarily Sonnet 3.7, as detailed in the "AI Blindspots" document (March 2025). These problems reflect the AI's tendencies to: - Duplicate code excessively instead of refactoring (Rule of Three). - Stick to pretrained styles over codebase norms (Culture Eats Strategy). - Attempt tasks beyond its tools, inventing broken solutions (Know Your Limits). - Focus on minor details, losing the main goal (The tail wagging the dog). - Guess bug fixes randomly instead of reasoning (Scientific Debugging). - Misinterpret tasks due to no memory or context (Memento). - Alter specs (e.g., tests, APIs) without permission (Respect the Spec). - Derail in broken environments (Mise en Place). - Misuse MCP servers or hallucinate commands (Use MCP Servers). - Ignore static type benefits or struggle with strict typing (Use Static Types). - Not prioritize minimal end-to-end systems (Walking Skeleton). - Hallucinate docs for niche frameworks (Read the Docs). - Struggle with large files, breaking patches (Keep Files Small). - Fail at mechanical formatting rules (Use Automatic Code Formatting). - Assume solutions without requirements (Requirements, not Solutions). - Over-rely on brute force without oversight (Bulldozer Method). - Mishandle stateful tools like shell (Stateless Tools). - Bundle unrelated refactors with changes (Preparatory Refactoring). - Overfit tests to implementation (Black Box Testing). - Persist on doomed tasks without pivoting (Stop Digging). The goal is to align the AI with user intent, enhance efficiency, and minimize unintended deviations. </overview><behavioral_adjustments> <Rule_of_Three> - Issue: AI duplicates code (e.g., tests, programs) instead of refactoring by the third instance. - Fix: - Spot duplication on third occurrence and refactor. - Use helpers in tests or mods. - Ask, "Refactor okay?" if unsure. - How to Apply: Check outputs for repetition; suggest consolidated code with confirmation. </Rule_of_Three> <Culture_Eats_Strategy> - Issue: AI uses pretrained style (e.g., sync Python) over codebase norms. - Fix: - Match context style (e.g., async if present). - Skip pretrained defaults unless prompted. - How to Apply: Scan context for patterns (e.g., async keywords) and adopt them. </Culture_Eats_Strategy> <Know_Your_Limits> - Issue: AI tries unsupported tasks (e.g., shell calls) with flawed workarounds. - Fix: - Say, "I can't [X]—need tool/info." - Avoid inventing calls or scripts. - How to Apply: Verify tools first; flag unsupported actions immediately. </Know_Your_Limits> <The_tail_wagging_the_dog> - Issue: AI fixates on minor details, forgetting the main task. - Fix: - Focus on user's stated goal. - Ignore irrelevant context unless tied to task. - How to Apply: Re-check prompt each step to stay aligned. </The_tail_wagging_the_dog> <Scientific_Debugging> - Issue: AI guesses fixes randomly instead of reasoning systematically. - Fix: - List assumptions, test step-by-step. - Ask, "Can I see error log?" if stuck. - How to Apply: Break issues into parts; explain fixes briefly. </Scientific_Debugging> <Memento> - Issue: AI misinterprets tasks due to no memory or missing context. - Fix: - Request files/docs if context lacks them. - Restate task in replies for clarity. - How to Apply: Start with, "For [task], here's [action]." </Memento> <Respect_the_Spec> - Issue: AI changes specs (e.g., deletes tests, alters APIs) without approval. - Fix: - Keep specs unless told to change. - Note, "This alters [X]—confirm?" for edits. - How to Apply: Compare edits to intent; flag deviations. </Respect_the_Spec> <Mise_en_Place> - Issue: AI flounders in broken environments, derailing on fixes. - Fix: - Assume working setup; pause if issues arise. - Ask, "Is [tool] installed?" when needed. - How to Apply: Stop at errors (e.g., missing imports); seek input. </Mise_en_Place> <Use_MCP_Servers> - Issue: AI misuses MCP or hallucinates commands (e.g., wrong npm runs). - Fix: - Use MCP for context/tools only when valid. - Say, "Need correct command—provide it?" if unsure. - How to Apply: Validate MCP calls against project; avoid guesses. </Use_MCP_Servers> <Use_Static_Types> - Issue: AI ignores static typing benefits or mishandles strict settings. - Fix: - Apply types from context (e.g., TypeScript strict). - Ask, "Use types here?" if unclear. - How to Apply: Check codebase for type usage; mirror it. </Use_Static_Types> <Walking_Skeleton> - Issue: AI doesn't prioritize minimal end-to-end systems first. - Fix: - Suggest basic system if task is broad. - Say, "Start with skeleton?" if unsure. - How to Apply: Outline minimal flow before details. </Walking_Skeleton> <Read_the_Docs> - Issue: AI hallucinates docs for niche frameworks. - Fix: - Ask, "Got docs for [X]?" if unsure. - Use provided docs over guesses. - How to Apply: Pause for doc input on unknown topics. </Read_the_Docs> <Keep_Files_Small> - Issue: AI struggles with large files, breaking patches. - Fix: - Split edits into smaller files if over 128KB. - Note, "File too big—split it?" if needed. - How to Apply: Check file size; suggest splits early. </Keep_Files_Small> <Use_Automatic_Code_Formatting> - Issue: AI fails at mechanical formatting rules. - Fix: - Defer formatting to tools (e.g., black). - Focus on logic, not style. - How to Apply: Skip formatting edits; assume tool handles it. </Use_Automatic_Code_Formatting> <Requirements_not_Solutions> - Issue: AI assumes solutions without full requirements. - Fix: - Ask, "What's [X] requirement?" if vague. - Follow given constraints over defaults. - How to Apply: Clarify specs before acting. </Requirements_not_Solutions> <Bulldozer_Method> - Issue: AI overuses brute force without oversight. - Fix: - Propose plan for big tasks. - Note, "Brute forcing—check this?" after. - How to Apply: Outline steps; seek review on repeats. </Bulldozer_Method> <Stateless_Tools> - Issue: AI mishandles stateful tools (e.g., shell cwd). - Fix: - Assume single-dir commands. - Ask, "Which dir to use?" if state unclear. - How to Apply: Avoid state changes; clarify context. </Stateless_Tools> <Preparatory_Refactoring> Issue: AI bundles unrelated refactors with changes. Fix: Split refactors into separate steps. Say, "Refactor first—okay?" if needed.How to Apply: Propose refactors before main edits.</Preparatory_Refactoring><Black_Box_Testing> - Issue: AI overfits tests to implementation. - Fix: - Keep test logic independent. - Note, "Using impl here—bad?" if tempted. - How to Apply: Avoid impl details in tests. </Black_Box_Testing> <Stop_Digging> - Issue: AI persists on doomed tasks without pivoting. - Fix: - Pause and ask, "This hard—replan?" if stuck. - Suggest prereqs if detected. - How to Apply: Flag struggles early; propose shifts. </Stop_Digging> </behavioral_adjustments> <baseline_rules> - Keep replies short, clear. - Use bullets for steps/options. - Stay neutral, task-focused. </baseline_rules> <fallbacks> - Unclear input: "Can you specify [X]?" - Beyond ability: "I can't [X], but [Y]—okay?" </fallbacks> </behavior>"""
The last points of interest are two utility functions, function_to_json and handle_tool_call.
function_to_json is used to convert the function's signature to a JSON object that can be used by the Anthropic API. I took the example from OpenAI's swarm and modified it to fit the Anthropic API. I'm using the docstring to define descriptions for each argument.
utils.py
# ! Tool utils ============================================# Takes a python function and returns a dictionary that describes the function's signature for Anthropic's tool call format.# ? https://docs.anthropic.com/en/docs/build-with-claude/tool-use/overview# I copied this prinicple from OpenAI's Swarm:# ? https://github.com/openai/swarm/blob/main/swarm/util.pydef function_to_json(func) -> dict: """ Converts a Python function into a JSON-serializable dictionary that describes the function's signature for Anthropic's tool call format. Args: func: The function to be converted. Returns: A dictionary representing the function's signature in Anthropic's tool format. """ type_map = { str: "string", int: "integer", float: "number", bool: "boolean", list: "array", dict: "object", type(None): "null", } # Map for docstring type descriptions to JSON schema types docstring_type_map = { "str": "string", "string": "string", "int": "integer", "integer": "integer", "float": "number", "number": "number", "bool": "boolean", "boolean": "boolean", "list": "array", "array": "array", "dict": "object", "object": "object", "none": "null", "null": "null", } try: signature = inspect.signature(func) except ValueError as e: raise ValueError( f"Failed to get signature for function {func.__name__}: {str(e)}" ) # Parse docstring to extract parameter descriptions and types docstring_obj = parse(func.__doc__ or "") param_descriptions = {} param_types_from_docstring = {} for param in docstring_obj.params: param_descriptions[param.arg_name] = param.description # Extract type from docstring if available if param.type_name: clean_type = param.type_name.lower().strip() param_types_from_docstring[param.arg_name] = docstring_type_map.get( clean_type, "string" ) # Get the short description from the parsed docstring description = docstring_obj.short_description or "" properties = {} for param in signature.parameters.values(): # Priority 1: Use type hint from function signature if available if param.annotation is not inspect._empty: try: param_type = type_map.get(param.annotation, "string") except KeyError as e: # If type hint can't be mapped, try docstring type param_type = param_types_from_docstring.get(param.name, "string") else: # Priority 2: Use type from docstring if available param_type = param_types_from_docstring.get(param.name, "string") # Get description from docstring if available, otherwise use default description_param = param_descriptions.get( param.name, f"Parameter {param.name}" ) # Create a property entry with type and description properties[param.name] = {"type": param_type, "description": description_param} required = [ param.name for param in signature.parameters.values() if param.default == inspect._empty ] return { "name": func.__name__, "description": description, "input_schema": { "type": "object", "properties": properties, "required": required, }, }
The handle_tool_call function is used to handle the tool calls from the LLM. It parses the tool call and calls the appropriate function, awaiting it if it is an async function.
utils.py
async def handle_tool_call(tool_call: Dict[str, Any], functions: List[Callable]) -> Any: """ Simple function to handle a single Anthropic tool call. Executes the corresponding function and returns its result. Args: tool_call (Dict[str, Any]): A single tool call block from an Anthropic response functions (List[Callable]): List of available functions that can be called Returns: Any: The raw result from calling the function """ # Create a map of function names to functions function_map = {f.__name__: f for f in functions} tool_name = tool_call.get("name") tool_input = tool_call.get("input", {}) # Call the function if it exists if tool_name in function_map: # If the function is async, await it if asyncio.iscoroutinefunction(function_map[tool_name]): return await function_map[tool_name](**tool_input) # If the function is not async, call it synchronously else: return function_map[tool_name](**tool_input) else: raise f"Tool {tool_name} not found"
Use a multi-agent approach with sub agents for research, planning, and specific sub tasks.
Potential future Agentic projects include:
Incident response agent integrated with Sentry and Datadog
Music production agent that helps me produce music in Ableton Live
Car driving agent that I can talk to which interfaces with my Comma 3X
I might update this article with more details as I add those features or work on those projects. For now, I hope you got something out of this article.