LLM-Powered Agents: Technical Deep Dive
This post is technical. If you're a business stakeholder wanting to understand AI agents, read our non-technical guide instead.
Still here? Good. Let's get into the details of how LLM-powered agents actually work.
The Basic Agent Loop
At its core, an LLM agent is a loop:
while not task_complete:
    1. Observe: Gather current state and context
    2. Reason: LLM decides what to do next
    3. Act: Execute the chosen action
    4. Evaluate: Check if goal is achieved
The sophistication comes from how each step is implemented.
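Stripped to its essentials, the loop can be sketched in a few lines of Python. This is illustrative only: `llm_decide`, `execute`, and `goal_met` are hypothetical stand-ins for the real model call, tool layer, and success check.

```python
def run_agent(task, llm_decide, execute, goal_met, max_steps=10):
    """Observe-reason-act-evaluate loop (illustrative sketch).

    llm_decide, execute, and goal_met are stand-ins for the real
    model call, tool gateway, and success check.
    """
    observation = task                    # initial observation is the task itself
    for _ in range(max_steps):            # hard cap prevents runaway loops
        action = llm_decide(observation)  # Reason: model picks the next action
        observation = execute(action)     # Act: run it through the tool layer
        if goal_met(observation):         # Evaluate: stop once the goal is met
            return observation
    raise RuntimeError("agent did not finish within max_steps")
```

Note the `max_steps` cap: even this toy version needs a circuit breaker, a theme that recurs in the rate-limiting section below.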
Model Selection
Not all LLMs are equal for agent tasks. Considerations:
Reasoning Capability
Agents need models that can:
- Follow multi-step instructions
- Use tools correctly
- Know when they don't know something
- Recover from errors
As of early 2025, our typical choices:
- GPT-4 / GPT-4 Turbo: Strong reasoning, good tool use, expensive
- Claude 3 Opus/Sonnet: Excellent reasoning, better at nuance, good safety properties
- Llama 3 70B: Capable, can self-host, lower inference cost
- Mistral Large: Good balance of capability and cost
We avoid GPT-3.5 and smaller models for autonomous agent tasks. The error rate is too high.
Context Window
Agents often need long context:
- Conversation history
- Retrieved documents
- Tool definitions
- System instructions
GPT-4 Turbo's 128K context and Claude's 200K context are game-changers. You can fit substantial knowledge without aggressive truncation.
Tool Use Support
Native function calling is now standard:
- OpenAI function calling
- Anthropic tool use
- Structured output modes
These are more reliable than prompting the model to output JSON. Use them.
Cost at Scale
At high volume, cost matters. Rough pricing (early 2025):
- GPT-4 Turbo: ~$10-30 per 1M tokens
- Claude Sonnet: ~$3-15 per 1M tokens
- Self-hosted Llama: Infrastructure cost only
For an agent handling 10K interactions/month averaging 4K tokens each (40M tokens/month), that's:
- GPT-4 Turbo: ~$400-1200/month on API alone
- Self-hosted: ~$200-500/month on infrastructure
Worth modelling before committing.
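The arithmetic is simple enough to put in a spreadsheet or a few lines of code. A minimal sketch (the figures below are the rough early-2025 prices quoted above, not current rates):

```python
def monthly_api_cost(interactions, tokens_per_interaction, price_per_million):
    """Rough monthly API spend: total tokens times the per-token price.

    price_per_million is a blended input/output price in dollars.
    """
    total_tokens = interactions * tokens_per_interaction
    return total_tokens / 1_000_000 * price_per_million

# 10K interactions/month at ~4K tokens each, GPT-4 Turbo at $10-30 per 1M tokens
low = monthly_api_cost(10_000, 4_000, 10)   # lower bound
high = monthly_api_cost(10_000, 4_000, 30)  # upper bound
```

Swap in your own volumes and a blended price that reflects your actual input/output token split.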
Prompt Architecture
The system prompt is the foundation. Structure it carefully.
Core Components
# Role
You are [agent name], an AI assistant that [core purpose].
# Capabilities
You have access to these tools:
- [tool 1]: [description, when to use]
- [tool 2]: [description, when to use]
# Constraints
- Never [hard boundaries]
- Always [required behaviors]
- When uncertain, [fallback behavior]
# Process
1. [Step-by-step workflow]
2. [Decision points]
3. [Escalation criteria]
# Communication Style
- [Tone guidelines]
- [Formatting preferences]
Dynamic Context
Added to each request:
- Current conversation history
- Retrieved relevant information
- Current state/variables
- User context (if available)
Token Budget Management
With limited context windows, you need strategies:
- Conversation summarisation: periodically compress old conversation turns
- Sliding window: keep N recent turns, summarise the rest
- Selective retrieval: only include retrieved docs that score above a threshold
- Dynamic few-shot: include examples relevant to the current query
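The sliding-window approach is easy to sketch. Here `summarise` is a stand-in for an LLM summarisation call, and the message format is the usual role/content dict:

```python
def build_history(turns, summarise, window=6):
    """Sliding-window context: keep the last `window` turns verbatim and
    compress everything older into a single summary message.

    `summarise` is a stand-in for an LLM summarisation call.
    """
    if len(turns) <= window:
        return list(turns)
    older, recent = turns[:-window], turns[-window:]
    summary = {"role": "system", "content": summarise(older)}
    return [summary] + list(recent)
```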
Tool Design
Tools are how agents interact with the world.
Tool Definition
Each tool needs:
{
  "name": "update_customer_address",
  "description": "Updates a customer's shipping address. Use when the customer requests an address change.",
  "parameters": {
    "type": "object",
    "properties": {
      "customer_id": {
        "type": "string",
        "description": "The unique customer identifier"
      },
      "new_address": {
        "type": "object",
        "properties": {
          "street": {"type": "string"},
          "city": {"type": "string"},
          "state": {"type": "string"},
          "postcode": {"type": "string"}
        },
        "required": ["street", "city", "state", "postcode"]
      }
    },
    "required": ["customer_id", "new_address"]
  }
}
Key principles:
- Clear, specific descriptions
- Strongly typed parameters
- Validation constraints where possible
- Examples in description if helpful
Tool Execution Layer
The tool gateway handles:
def execute_tool(tool_name: str, parameters: dict) -> ToolResult:
    # 1. Validate parameters
    validate_parameters(tool_name, parameters)

    # 2. Check permissions
    check_permissions(current_user, tool_name, parameters)

    # 3. Log the attempt
    log_tool_call(tool_name, parameters)

    # 4. Execute with timeout
    with timeout(TOOL_TIMEOUT):
        result = tools[tool_name].execute(parameters)

    # 5. Log the result
    log_tool_result(tool_name, result)

    # 6. Return structured result
    return ToolResult(
        success=result.success,
        data=result.data,
        error=result.error,
    )
Never let the LLM execute arbitrary code or call APIs directly.
Memory Systems
Agents need memory beyond single interactions.
Short-Term Memory
The current conversation. Usually kept in full within context window.
Working Memory
Task-specific state during a multi-turn interaction:
class ConversationState:
    intent: str
    collected_data: dict
    steps_completed: list
    pending_actions: list
    confidence: float
Persisted between turns and included in the prompt.
Long-Term Memory
Persistent information about users, preferences, past interactions:
# Retrieve relevant memory
memories = memory_store.search(
    user_id=user.id,
    query=current_query,
    limit=5,
    recency_weight=0.3,
)

# Include in prompt
context += format_memories(memories)
Vector databases (Pinecone, Weaviate, Qdrant) work well for this.
Memory Updates
After each interaction:
# Extract memorable information
important_info = extract_memories(conversation)

# Store with metadata
for info in important_info:
    memory_store.add(
        user_id=user.id,
        content=info.content,
        embedding=embed(info.content),
        metadata={
            "timestamp": now(),
            "conversation_id": conv_id,
            "type": info.type,
        },
    )
Be selective. Not everything should be remembered.
Retrieval Architecture
RAG (Retrieval-Augmented Generation) gives agents access to knowledge.
Indexing Pipeline
Documents → Chunking → Embedding → Vector Store
                ↓
        Metadata Extraction
Chunking strategy matters:
- Too small: loses context
- Too large: dilutes relevance
- Overlapping windows often help
We typically use 500-1000 token chunks with 100 token overlap.
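An overlapping chunker is a few lines once the text is tokenised. This sketch works on a token list (tokenisation itself is model-specific and omitted here); the 800/100 defaults reflect the range above:

```python
def chunk_tokens(tokens, size=800, overlap=100):
    """Split a token list into overlapping chunks.

    Each chunk is up to `size` tokens; consecutive chunks share
    `overlap` tokens, so the stride between chunk starts is size - overlap.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last chunk reached the end
            break
    return chunks
```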
Retrieval Pipeline
Query → Query Embedding → Vector Search → Reranking → Selection
                                              ↓
                                      Diversity Filter
                                              ↓
                                       Recency Filter
Don't just take top-K by similarity. Consider:
- Result diversity (avoid redundancy)
- Source authority (official docs > wiki > emails)
- Recency (for time-sensitive info)
Hybrid Search
Combine vector search with keyword search:
vector_results = vector_db.search(query_embedding, k=20)
keyword_results = keyword_db.search(query_text, k=20)
# Merge with reciprocal rank fusion
combined = reciprocal_rank_fusion(vector_results, keyword_results)
return combined[:5]
Often outperforms pure vector search.
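Reciprocal rank fusion itself is tiny: each document scores the sum of 1/(k + rank) across the result lists it appears in, so items ranked well by both searches float to the top. A minimal version (k=60 is the commonly used constant):

```python
def reciprocal_rank_fusion(*rankings, k=60):
    """Merge ranked result lists: each doc scores sum(1 / (k + rank))
    across every list it appears in; higher total wins."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because it only uses ranks, not raw scores, it sidesteps the problem of vector similarities and BM25 scores living on incomparable scales.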
Orchestration Patterns
Simple Chain
For straightforward tasks:
User Input → LLM Reasoning → Tool Call → Response
ReAct Pattern
Reasoning and acting interleaved:
while not done:
    thought = llm.think(observation)
    action = llm.decide_action(thought)
    observation = execute(action)
    if is_final_answer(observation):
        done = True
Good for multi-step reasoning with tools.
Plan-and-Execute
For complex tasks:
1. LLM creates full plan
2. Execute each step
3. Re-plan if needed
Better for longer tasks where upfront planning helps.
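The three steps above can be sketched as a loop that keeps completed work and re-plans only the remainder. `plan_fn`, `execute_step`, and `replan_fn` are hypothetical stand-ins for the LLM and tool calls; `execute_step` returns an `(ok, output)` pair:

```python
def plan_and_execute(task, plan_fn, execute_step, replan_fn, max_replans=2):
    """Plan-and-execute sketch: draft the full plan up front, run each
    step, and re-plan the remaining work when a step fails."""
    plan = plan_fn(task)
    results, replans, i = [], 0, 0
    while i < len(plan):
        ok, output = execute_step(plan[i])
        if ok:
            results.append(output)
            i += 1
        elif replans < max_replans:
            # Keep completed steps; ask for a fresh plan for the remainder
            plan = plan[:i] + replan_fn(task, plan[i:], output)
            replans += 1
        else:
            raise RuntimeError("step kept failing after re-planning")
    return results
```

The `max_replans` cap matters: without it, a task the agent cannot actually complete turns into an infinite plan-fail-replan cycle.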
Multi-Agent
Specialised agents coordinating:
                    ┌─ Research Agent
Orchestrator Agent ─┼─ Writer Agent
                    └─ Reviewer Agent
Use when tasks have distinct phases requiring different capabilities.
Error Handling
Agents fail. Design for it.
Graceful Degradation
try:
    result = agent.run(task)
except ToolError as e:
    # Tool failed - acknowledge and try alternative
    return fallback_response(e)
except ModelError:
    # LLM failed - retry with backoff
    return retry_with_backoff(task)
except TimeoutError:
    # Taking too long - partial response + escalate
    return partial_response_with_escalation()
Confidence Calibration
Have the agent assess its own confidence:
Before taking action, rate your confidence 1-10.
- Below 6: Ask for clarification
- 6-8: Proceed but note uncertainty
- Above 8: Proceed confidently
Not perfect, but helps catch uncertain responses.
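On the application side, those thresholds become a trivial routing function. The action names here are hypothetical; map them to whatever your orchestrator actually does:

```python
def route_by_confidence(confidence: int) -> str:
    """Map the agent's self-reported 1-10 confidence to a next step,
    using the thresholds from the prompt above."""
    if confidence < 6:
        return "ask_for_clarification"
    if confidence <= 8:
        return "proceed_with_caveat"
    return "proceed"
```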
Observability
You cannot improve what you cannot observe.
Logging
Log everything:
- Full prompts and responses
- Tool calls and results
- State changes
- Timing information
- Token counts
Metrics
Track:
- Task completion rate
- Average turns to completion
- Tool use patterns
- Error rates by type
- Latency distribution
- Cost per interaction
Debugging
Build tools for:
- Replaying conversations
- Comparing prompt variations
- Tracing decision paths
- Identifying failure patterns
Production Considerations
Rate Limiting
Protect against:
- API cost explosions
- Runaway agent loops
- Abuse
Implement per-user, per-minute, and per-day limits.
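A fixed-window counter is the simplest version; a sketch, keeping state in process memory for clarity (production systems typically layer per-minute and per-day windows and back the counters with Redis or similar):

```python
class RateLimiter:
    """Fixed-window per-user rate limiter (illustrative sketch)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # (user_id, window_index) -> request count

    def allow(self, user_id: str, now: float) -> bool:
        """Record a request at time `now`; return False once over the limit."""
        key = (user_id, int(now // self.window))
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit
```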
Caching
Cache where possible:
- Embedding calculations
- Retrieval results
- Common responses
Significantly reduces cost and latency.
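For embeddings, a content-addressed cache is enough: key on a hash of the text so identical inputs never hit the embedding API twice. A sketch, with `embed_fn` standing in for the real embedding call:

```python
import hashlib

class EmbeddingCache:
    """Content-addressed embedding cache (illustrative sketch)."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}  # sha256(text) -> embedding vector
        self.hits = 0

    def get(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.store:
            self.hits += 1          # cache hit: no API call
        else:
            self.store[key] = self.embed_fn(text)
        return self.store[key]
```

In production you'd bound the cache size and persist it, but the shape is the same.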
Fallback Models
If primary model is unavailable:
models = [
    ("gpt-4-turbo", primary_client),
    ("claude-3-sonnet", fallback_client),
    ("llama-3-70b", local_client),
]

for model, client in models:
    try:
        return client.complete(prompt)
    except ServiceUnavailable:
        continue

raise AllModelsUnavailable()
Further Reading
This is necessarily a survey. For deeper dives:
- LangChain / LlamaIndex docs for framework approaches
- Anthropic's tool use documentation
- OpenAI's function calling guide
- Our architecture post for enterprise patterns
We build production AI agents with these patterns. Happy to discuss your technical challenges. As Brisbane-based consultants, we bring deep technical expertise to every project.