Building Better AI Agents: Observability and Evaluation

Harrison Chase · February 10, 2026 · 47 min

Understanding Agent Observability

💡 Traditional software fails by crashing; AI agents fail through reasoning errors, which calls for new observability and evaluation methods.
🧠 Traditional software is deterministic, but LLM apps and agents are non-deterministic, leading to emergent behaviors that are hard to predict from code alone.
🔑 Debugging agents shifts from inspecting stack traces to analyzing reasoning failures, context engineering, and LLM inputs/outputs.

Core Observability Primitives

🎯 A run represents a single execution step, such as one LLM call, capturing its inputs, outputs, and parameters.
📈 Traces are sequences of runs, representing a full agent execution from start to finish without human intervention.
💬 Threads group multiple traces, incorporating human intervention (e.g., user messages in a chatbot conversation) to capture full interactions.
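
The run / trace / thread hierarchy above can be sketched as a small data model. This is a minimal illustration, not LangSmith's actual schema: the class names, field names, and the `run_count` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Run:
    """One execution step, e.g. a single LLM call or tool call (hypothetical shape)."""
    name: str
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    params: dict[str, Any] = field(default_factory=dict)


@dataclass
class Trace:
    """An ordered sequence of runs: one agent execution with no human input in between."""
    runs: list[Run] = field(default_factory=list)


@dataclass
class Thread:
    """Several traces separated by human messages, e.g. one chatbot conversation."""
    thread_id: str
    traces: list[Trace] = field(default_factory=list)

    def run_count(self) -> int:
        # Total number of runs across every trace in the thread.
        return sum(len(trace.runs) for trace in self.traces)


# A two-turn conversation: the user asks a question, then follows up.
first_turn = Trace(runs=[
    Run("llm", {"prompt": "weather in Paris?"}, {"tool_call": "get_weather"}, {"temperature": 0.0}),
    Run("get_weather", {"city": "Paris"}, {"forecast": "sunny"}),
    Run("llm", {"prompt": "summarize forecast"}, {"text": "It is sunny in Paris."}),
])
second_turn = Trace(runs=[
    Run("llm", {"prompt": "and tomorrow?"}, {"text": "I need to check."}),
])
thread = Thread("conv-1", [first_turn, second_turn])
print(thread.run_count())
```

Keeping each level as a plain container makes it easy to evaluate at any granularity: one `Run`, one `Trace`, or a whole `Thread`.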

Agent Evaluation Strategies

✅ Evaluation tests reasoning across different dimensions: single-step (individual LLM calls), full-turn (complete agent runs), and multi-turn (conversational threads).
🚀 Shipping to production early is crucial for discovering what to test, as user behavior and inputs inform the creation of effective test sets.
📊 Offline evaluation uses production traces to build datasets for benchmarking, while online evaluation flags issues in real-time production traffic, such as unusual tool calls or efficiency problems.
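
The three evaluation granularities and an online check might look roughly like the sketch below. Everything here is hypothetical: the dict-based step format, the function names, and the thresholds are assumptions for illustration, and real evaluators often use an LLM as a judge rather than exact matching.

```python
# Assumed shapes: a step is a dict describing one LLM call, a turn is a list of
# steps (one full agent run), and a conversation is a list of turns.

def eval_single_step(step: dict, expected_tool: str) -> bool:
    """Single-step: did this one LLM call choose the expected tool?"""
    return step.get("tool") == expected_tool


def eval_full_turn(turn: list, expected_answer: str) -> bool:
    """Full-turn: did the complete agent run end with the expected answer?"""
    return bool(turn) and turn[-1].get("output") == expected_answer


def eval_multi_turn(conversation: list, must_mention: str) -> bool:
    """Multi-turn: does the last turn's final answer still mention an earlier detail?"""
    if not conversation or not conversation[-1]:
        return False
    final_output = conversation[-1][-1].get("output") or ""
    return must_mention in final_output


def flag_online(turn: list, allowed_tools: set, max_steps: int) -> list:
    """Online check: flag unusual tool calls and inefficient (overly long) runs."""
    flags = []
    for step in turn:
        tool = step.get("tool")
        if tool is not None and tool not in allowed_tools:
            flags.append(f"unusual tool call: {tool}")
    if len(turn) > max_steps:
        flags.append(f"too many steps: {len(turn)}")
    return flags


turn1 = [{"tool": "search", "output": None}, {"tool": None, "output": "Paris"}]
turn2 = [
    {"tool": "delete_files", "output": None},
    {"tool": None, "output": "Done, and the capital is still Paris"},
]

print(eval_single_step(turn1[0], "search"))
print(eval_full_turn(turn1, "Paris"))
print(eval_multi_turn([turn1, turn2], "Paris"))
print(flag_online(turn2, allowed_tools={"search"}, max_steps=5))
```

The offline functions compare against known expectations from a dataset, while `flag_online` needs no reference answer, which is what lets it run on live production traffic.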

Practical Application with LangSmith

🛠️ LangSmith facilitates logging all agent runs, enabling manual debugging by inspecting traces and individual runs.
🔬 Automated evaluation can be performed at scale using tools like the LangSmith Trace Analyzer, which processes large datasets of traces to categorize errors and propose fixes.
💡 Observability is integral to the agent development loop, providing insights into agent behavior and powering continuous improvement through iterative testing and prompt optimization.
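
To make the idea of categorizing errors across many traces concrete, here is a toy sketch. It is not the LangSmith Trace Analyzer or any LangSmith API: the trace dicts, the keyword rules in `RULES`, and the function names are all invented for illustration, and a real analyzer would likely use an LLM rather than substring matching.

```python
from collections import Counter

# Hypothetical rules mapping a substring of an error message to a category.
RULES = {
    "timeout": "tool_timeout",
    "unknown tool": "bad_tool_call",
    "context length": "context_overflow",
}


def categorize(error: str) -> str:
    """Return the first matching category, or 'uncategorized' if no rule matches."""
    lowered = error.lower()
    for needle, category in RULES.items():
        if needle in lowered:
            return category
    return "uncategorized"


def summarize(traces: list) -> Counter:
    """Count error categories across traces, skipping traces that succeeded."""
    return Counter(categorize(t["error"]) for t in traces if t.get("error"))


# Made-up traces standing in for logged production data.
traces = [
    {"id": "t1", "error": "Timeout calling search API"},
    {"id": "t2", "error": None},
    {"id": "t3", "error": "Unknown tool: delete_files"},
    {"id": "t4", "error": "timeout calling weather API"},
    {"id": "t5", "error": "model refused"},
]

summary = summarize(traces)
for category, count in summary.most_common():
    print(category, count)
```

A per-category count like this points at where to iterate first, for example tightening a tool description when `bad_tool_call` dominates.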

What’s Discussed

AI agents, Observability, Evaluation, LangSmith, LLM apps, Debugging reasoning, Agent runs, Traces, Threads, Offline evaluation, Online evaluation, Production traces, Deep Agents, Prompt optimization, Coding agents
