Error Handling¶
Every error on the mesh is structured. No raw tracebacks, no mystery strings.
from openagentmesh import AgentMesh, MeshError

mesh = AgentMesh("nats://localhost:4222")

async with mesh:
    try:
        result = await mesh.call("summarizer", {"text": 42})  # wrong type
    except MeshError as e:
        print(e.code)        # "handler_error"
        print(e.message)     # "Field 'text' expected str, got int"
        print(e.agent)       # "summarizer"
        print(e.request_id)  # "a1b2c3..."
Error Envelope¶
When an agent returns X-Mesh-Status: error, the response body is always this shape:
{
"code": "handler_error",
"message": "Field 'text' expected str, got int",
"agent": "summarizer",
"request_id": "a1b2c3d4",
"details": {}
}
The details field carries extra context when available.
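As an illustration of working with this shape outside the SDK (for example in a monitoring script), the sketch below decodes an envelope into a small dataclass. `ErrorEnvelope` and `parse_envelope` are hypothetical names introduced here, not part of openagentmesh, and the fallback for a missing `details` key is a defensive assumption.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ErrorEnvelope:
    """Hypothetical container mirroring the envelope fields shown above."""
    code: str
    message: str
    agent: str
    request_id: str
    details: dict[str, Any] = field(default_factory=dict)


def parse_envelope(body: bytes) -> ErrorEnvelope:
    """Decode a JSON error body into an ErrorEnvelope."""
    raw = json.loads(body)
    return ErrorEnvelope(
        code=raw["code"],
        message=raw["message"],
        agent=raw["agent"],
        request_id=raw["request_id"],
        # Assumption: treat a missing or null "details" as an empty dict.
        details=raw.get("details") or {},
    )


# Example body built from the envelope shown above.
body = json.dumps({
    "code": "handler_error",
    "message": "Field 'text' expected str, got int",
    "agent": "summarizer",
    "request_id": "a1b2c3d4",
    "details": {},
}).encode()

envelope = parse_envelope(body)
```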
Error Codes¶
| Code | When | Exception class |
|---|---|---|
| handler_error | The agent's handler function raised an exception (including Pydantic validation failures) | MeshError |
| timeout | The agent didn't respond within the timeout window | MeshTimeout |
| not_found | No agent registered with that name | MeshError |
| invocation_mismatch | Wrong verb for the agent's capabilities | InvocationMismatch |
| chunk_sequence_error | Stream chunks arrived out of order | ChunkSequenceError |
| connection_failed | Could not connect to the mesh | MeshError |
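To make the code-to-class relationship in the table concrete, here is a minimal sketch using stand-in classes defined locally. These are not the SDK's real classes, and `exception_for` is a hypothetical helper. It only shows that codes with a dedicated class map to a subclass while every other code falls back to the base `MeshError`.

```python
class MeshError(Exception):
    """Stand-in for the SDK's base error (not the real class)."""

    def __init__(self, code, message, agent=None, request_id=None):
        super().__init__(message)
        self.code = code
        self.message = message
        self.agent = agent
        self.request_id = request_id


class MeshTimeout(MeshError):
    """Stand-in subclass for code "timeout"."""


class InvocationMismatch(MeshError):
    """Stand-in subclass for code "invocation_mismatch"."""


class ChunkSequenceError(MeshError):
    """Stand-in subclass for code "chunk_sequence_error"."""


# Codes that have a dedicated class in the table; all others use MeshError.
EXCEPTION_FOR_CODE = {
    "timeout": MeshTimeout,
    "invocation_mismatch": InvocationMismatch,
    "chunk_sequence_error": ChunkSequenceError,
}


def exception_for(code, message, agent=None, request_id=None):
    """Hypothetical helper: build the exception class listed for a code."""
    cls = EXCEPTION_FOR_CODE.get(code, MeshError)
    return cls(code, message, agent, request_id)


timeout_error = exception_for("timeout", "no reply", agent="summarizer")
handler_error = exception_for("handler_error", "boom", agent="summarizer")
```

Because every stand-in subclass derives from `MeshError`, a single `except MeshError` clause catches both objects above.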
Error Subclasses¶
Invocation mismatches have a dedicated MeshError subclass. The message describes the specific mismatch and suggests the correct verb:
from openagentmesh import InvocationMismatch

try:
    result = await mesh.call("price-feed", {"symbol": "AAPL"})
except InvocationMismatch as e:
    print(e.message)
    # "Agent 'price-feed' is a publisher and cannot be called. Subscribe to its events instead"
All subclasses inherit from MeshError, so except MeshError: still catches everything.
The SDK checks capabilities before sending any NATS message. On connect, the catalog cache is seeded from the current KV snapshot, so the check works from the first invocation even for pure-caller processes.
Mismatch scenarios¶
| Verb | Target shape | Message |
|---|---|---|
| call() on Publisher | invocable=false, streaming=true | "is a publisher and cannot be called. Subscribe to its events instead" |
| call() on Watcher | invocable=false, streaming=false | "is a background task and cannot be called" |
| call() on Streamer | invocable=true, streaming=true | "is streaming-only. Use stream() instead" |
| stream() on Responder | invocable=true, streaming=false | "does not support streaming. Use call() instead" |
| stream() on Publisher | invocable=false, streaming=true | "is a publisher and cannot be streamed. Subscribe to its events instead" |
| send() on Publisher | invocable=false, streaming=true | "is a publisher and cannot be sent to. Subscribe to its events instead" |
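The table can be read as a lookup keyed on the verb and the two capability flags. The sketch below encodes exactly the six rows above and nothing else. `check_invocation` is a hypothetical function, not the SDK's internal check, and it returns `None` for any combination the table does not list, which should not be read as "allowed".

```python
from typing import Optional

# (verb, invocable, streaming) -> message suffix, copied from the table above.
MISMATCHES = {
    ("call", False, True): "is a publisher and cannot be called. Subscribe to its events instead",
    ("call", False, False): "is a background task and cannot be called",
    ("call", True, True): "is streaming-only. Use stream() instead",
    ("stream", True, False): "does not support streaming. Use call() instead",
    ("stream", False, True): "is a publisher and cannot be streamed. Subscribe to its events instead",
    ("send", False, True): "is a publisher and cannot be sent to. Subscribe to its events instead",
}


def check_invocation(verb: str, agent: str, invocable: bool, streaming: bool) -> Optional[str]:
    """Return the mismatch message for a listed row, or None if the row is not listed."""
    suffix = MISMATCHES.get((verb, invocable, streaming))
    if suffix is None:
        return None
    return f"Agent '{agent}' {suffix}"


publisher_message = check_invocation("call", "price-feed", invocable=False, streaming=True)
responder_message = check_invocation("call", "summarizer", invocable=True, streaming=False)
```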
How Errors Propagate¶
The mesh catches exceptions so callers don't have to guess what went wrong.
Non-streaming invocation¶
- The caller sends a request via mesh.call().
- Capabilities are checked before the request is sent. If the agent is non-invocable or streaming-only, InvocationMismatch is raised locally (no round trip).
- The agent-side runtime validates and deserializes the payload, then calls the handler. If validation fails or the handler raises an exception, the mesh wraps it in the error envelope with code handler_error and returns it to the caller.
- The caller receives a structured MeshError, not a raw Python exception.
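The agent-side step in the list above can be sketched as a wrapper around the handler. This is a simplified model, not the SDK's runtime: `run_handler` and `summarize` are hypothetical, the validation is a plain `isinstance` check rather than Pydantic, and the `"ok"` status value for the success path is an assumption (the source only names `X-Mesh-Status: error`).

```python
import asyncio


async def summarize(payload):
    """Hypothetical handler with a hand-rolled type check."""
    text = payload["text"]
    if not isinstance(text, str):
        raise TypeError(f"Field 'text' expected str, got {type(text).__name__}")
    return {"summary": text[:20]}


async def run_handler(handler, payload, agent, request_id):
    """Call the handler; convert any exception into (headers, error envelope)."""
    try:
        result = await handler(payload)
    except Exception as exc:
        envelope = {
            "code": "handler_error",
            "message": str(exc),
            "agent": agent,
            "request_id": request_id,
            "details": {},
        }
        return {"X-Mesh-Status": "error"}, envelope
    # Assumption: the success status value is not given in the source.
    return {"X-Mesh-Status": "ok"}, result


headers, body = asyncio.run(
    run_handler(summarize, {"text": 42}, "summarizer", "a1b2c3d4")
)
```

The caller never sees the `TypeError` itself, only the envelope built from it.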
Streaming invocation¶
- The caller sends a request via mesh.stream().
- Capabilities are checked before the request is sent. If the target agent is non-streaming or non-invocable, InvocationMismatch is raised locally (no round trip).
- If the handler's async generator raises mid-stream (after yielding some chunks), the error is published to the stream subject. The caller receives all chunks up to the failure, then gets the MeshError.
Handler authors don't need to catch their own errors for the caller's sake. The mesh does it. But you can still raise specific exceptions if you want to control the error message.
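The mid-stream behavior described above can be simulated in-process. In this sketch, `relay` stands in for the mesh, `MeshError` is a local stand-in class, and `flaky_handler` is a hypothetical handler. No NATS traffic is involved, so it models only the ordering guarantee: chunks first, then the error.

```python
import asyncio


class MeshError(Exception):
    """Stand-in for the SDK's error class (not the real one)."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code
        self.message = message


async def flaky_handler(payload):
    """Hypothetical handler: yields two chunks, then fails."""
    yield "chunk-1"
    yield "chunk-2"
    raise RuntimeError("model backend went away")


async def relay(handler, payload):
    """Stand-in for the mesh: forward chunks, convert a handler exception."""
    try:
        async for chunk in handler(payload):
            yield chunk
    except Exception as exc:
        raise MeshError("handler_error", str(exc)) from exc


async def consume():
    """Caller side: keep partial data, then capture the structured error."""
    chunks, error = [], None
    try:
        async for chunk in relay(flaky_handler, {}):
            chunks.append(chunk)
    except MeshError as e:
        error = e
    return chunks, error


chunks, error = asyncio.run(consume())
```

The handler raised a plain `RuntimeError` with a message of its choosing; the caller received that message inside a structured error without the handler catching anything itself.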
Dead-Letter Subject¶
Every error is also published to the agent's dead-letter subject:
mesh.errors.{channel}.{name}
Subscribe to this subject for monitoring, alerting, or debugging. The payload is the same error envelope.
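import json
import logging

logger = logging.getLogger(__name__)

async def on_error(msg):
    error = json.loads(msg.data)
    logger.warning(f"{error['agent']}: {error['code']} - {error['message']}")

await nc.subscribe("mesh.errors.nlp.summarizer", cb=on_error)  # nc: an already-connected NATS client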
This is a passive stream. It doesn't affect the caller's response.
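For monitoring, a subscriber usually needs the subject name and some aggregation over the envelopes it receives. The sketch below is one possible shape, with hypothetical helpers `dead_letter_subject` and `record_error`. It assumes that in `mesh.errors.nlp.summarizer` the channel is `nlp` and the agent name is `summarizer`.

```python
import json
from collections import Counter


def dead_letter_subject(channel: str, name: str) -> str:
    """Build the dead-letter subject from a channel and an agent name."""
    return f"mesh.errors.{channel}.{name}"


# Running tally of errors keyed by (agent, code).
error_counts = Counter()


def record_error(data: bytes) -> None:
    """Tally one dead-letter payload (a JSON error envelope)."""
    error = json.loads(data)
    error_counts[(error["agent"], error["code"])] += 1


subject = dead_letter_subject("nlp", "summarizer")

# Feed two sample envelopes, as a subscription callback might.
sample = json.dumps({
    "code": "handler_error",
    "message": "Field 'text' expected str, got int",
    "agent": "summarizer",
    "request_id": "a1b2c3d4",
    "details": {},
}).encode()
record_error(sample)
record_error(sample)
```

A callback like `on_error` could call `record_error(msg.data)` to drive alerting thresholds.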
Failure Modes¶
There are four ways an agent can leave the mesh. Each produces a different caller experience.
The four failure modes¶
| Mode | Cause | What the caller sees | Detection speed |
|---|---|---|---|
| Graceful shutdown | Context manager exit or process interrupt | MeshError(code="not_found") on next call (agent already deregistered) | Instant |
| Handler exception | Bug in handler code | MeshError(code="handler_error") with the exception message | Instant |
| Process crash | OOM kill, SIGKILL, unhandled panic | MeshError(code="timeout") after the timeout window expires | Timeout (default varies by agent type) |
| Network partition | Network failure between agent and NATS | MeshError(code="timeout") after the timeout window expires | Timeout |
Handler exceptions (the common case)¶
When a handler raises an exception, the mesh catches it and:
- Wraps it in the error envelope with code handler_error
- Sends it back to the caller on the reply subject (for mesh.call()) or stream subject (for mesh.stream())
- Publishes it to the dead-letter subject mesh.errors.{channel}.{name} for observability
The caller always gets a structured MeshError, never a raw traceback.
Mid-stream failures¶
If a streaming handler yields some chunks then crashes:
chunks = []
try:
    async for chunk in mesh.stream("summarizer", payload):
        chunks.append(chunk)  # partial data is still delivered
except MeshError as e:
    print(f"Stream failed after {len(chunks)} chunks: {e.code}")
    # chunks contains everything received before the failure
Partial data has value (especially for LLM streaming). The error signals "no more is coming", not "discard what you have".
Process crashes and timeouts¶
When an agent process dies after accepting a request (crash, OOM, kill), no error reply is sent because the process is gone. The caller's mesh.call() or mesh.stream() waits until the timeout expires, then raises MeshError(code="timeout").
The timeout is set per call.
There is no way to distinguish "agent is slow" from "agent is dead" at the caller level today. Future versions may use death notices to detect agent death mid-request and fail fast.
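The indistinguishability of "slow" and "dead" can be demonstrated with plain asyncio. This sketch does not use the SDK: `MeshTimeout` is a local stand-in, `call_with_deadline` is a hypothetical wrapper built on `asyncio.wait_for`, and the two agents are simulated coroutines.

```python
import asyncio


class MeshTimeout(Exception):
    """Stand-in for the SDK's timeout error (not the real class)."""
    code = "timeout"


async def call_with_deadline(coro, timeout):
    """Await a reply, converting an expired deadline into MeshTimeout."""
    try:
        return await asyncio.wait_for(coro, timeout)
    except asyncio.TimeoutError:
        raise MeshTimeout(f"no reply within {timeout}s") from None


async def slow_agent():
    """Alive, but slower than the caller's deadline."""
    await asyncio.sleep(0.2)
    return "late reply"


async def dead_agent():
    """Never replies, like a crashed process."""
    await asyncio.Event().wait()


async def outcome(agent, timeout):
    """Return the reply, or the error code if the deadline expired."""
    try:
        return await call_with_deadline(agent(), timeout)
    except MeshTimeout as e:
        return e.code


async def main():
    return [await outcome(slow_agent, 0.05), await outcome(dead_agent, 0.05)]


results = asyncio.run(main())
```

Both calls end with the same `timeout` code, so the caller cannot tell which agent was merely slow. Raising the deadline rescues only the slow one.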
Choosing timeout values¶
- Tool agents (deterministic, fast): 1-5 seconds. A tool that hasn't responded in 5s is almost certainly dead.
- LLM agents (variable latency): 30-120 seconds. LLM calls can genuinely take this long.
- Human-in-the-loop: Use mesh.send() with async callbacks instead of mesh.call(). Don't block on humans.
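One way to apply these ranges consistently is a small helper that clamps an estimate into the suggested window. The sketch below is illustrative only: `pick_timeout` is a hypothetical function, the agent kind labels are chosen here, and the safety factor of 2 is an assumption rather than guidance from the SDK.

```python
# Suggested ranges in seconds, taken from the list above.
TIMEOUT_RANGES = {
    "tool": (1.0, 5.0),
    "llm": (30.0, 120.0),
}


def pick_timeout(kind: str, expected_latency: float) -> float:
    """Double the expected latency, then clamp it into the range for this kind."""
    low, high = TIMEOUT_RANGES[kind]
    # Assumption: 2x headroom over the expected latency.
    return min(max(expected_latency * 2, low), high)


tool_timeout = pick_timeout("tool", 0.1)   # fast tool, clamped up to the floor
llm_timeout = pick_timeout("llm", 100.0)   # slow model, clamped down to the ceiling
```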