TL;DR
LangGraph is the production standard for multi-agent AI orchestration in 2026. This post covers the reference architecture I use for mid-market deployments: the five-node pattern (router → specialists → tools → evaluator → synthesizer), typed state management, conditional routing, human-in-the-loop approval gates, and the checkpointing setup that makes it debuggable and fault-tolerant.
LangGraph is the framework used in production multi-agent systems at Uber, JPMorgan, LinkedIn, and Klarna. It's also the framework most engineers hit a wall with because the concepts — graphs, nodes, edges, state — are unfamiliar outside of computer science backgrounds.
This post maps those concepts to a concrete reference architecture. By the end, you should be able to design the node structure for any multi-agent use case and understand why the architecture choices are made.
Why LangGraph Won
The multi-agent framework landscape in 2024–2025 had three serious contenders: LangGraph, CrewAI, and AutoGen. All three can build multi-agent systems. They differ in where they put the control.
LangGraph won for production use cases because of three properties the others lack: explicit control flow (you can read the graph and understand exactly what the agent can do), inspectable state (every node receives a typed dict — no hidden message passing), and checkpointing (execution can pause, persist, and resume across human approval gates or failures). For enterprise systems that need audit trails and human oversight, these properties are essential.
Core Concepts in 4 Minutes
State
LangGraph state is a typed dictionary that flows through every node in the graph. Every node reads from it and writes to it. Nothing is hidden.
    from typing import Annotated, List, TypedDict

    from langgraph.graph.message import add_messages


    class AgentState(TypedDict):
        # Input
        user_query: str
        # Routing
        intent: str  # "lookup" | "analysis" | "action"
        confidence: float
        # Retrieval
        retrieved_docs: List[dict]
        # Messages: add_messages appends new messages and replaces existing ones by ID
        messages: Annotated[List, add_messages]
        # Output
        final_answer: str
        citations: List[str]
        # Control
        should_escalate: bool
        retry_count: int
        # Human-in-the-loop (read and written by the approval flow later in this post)
        pending_action: dict
        should_cancel: bool

Nodes
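Nodes are plain functions (not strictly pure, since they may call LLMs or external tools). They receive the current state and return a partial state update containing only the keys they changed. They can call LLMs, run tools, or set the fields that routing decisions depend on.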
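To make "partial state update" concrete, here is a minimal sketch of the merge behaviour in plain Python. It is a simplified model, not LangGraph's implementation: `apply_update` and the `reducers` mapping are hypothetical names introduced for illustration. Keys without a reducer are overwritten; keys with a reducer (like `messages` with `add_messages`) are combined with the existing value.

```python
from typing import Any, Callable, Dict

# Hypothetical helper: models how a node's partial update is folded into state.
Reducer = Callable[[Any, Any], Any]


def apply_update(state: Dict[str, Any], update: Dict[str, Any],
                 reducers: Dict[str, Reducer]) -> Dict[str, Any]:
    """Return a new state with `update` merged in; the input is not mutated."""
    merged = dict(state)
    for key, value in update.items():
        if key in reducers and key in merged:
            # Reducer keys combine old and new values (e.g. append messages).
            merged[key] = reducers[key](merged[key], value)
        else:
            # Plain keys are simply overwritten by the node's return value.
            merged[key] = value
    return merged


state = {"user_query": "refund status?", "intent": "", "messages": ["hi"]}
reducers = {"messages": lambda old, new: old + new}  # stand-in for add_messages

# A router-style node returns only the fields it changed.
state = apply_update(state, {"intent": "lookup", "confidence": 0.9}, reducers)
state = apply_update(state, {"messages": ["looking that up"]}, reducers)
print(state["intent"], state["messages"])
```

The practical consequence: a node that returns `{"intent": "lookup"}` leaves every other key untouched, so nodes never need to copy the whole state forward.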
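    import json

    from langchain_core.messages import HumanMessage, SystemMessage

    # `llm` (a chat model) and ROUTER_SYSTEM_PROMPT are assumed to be defined elsewhere.
    # The prompt must instruct the model to reply with JSON containing
    # "intent" and "confidence".


    def router_node(state: AgentState) -> dict:
        """Classifies intent and sets routing fields."""
        response = llm.invoke([
            SystemMessage(content=ROUTER_SYSTEM_PROMPT),
            HumanMessage(content=state["user_query"]),
        ])
        parsed = json.loads(response.content)
        return {
            "intent": parsed["intent"],
            "confidence": parsed["confidence"],
        }

Conditional Edges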
Edges define control flow. Conditional edges route to different nodes based on state.
    def route_after_router(state: AgentState) -> str:
        """Returns the name of the next node based on intent."""
        if state["confidence"] < 0.6:
            return "clarification_node"
        routing = {
            "lookup": "retrieval_agent",
            "analysis": "analysis_agent",
            "action": "action_agent",
        }
        return routing.get(state["intent"], "fallback_node")

The Reference Architecture
For mid-market enterprise deployments, the five-node pattern covers the majority of use cases cleanly:
router (Entry point): Classifies intent from user query. Sets routing fields. Routes to the appropriate specialist. Handles low-confidence cases with a clarification request.
specialist_agents (Domain experts): One node per task category. Each has a domain-specific system prompt and can invoke tools. Returns partial state with its findings. Multiple specialists can run in parallel via Send().
tool_nodes (Deterministic execution): External system calls such as CRM queries, database lookups, and API calls. Return structured results. Failures return structured error dicts, not exceptions.
evaluator (Quality gate): Validates specialist output. Checks faithfulness to retrieved context. Scores confidence. Routes back to specialist for retry if quality < threshold (max 2 retries).
synthesizer (Response assembly): Assembles final answer from multi-specialist outputs. Adds citations. Formats for the target interface (API JSON, Slack markdown, email prose). Returns final_answer.
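The "structured error dicts, not exceptions" rule for tool nodes can be sketched as a small wrapper. The names `run_tool` and `crm_lookup`, and the result keys (`ok`, `data`, `error`), are assumptions made for this example and are not part of LangGraph.

```python
from typing import Any, Callable, Dict


def run_tool(tool: Callable[..., Any], **kwargs: Any) -> Dict[str, Any]:
    """Call a tool and always return a dict the evaluator can inspect."""
    try:
        return {"ok": True, "data": tool(**kwargs), "error": None}
    except Exception as exc:  # broad on purpose: the graph must keep running
        return {
            "ok": False,
            "data": None,
            "error": {"type": type(exc).__name__, "message": str(exc)},
        }


def crm_lookup(account_id: str) -> Dict[str, str]:
    """Hypothetical CRM call standing in for a real external system."""
    accounts = {"a-1": {"name": "Acme", "tier": "gold"}}
    return accounts[account_id]  # raises KeyError for unknown accounts


found = run_tool(crm_lookup, account_id="a-1")
missing = run_tool(crm_lookup, account_id="a-404")
print(found["ok"], missing["error"]["type"])
```

With this shape, the evaluator and specialists branch on `ok` instead of wrapping every call in try/except, and the failure itself ends up in state where the checkpointer can persist it.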
Wiring It Together
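    from langgraph.checkpoint.memory import MemorySaver
    from langgraph.graph import END, START, StateGraph
    from langgraph.prebuilt import ToolNode, tools_condition

    # Node functions, route_after_eval, and the two tool lists are assumed to be
    # defined elsewhere. Node names match what route_after_router returns.
    builder = StateGraph(AgentState)

    # Add nodes
    builder.add_node("router", router_node)
    builder.add_node("clarification_node", clarification_node)
    builder.add_node("fallback_node", fallback_node)
    builder.add_node("retrieval_agent", retrieval_agent)
    builder.add_node("analysis_agent", analysis_agent)
    builder.add_node("action_agent", action_agent)
    # One ToolNode per specialist so results return to the agent that called them
    builder.add_node("retrieval_tools", ToolNode(retrieval_tools))
    builder.add_node("analysis_tools", ToolNode(analysis_tools))
    builder.add_node("evaluator", evaluator_node)
    builder.add_node("synthesizer", synthesizer_node)

    # Entry
    builder.add_edge(START, "router")

    # Conditional routing after router
    builder.add_conditional_edges("router", route_after_router)

    # Specialist → tools when the last message has tool calls, else → evaluator.
    # tools_condition returns "tools" or END, so both outcomes are remapped here.
    builder.add_conditional_edges(
        "retrieval_agent", tools_condition,
        {"tools": "retrieval_tools", END: "evaluator"},
    )
    builder.add_conditional_edges(
        "analysis_agent", tools_condition,
        {"tools": "analysis_tools", END: "evaluator"},
    )
    builder.add_edge("retrieval_tools", "retrieval_agent")  # Return to caller
    builder.add_edge("analysis_tools", "analysis_agent")

    # Evaluator: pass to synthesizer, or retry the specialist
    builder.add_conditional_edges("evaluator", route_after_eval)

    # Terminal edges (the clarification, fallback, and action exits are assumptions)
    builder.add_edge("synthesizer", END)
    builder.add_edge("clarification_node", END)
    builder.add_edge("fallback_node", END)
    builder.add_edge("action_agent", END)

    # Checkpointing for persistence + human-in-the-loop
    checkpointer = MemorySaver()  # Use PostgresSaver in production
    graph = builder.compile(
        checkpointer=checkpointer,
        interrupt_before=["action_agent"],  # Pause before destructive actions
    )

Human-in-the-Loop: The Pattern That Makes Enterprise Trust It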
Enterprise AI agents take actions — updating CRM records, sending emails, triggering workflows. These actions need human approval before execution. LangGraph's interrupt_before makes this a first-class concern, not a workaround.
    # display_for_approval and human_approved are placeholders for your review UI.

    # Step 1: Run until the approval gate
    config = {"configurable": {"thread_id": "session-123"}}
    graph.invoke({"user_query": user_input}, config)
    # For "action" intents the graph pauses at interrupt_before=["action_agent"]

    # Step 2: Show pending action to human
    snapshot = graph.get_state(config)
    if "action_agent" in snapshot.next:  # paused at the gate
        # pending_action must have been written to state by an earlier node
        pending_action = snapshot.values["pending_action"]
        display_for_approval(pending_action)

        # Step 3: Resume with human decision
        if human_approved:
            # Resume: graph continues from the interrupt point
            final = graph.invoke(None, config)
        else:
            # Cancel: update state before resuming; action_agent must check
            # should_cancel (declared in AgentState) and skip the action
            graph.update_state(config, {"should_cancel": True})
            final = graph.invoke(None, config)

The entire agent state, including the LLM messages, tool results, and intermediate decisions recorded in it, is persisted by the checkpointer across the human review pause. You get a complete audit trail for every action taken by the agent.
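For the cancel branch to be safe, the node behind the gate has to check the flag before doing anything. A minimal sketch of what that could look like follows; the body of `action_agent`, the `execute_action` helper, and the shape of `pending_action` are all assumptions for illustration, not code from a real deployment.

```python
from typing import Any, Dict, List

executed: List[Dict[str, Any]] = []  # stand-in for a real side-effect log


def execute_action(action: Dict[str, Any]) -> str:
    """Hypothetical side effect, e.g. a CRM write."""
    executed.append(action)
    return f"done: {action['type']}"


def action_agent(state: Dict[str, Any]) -> Dict[str, Any]:
    """Run the pending action unless the reviewer cancelled it."""
    if state.get("should_cancel", False):
        # Reviewer said no: skip the side effect and say so in the answer.
        return {"final_answer": "Action cancelled by reviewer."}
    result = execute_action(state["pending_action"])
    return {"final_answer": result}


pending = {"type": "update_crm", "record_id": "r-42"}
approved = action_agent({"pending_action": pending, "should_cancel": False})
cancelled = action_agent({"pending_action": pending, "should_cancel": True})
print(approved, cancelled, len(executed))
```

Putting the check inside the node, rather than relying on the caller to never resume a cancelled run, keeps the side effect guarded even if the resume logic changes later.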
Production Deployment Checklist
PostgresSaver or RedisSaver instead of MemorySaver: MemorySaver is in-process only and doesn't survive restarts.
interrupt_before on all nodes with external side effects: any node that calls write APIs, sends messages, or modifies records.
LangSmith tracing enabled: LANGCHAIN_TRACING_V2=true plus LANGCHAIN_API_KEY in env gives full graph traces with zero code changes.
Max retry count in state: prevents evaluator → specialist loops from running indefinitely.
Token budget per run: a hard ceiling on total tokens per graph invocation; log and alert as usage approaches it.
Typed state with validation: a Pydantic BaseModel state schema validates node inputs at runtime, catching type errors before they corrupt downstream nodes.
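Two of the items above, the retry cap and the token budget, can be enforced in the evaluator's routing function. This is a sketch under assumptions: `quality_score`, `tokens_used`, and `last_specialist` are hypothetical state fields that do not appear in the `AgentState` shown earlier, and the threshold and budget values are placeholders.

```python
from typing import Any, Dict

MAX_RETRIES = 2            # matches the "max 2 retries" rule for the evaluator
QUALITY_THRESHOLD = 0.7    # placeholder value
TOKEN_BUDGET = 50_000      # placeholder hard ceiling per graph invocation


def route_after_eval(state: Dict[str, Any]) -> str:
    """Pick the next node: retry the specialist, fall back, or synthesize."""
    if state["tokens_used"] >= TOKEN_BUDGET:
        return "fallback_node"   # budget exhausted: stop spending
    if state["quality_score"] >= QUALITY_THRESHOLD:
        return "synthesizer"
    if state["retry_count"] < MAX_RETRIES:
        return state["last_specialist"]  # e.g. "retrieval_agent"
    return "fallback_node"       # retries exhausted: give up cleanly


base = {"tokens_used": 1_200, "last_specialist": "retrieval_agent"}
print(route_after_eval({**base, "quality_score": 0.9, "retry_count": 0}))
print(route_after_eval({**base, "quality_score": 0.4, "retry_count": 1}))
print(route_after_eval({**base, "quality_score": 0.4, "retry_count": 2}))
```

For the cap to work, whichever node handles the retry has to increment `retry_count` in its returned update; the routing function only reads it.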
Frequently Asked
What is LangGraph and why is it used for multi-agent orchestration?
LangGraph is a framework for building stateful, multi-actor AI applications as directed graphs. Each node is a function; edges define control flow; state is a typed dictionary shared across all nodes. It's preferred for production systems because it makes control flow explicit, state inspectable, and execution checkpointable.
What is the difference between LangGraph and CrewAI?
LangGraph gives you explicit control over state, control flow, and checkpointing — it's a low-level orchestration primitive. CrewAI is a higher-level framework with opinionated agent roles and communication patterns. LangGraph is preferred for production systems where you need fine-grained control and debuggable execution traces. CrewAI is faster to prototype.
How do you add human-in-the-loop to a LangGraph agent?
LangGraph supports human-in-the-loop via the interrupt_before and interrupt_after parameters on the compiled graph. When the graph reaches a designated node, execution pauses and returns control to your application. You can present the pending action for human approval, then resume execution by calling graph.invoke(None, config) with a config that carries the same thread_id. State is persisted across the interruption, provided the graph was compiled with a checkpointer.
What checkpointer should I use in production LangGraph?
Use PostgresSaver or RedisSaver in production. MemorySaver is in-process only and does not survive restarts. PostgresSaver persists the full agent state to a Postgres database, enabling pause/resume across server restarts, cross-session history, and complete audit trails for human-in-the-loop workflows.