Building Fault-Tolerant AI Agents with Temporal and LangGraph

Building a proof-of-concept AI agent is easier than ever. With a few lines of LangChain code and an OpenAI API key, you can create a bot that browses the web, writes code, or plans a trip. However, moving that agent into a production environment—where it must handle network partitions, API rate limits, long-running human approvals, and server restarts—reveals a significant "reliability gap."

In a production setting, an agentic workflow isn't just a sequence of LLM calls; it is a long-running, stateful distributed system. If your server crashes in the middle of a 10-minute research task, does the agent restart from scratch (wasting tokens and time), or does it pick up exactly where it left off?

To build truly resilient AI systems, we need to combine the flexible reasoning of LangGraph with the industrial-grade durable execution of Temporal. This article explores how to architect these systems to be fault-tolerant and state-recoverable.

The Architecture of Agentic Reliability

Before diving into the implementation, we must distinguish between the two layers of our stack:

The Reasoning Layer (LangGraph): This defines the "brain." LangGraph excels at managing cyclic graphs, where an agent might loop back to a previous step based on an LLM's decision. It maintains a state object that evolves as the agent progresses.
The Execution Layer (Temporal): This defines the "body." Temporal ensures that the code actually runs to completion. It handles retries, timeouts, and state persistence. If a worker process dies, another worker replays the workflow's recorded event history and continues from the last recorded step, without losing progress.

By wrapping a LangGraph state machine inside a Temporal Workflow, we gain the ability to survive infrastructure failures while maintaining the complex logic required for autonomous agents.

Why LangGraph's Native Persistence Isn't Enough

LangGraph provides a Checkpointer interface that allows you to save the state of a graph to a database (like SQLite or Postgres). While this is excellent for basic state recovery, it doesn't solve several distributed systems problems:

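Durable Timers: If an agent needs to wait 24 hours for a user response, a standard Python process shouldn't just sleep(), because the timer is lost as soon as that process restarts. A Temporal timer is persisted by the Temporal service and still fires after the worker has been replaced.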
Automatic Retries with Backoff: LLM APIs are notorious for transient failures. While you can add retry logic inside your graph nodes, managing complex retry policies across multiple distributed steps is difficult.
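Side Effect Guarantees: If an agent sends an email and then crashes before saving its state, you risk sending the email twice upon restart. Temporal records the result of each completed Activity, so a finished step is not re-run during recovery. An Activity that fails partway through can still be retried, however, so the side effect itself should be idempotent.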
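One common way to make a side effect such as the email above safe to retry is an idempotency key. The following is a minimal in-memory sketch of that idea, not Temporal's API: send_email_once, _sent_keys, outbox, and the key format are hypothetical names, and a real system would keep the keys in a durable store or rely on the email provider's own deduplication.

```python
from typing import List, Set

# Hypothetical stand-ins: a real system would use a durable store
# (for example a database table with a unique constraint) instead of a set.
_sent_keys: Set[str] = set()
outbox: List[str] = []


def send_email_once(idempotency_key: str, recipient: str, body: str) -> bool:
    """Send the email unless this key was already processed.

    Returns True if the email was sent, False if it was a duplicate.
    """
    if idempotency_key in _sent_keys:
        return False  # a retry or a restart: skip the side effect
    outbox.append(f"to={recipient}: {body}")  # the actual side effect
    _sent_keys.add(idempotency_key)
    return True


# Derive the key from stable identifiers so a retry produces the same key.
key = "workflow-123/notify-user"
first = send_email_once(key, "user@example.com", "Your report is ready")
second = send_email_once(key, "user@example.com", "Your report is ready")
```

The second call is a no-op because it reuses the key of the first, which is the behavior a retried step needs.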

Pattern: The Temporal-LangGraph Wrapper

The most effective way to integrate these technologies is to treat the entire LangGraph execution as a series of Temporal Activities or as a single Workflow that yields to Temporal for durable steps.

1. Defining the State

In LangGraph, state is typically a TypedDict. In Temporal, this state must be serializable so it can be persisted in Temporal's History Service.

from typing import TypedDict, List

class AgentState(TypedDict):
    task: str
    plan: List[str]
    steps_completed: int
    final_report: str
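A quick way to check that a state object is serializable is a JSON round trip. The sketch below assumes a JSON-based payload encoding and uses made-up example values; it only exercises the standard library and does not call Temporal.

```python
import json
from typing import List, TypedDict


class AgentState(TypedDict):
    task: str
    plan: List[str]
    steps_completed: int
    final_report: str


state: AgentState = {
    "task": "summarize quarterly sales",
    "plan": ["collect data", "write summary"],
    "steps_completed": 0,
    "final_report": "",
}

# A TypedDict is a plain dict at runtime, so it survives a JSON round trip
# as long as every value is itself JSON-compatible.
encoded = json.dumps(state)
decoded = json.loads(encoded)
```

Values such as open file handles, clients, or large binary buffers would fail this check and are better kept out of the state.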

2. The LangGraph Logic

We define our nodes as usual. However, we keep these nodes free of direct side effects: they focus on the logic and LLM interaction, and any external API call that requires high reliability is moved into a Temporal Activity later.

from langgraph.graph import StateGraph, END

def planner_node(state: AgentState):
    # LLM logic to create a plan
    return {"plan": ["step1", "step2"], "steps_completed": 0}

def executor_node(state: AgentState):
    # LLM logic to execute a step
    return {"steps_completed": state["steps_completed"] + 1}

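# Named graph_builder so it does not collide with temporalio's `workflow` module
graph_builder = StateGraph(AgentState)
graph_builder.add_node("planner", planner_node)
graph_builder.add_node("executor", executor_node)
graph_builder.set_entry_point("planner")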
# ... add edges and logic ...
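The elided edge logic usually includes a routing function that decides whether the agent loops or stops. A minimal sketch of such a function follows; route_after_executor and the "done" label are hypothetical names, and the sample states are made up for illustration.

```python
from typing import List, TypedDict


class AgentState(TypedDict):
    task: str
    plan: List[str]
    steps_completed: int
    final_report: str


def route_after_executor(state: AgentState) -> str:
    """Return the label of the next node: loop until every step is done."""
    if state["steps_completed"] < len(state["plan"]):
        return "executor"
    return "done"


in_progress: AgentState = {
    "task": "t", "plan": ["step1", "step2"], "steps_completed": 1, "final_report": "",
}
finished: AgentState = {
    "task": "t", "plan": ["step1", "step2"], "steps_completed": 2, "final_report": "",
}
next_a = route_after_executor(in_progress)  # still one step left
next_b = route_after_executor(finished)     # every step completed
```

In LangGraph, a function like this is typically passed to the graph builder's add_conditional_edges together with a mapping from each label to a node, with the "done" label mapped to END.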

3. The Temporal Workflow Orchestration

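This is where the two layers meet. We drive the graph's nodes from a Temporal Workflow, one node per Activity. Instead of running the whole graph in one go, the Workflow runs it step by step, so Temporal records the result of every node in the event history. The loop below mirrors the planner-to-executor flow of the graph; the five-minute timeout is an illustrative value.

from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def run_planner(state: dict) -> dict:
    # LLM calls live in Activities, so their results are recorded in history
    return planner_node(state)


@activity.defn
async def run_executor(state: dict) -> dict:
    return executor_node(state)


@workflow.defn
class AgenticWorkflow:
    @workflow.run
    async def run(self, task: str) -> str:
        state = {"task": task, "plan": [], "steps_completed": 0, "final_report": ""}
        timeout = timedelta(minutes=5)  # illustrative value

        # If the worker fails, a replacement replays the recorded results
        # instead of re-running the nodes that already completed.
        state.update(await workflow.execute_activity(run_planner, state, start_to_close_timeout=timeout))
        while state["steps_completed"] < len(state["plan"]):
            state.update(await workflow.execute_activity(run_executor, state, start_to_close_timeout=timeout))
            workflow.logger.info(f"Steps completed: {state['steps_completed']}")

        return state["final_report"]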

Handling Long-Running Tasks and Human-in-the-loop

One of the biggest challenges in agentic workflows is the "Human-in-the-loop" (HITL) requirement. An agent might generate a plan that requires a manager's approval before proceeding with expensive operations.

Temporal handles this elegantly using Signals. A Signal is an external message sent to a running workflow.

Implementing Approval Gates

In a standard LangGraph setup, you might use a breakpoint (for example, by compiling the graph with interrupt_before). In a Temporal-integrated system, the workflow instead calls workflow.wait_condition() and stays idle, consuming no worker CPU, until a signal is received.

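from datetime import timedelta

from temporalio import workflow

# generate_plan_act and execute_plan_act are Activities assumed to be
# defined elsewhere with @activity.defn. Timeout values are illustrative.


@workflow.defn
class ResearchWorkflow:
    def __init__(self) -> None:
        self.approved = False

    @workflow.signal
    def approve_plan(self) -> None:
        self.approved = True

    @workflow.run
    async def run(self, topic: str):
        # 1. Generate plan via LangGraph activity
        plan = await workflow.execute_activity(
            generate_plan_act, topic, start_to_close_timeout=timedelta(minutes=5)
        )

        # 2. Wait durably for the human signal
        await workflow.wait_condition(lambda: self.approved)

        # 3. Proceed with execution
        result = await workflow.execute_activity(
            execute_plan_act, plan, start_to_close_timeout=timedelta(minutes=30)
        )
        return result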

This approach ensures that even if the human takes three days to click "Approve," the agent's state is safely stored in Temporal's history, immune to server upgrades or restarts.
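In practice an approval gate often needs a deadline, so that an ignored request does not wait forever. The sketch below is an in-process analogy built on asyncio, not Temporal code: ApprovalGate, approve, and wait_for_approval are hypothetical names, and unlike a Temporal workflow this version does not survive a process restart.

```python
import asyncio
from typing import Tuple


class ApprovalGate:
    """In-process stand-in for a workflow that waits on a signal."""

    def __init__(self) -> None:
        self._approved = asyncio.Event()

    def approve(self) -> None:
        # Plays the role of the signal handler.
        self._approved.set()

    async def wait_for_approval(self, timeout_seconds: float) -> bool:
        """Return True if approved in time, False if the deadline passed."""
        try:
            await asyncio.wait_for(self._approved.wait(), timeout=timeout_seconds)
            return True
        except asyncio.TimeoutError:
            return False


async def demo() -> Tuple[bool, bool]:
    approved_gate = ApprovalGate()
    # Simulate a human clicking "Approve" shortly after the wait begins.
    asyncio.get_running_loop().call_later(0.01, approved_gate.approve)
    was_approved = await approved_gate.wait_for_approval(timeout_seconds=1.0)

    ignored_gate = ApprovalGate()  # nobody ever approves this one
    was_ignored_approved = await ignored_gate.wait_for_approval(timeout_seconds=0.05)
    return was_approved, was_ignored_approved


approved, ignored_approved = asyncio.run(demo())
```

The same two-outcome shape (approved, or deadline passed) is what the workflow would branch on, for example to escalate or cancel the plan when nobody responds.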

Solving the Non-Determinism Problem

Temporal requires that Workflow code be deterministic. This is a challenge because LLMs are inherently non-deterministic. If you run an LLM call directly inside a Temporal Workflow, and the workflow needs to "replay" its history after a crash, the LLM might return a different result, causing a non-determinism error.

The Rule: Never call an LLM (or any API) directly inside a Temporal Workflow function. Always wrap LLM calls in Temporal Activities.

Completed Activities are not re-executed during replay; their results are recorded in the workflow history. When a workflow is replayed after a crash, Temporal simply looks up the result of each completed Activity from the history instead of running it again. This guarantees that the LangGraph state remains consistent during a recovery event.
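The record-then-look-up behavior can be illustrated with a toy model. This is a simplified sketch of the idea only, not how Temporal is implemented: RecordedHistory, run_workflow, and fake_llm are hypothetical names, and the "LLM" is a counter that would return a different answer if it were called again.

```python
from typing import Any, Callable, List


class RecordedHistory:
    """Toy event history: stores each activity result by its sequence position."""

    def __init__(self) -> None:
        self.results: List[Any] = []


def run_workflow(history: RecordedHistory, call_llm: Callable[[str], str]) -> str:
    """Run two 'activities' in order, reusing recorded results on replay."""
    position = 0

    def execute_activity(fn: Callable[[str], str], arg: str) -> str:
        nonlocal position
        if position < len(history.results):
            result = history.results[position]  # replay: read from history
        else:
            result = fn(arg)  # first run: execute and record
            history.results.append(result)
        position += 1
        return result

    plan = execute_activity(call_llm, "make a plan")
    return execute_activity(call_llm, f"execute: {plan}")


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    # Non-deterministic stand-in: every real call yields a new answer.
    calls.append(prompt)
    return f"answer-{len(calls)}"


history = RecordedHistory()
first = run_workflow(history, fake_llm)     # executes both activities
replayed = run_workflow(history, fake_llm)  # served entirely from history
```

Even though fake_llm would answer differently on a third call, the replay returns the same value as the first run because it never calls the function again.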

Real-World Example: The "Self-Healing" Data Pipeline

Imagine an agent responsible for extracting data from various PDF sources and inserting it into a database.

Node 1 (LangGraph): Analyze PDF structure and decide on extraction logic.
Activity 1 (Temporal): Perform the OCR/Extraction. If the OCR service is down, Temporal retries with exponential backoff.
Node 2 (LangGraph): Validate extracted data against a schema.
Activity 2 (Temporal): Write to the DB. If the write fails because of a lock, Temporal retries it according to the Activity's retry policy.

If the worker runs out of memory during the third step (Node 2), another worker picks up the workflow and replays its history. It sees that Activity 1 succeeded, retrieves the extracted PDF data from history, and jumps straight back to the LangGraph validation node. No redundant API costs, no lost progress.
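The exponential backoff mentioned for Activity 1 follows a simple rule: each wait is the previous one multiplied by a coefficient, capped at a maximum. The sketch below computes such a schedule in plain Python. The parameter names mirror the fields of Temporal's retry policy, but backoff_schedule itself is a hypothetical helper and the values are illustrative.

```python
from datetime import timedelta
from typing import List


def backoff_schedule(
    initial_interval: timedelta,
    backoff_coefficient: float,
    maximum_interval: timedelta,
    maximum_attempts: int,
) -> List[timedelta]:
    """Return the wait before each retry (maximum_attempts - 1 retries in total)."""
    waits: List[timedelta] = []
    for retry in range(maximum_attempts - 1):
        wait = initial_interval * (backoff_coefficient ** retry)
        waits.append(min(wait, maximum_interval))  # cap the growth
    return waits


schedule = backoff_schedule(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=10),
    maximum_attempts=6,
)
seconds = [w.total_seconds() for w in schedule]
# seconds == [1.0, 2.0, 4.0, 8.0, 10.0]
```

The cap matters for long outages: without it, the wait between attempts would keep doubling and recovery after the OCR service returns could be delayed far longer than necessary.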

Performance and Scaling Considerations

Combining these two systems adds complexity, so a few practices help keep the result scalable.

State Size: Keep your LangGraph state lean. Temporal records the inputs and results that pass between the Workflow and its Activities in the event history, so massive state objects (like large document buffers) bloat that history and can run into payload size limits. Store large blobs in S3 and pass references (URIs) in the state.
Worker Sharding: Temporal routes work through task queues, so you can dedicate some workers to "Heavy Compute" (for example, LLM nodes that run local models) and others to "IO-bound" tasks (database activities). LangGraph's checkpointer complements this by letting a different process reload a graph's state from a shared database.
Observability: Temporal's Web UI shows the event history of each workflow, including every Activity input, result, retry, and signal. This is invaluable for debugging why an agent "hallucinated" or got stuck in a loop.
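The State Size advice above can be sketched as follows. A dictionary stands in for S3, and put_blob, get_blob, and the blob:// scheme are hypothetical names chosen for this illustration; the point is only the difference in serialized size between embedding a document and referencing it.

```python
import hashlib
import json
from typing import Dict

# Hypothetical stand-in for an object store such as S3.
blob_store: Dict[str, bytes] = {}


def put_blob(data: bytes) -> str:
    """Store the bytes and return a content-addressed reference."""
    uri = f"blob://{hashlib.sha256(data).hexdigest()}"
    blob_store[uri] = data
    return uri


def get_blob(uri: str) -> bytes:
    """Fetch the bytes back when a step actually needs them."""
    return blob_store[uri]


document = b"x" * 1_000_000  # pretend this is 1 MB of extracted PDF text

# Embedding the document makes every recorded copy of the state about 1 MB.
inline_state = {"task": "extract", "document": document.decode()}
# Passing a reference keeps the state at roughly a hundred bytes.
reference_state = {"task": "extract", "document_uri": put_blob(document)}

inline_size = len(json.dumps(inline_state))
reference_size = len(json.dumps(reference_state))
```

Because the reference is derived from the content hash, storing the same document twice yields the same URI, which also keeps a retried upload harmless.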

Conclusion: The Path to Production AI

Moving AI agents from "cool demo" to "mission-critical infrastructure" requires a shift in mindset. We must stop treating agents as simple scripts and start treating them as distributed state machines.

By using LangGraph to model the complex, cyclic reasoning of the agent and Temporal to provide the durable execution environment, you solve the most difficult problems in AI orchestration: reliability, state recovery, and human-in-the-loop interaction.

Actionable Next Steps:

Identify the Side Effects: Audit your existing agents for external calls (APIs, DBs, Emails). Wrap these in Temporal Activities.
Externalize State: Move your LangGraph state into a Temporal Workflow to benefit from automatic persistence.
Implement Signals: Use Temporal Signals for any step where an agent might need to wait for an external event or human approval.

This architecture ensures that your agents are not just smart, but also resilient enough to handle the chaos of production environments.