<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Reliability | The AWS Blog</title><link>https://theawsblog.com/tags/reliability/</link><description>Articles, tutorials and insights from the AWS community.</description><generator>Hugo</generator><language>en</language><managingEditor>@theawsblog (The AWS Blog)</managingEditor><webMaster>@theawsblog</webMaster><lastBuildDate>Mon, 29 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://theawsblog.com/tags/reliability/index.xml" rel="self" type="application/rss+xml"/><item><title>Lambda durable functions fit the messy middle of agent workflows</title><link>https://theawsblog.com/news/emiliano-montesdeoca/lambda-durable-functions-agent-workflows/</link><pubDate>Mon, 29 Jun 2026 00:00:00 +0000</pubDate><author>Emiliano Montesdeoca</author><guid>https://theawsblog.com/news/emiliano-montesdeoca/lambda-durable-functions-agent-workflows/</guid><description>AWS Lambda durable functions give multi-agent and human-in-the-loop workflows checkpointing, replay, callbacks, and polling without forcing every team to assemble custom orchestration infrastructure.</description><content:encoded>&lt;p&gt;Agent workflows are rarely a neat chain of fast API calls. They wait for humans, retry model calls, poll external systems, compensate for failed steps, and burn money when the same expensive operation runs twice.&lt;/p&gt;
&lt;p&gt;The AWS Compute Blog post on &lt;a href="https://aws.amazon.com/blogs/compute/building-fault-tolerant-multi-agent-ai-workflows-with-aws-lambda-durable-functions/"&gt;building fault-tolerant multi-agent AI workflows with AWS Lambda durable functions&lt;/a&gt; is useful because it focuses on that messy middle. The healthcare prior authorization example is domain-specific, but the orchestration problem is common.&lt;/p&gt;
&lt;h2 id="what-changed"&gt;What changed&lt;/h2&gt;
&lt;p&gt;Lambda durable functions extend the Lambda programming model with checkpoint and replay operations. The source article highlights patterns such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;context.step()&lt;/code&gt; for checkpointed work,&lt;/li&gt;
&lt;li&gt;callback waits for human review or external completion signals,&lt;/li&gt;
&lt;li&gt;condition waits for polling long-running systems,&lt;/li&gt;
&lt;li&gt;replay behavior that skips completed durable operations,&lt;/li&gt;
&lt;li&gt;execution metrics and status events for operational visibility.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That means a long-running workflow can pause without incurring compute charges during the wait, then resume from its last checkpoint when a callback arrives or a polled condition is met.&lt;/p&gt;
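&lt;p&gt;The checkpoint-and-replay idea can be sketched in a few lines. This is a toy, in-memory illustration and not the Lambda durable functions SDK: &lt;code&gt;ToyDurableContext&lt;/code&gt;, its dictionary-backed store, and the two fake agent functions are all assumptions made up for the example. Only the name &lt;code&gt;step&lt;/code&gt; echoes the &lt;code&gt;context.step()&lt;/code&gt; pattern mentioned in the source article.&lt;/p&gt;

```python
import json


class ToyDurableContext:
    """Hypothetical in-memory stand-in for a durable execution context."""

    def __init__(self, checkpoints=None):
        # Maps a step name to its saved, JSON-serialized result.
        self.checkpoints = {} if checkpoints is None else checkpoints

    def step(self, name, fn):
        # Replay path: the step already completed, so return the stored result.
        if name in self.checkpoints:
            return json.loads(self.checkpoints[name])
        # First run: do the work, then checkpoint the result before moving on.
        result = fn()
        self.checkpoints[name] = json.dumps(result)
        return result


calls = {"extract": 0, "summarize": 0}
fail_once = {"pending": True}


def extract():
    # Stands in for an expensive extraction agent call.
    calls["extract"] += 1
    return {"document_id": "doc-123", "codes": ["A1", "B2"]}


def summarize(extracted):
    # Stands in for a later agent call that fails transiently on first use.
    calls["summarize"] += 1
    if fail_once["pending"]:
        fail_once["pending"] = False
        raise RuntimeError("transient model error")
    return "codes: " + ", ".join(extracted["codes"])


def workflow(ctx):
    extracted = ctx.step("extract", extract)
    return ctx.step("summarize", lambda: summarize(extracted))


store = {}
try:
    workflow(ToyDurableContext(store))
except RuntimeError:
    pass  # the first attempt fails after "extract" was checkpointed

# The second attempt replays: "extract" is skipped, "summarize" runs again.
summary = workflow(ToyDurableContext(store))
print(summary, calls)
```

&lt;p&gt;In this sketch the extraction function runs once across both attempts, while the failed summarization runs twice. That is the property that keeps an expensive completed step from repeating after a later transient failure.&lt;/p&gt;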
&lt;h2 id="why-it-matters-for-ai-systems"&gt;Why it matters for AI systems&lt;/h2&gt;
&lt;p&gt;Multi-agent workflows are expensive and non-deterministic. If an extraction agent, reasoning agent, or synthesis agent has already completed, you usually do not want a transient failure later in the flow to rerun everything from the start.&lt;/p&gt;
&lt;p&gt;Durable checkpointing directly addresses that. It also makes human-in-the-loop patterns less awkward. A review step can suspend for hours or days without holding a running process open, and teams do not have to build separate queue-and-database plumbing for every workflow.&lt;/p&gt;
&lt;p&gt;This is not only an AI feature. It is orchestration for any process where work is valuable, waits are long, and retries must be controlled.&lt;/p&gt;
&lt;h2 id="the-architectural-trade-off"&gt;The architectural trade-off&lt;/h2&gt;
&lt;p&gt;The biggest benefit is also the thing to design carefully: workflow logic lives in code.&lt;/p&gt;
&lt;p&gt;That can be cleaner than stitching together many small infrastructure pieces, but it requires discipline. Builders need deterministic step boundaries, idempotent external calls, clear timeout behavior, and replay-aware logging. If a step charges a credit card, submits a claim, sends an email, or updates a ticket, every retry must reuse the same idempotency key so the downstream system can recognize the duplicate and not repeat the side effect.&lt;/p&gt;
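&lt;p&gt;One way to keep a key stable across retries is to derive it from identifiers that do not change between attempts. The sketch below assumes an execution ID and a step name are available. The &lt;code&gt;idempotency_key&lt;/code&gt; helper and the &lt;code&gt;FakeClaimsApi&lt;/code&gt; class are hypothetical and only illustrate the deduplication contract, not any real service.&lt;/p&gt;

```python
import hashlib


def idempotency_key(execution_id, step_name):
    # Derived only from stable identifiers, so every retry or replay of the
    # same step in the same execution produces the same key.
    raw = f"{execution_id}:{step_name}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class FakeClaimsApi:
    """Hypothetical downstream service that deduplicates by idempotency key."""

    def __init__(self):
        self.accepted = {}

    def submit(self, key, claim):
        # A repeated key returns the original receipt instead of a new claim.
        if key not in self.accepted:
            self.accepted[key] = {"receipt": len(self.accepted) + 1, "claim": claim}
        return self.accepted[key]


api = FakeClaimsApi()
key = idempotency_key("exec-42", "submit-claim")
first = api.submit(key, {"amount": 120})
# A retry recomputes the key from the same inputs and gets the same receipt.
retry = api.submit(idempotency_key("exec-42", "submit-claim"), {"amount": 120})
print(first == retry, len(api.accepted))
```

&lt;p&gt;A key generated from a random value or a timestamp inside the step would differ on each attempt and defeat the deduplication, which is why this sketch uses only inputs that survive a replay.&lt;/p&gt;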
&lt;p&gt;You also need to decide when Step Functions is still a better fit. If the team benefits from visual state machines, service integrations, explicit workflow definitions, or non-developer operators inspecting the flow, Step Functions remains strong. Durable functions are attractive when the orchestration is code-heavy, developer-owned, and benefits from staying close to the Lambda handler.&lt;/p&gt;
&lt;h2 id="what-builders-should-do-next"&gt;What builders should do next&lt;/h2&gt;
&lt;p&gt;Do not start by converting every workflow. Start with one painful process that has three properties:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;expensive completed steps that should not repeat,&lt;/li&gt;
&lt;li&gt;long waits for humans or external systems,&lt;/li&gt;
&lt;li&gt;retry behavior that is currently implemented with custom glue.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Then design the durable function around failure cases first. What happens if an agent times out? If the human rejects the result? If the external API accepts a submission but the response is lost? If the workflow exceeds the business deadline?&lt;/p&gt;
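&lt;p&gt;The deadline question can be made concrete with a small polling loop that returns an explicit outcome instead of waiting forever. This is a plain-Python sketch under stated assumptions: &lt;code&gt;poll_until&lt;/code&gt; and &lt;code&gt;FakeClock&lt;/code&gt; are hypothetical names, and the injected &lt;code&gt;sleep&lt;/code&gt; stands in for whatever wait primitive the durable runtime provides.&lt;/p&gt;

```python
def poll_until(check, interval_s, deadline_s, sleep, now):
    """Poll check() until it returns a non-None result or the deadline would pass."""
    started = now()
    while True:
        result = check()
        if result is not None:
            return {"status": "done", "result": result}
        # Stop if the next wait would carry us past the business deadline.
        if now() - started + interval_s > deadline_s:
            return {"status": "deadline_exceeded", "result": None}
        sleep(interval_s)


class FakeClock:
    """Hypothetical clock so the example runs instantly and deterministically."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


# The external system answers on the third poll, well inside the deadline.
clock = FakeClock()
responses = iter([None, None, "approved"])
outcome = poll_until(lambda: next(responses), 10, 60, clock.sleep, clock.now)

# The external system never answers, so the loop reports the missed deadline.
slow_clock = FakeClock()
timed_out = poll_until(lambda: None, 10, 25, slow_clock.sleep, slow_clock.now)
print(outcome, timed_out)
```

&lt;p&gt;Returning a status value for the deadline case forces the workflow to decide what happens next, such as escalating to a human or cancelling the request.&lt;/p&gt;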
&lt;p&gt;The source example uses prior authorization, but the pattern applies to code review agents, document processing, procurement approvals, incident remediation, and migration assessment pipelines.&lt;/p&gt;
&lt;p&gt;The practical takeaway: durable functions are not just about making Lambda run longer. They are about making long-running workflows resumable, observable, and cheaper to wait on.&lt;/p&gt;</content:encoded></item></channel></rss>