Lambda | The AWS Blog

Lambda durable functions fit the messy middle of agent workflows

Emiliano Montesdeoca — Mon, 29 Jun 2026 00:00:00 +0000

Agent workflows are rarely a neat chain of fast API calls. They wait for humans, retry model calls, poll external systems, compensate for failed steps, and burn money when the same expensive operation runs twice.

The AWS Compute Blog post on building fault-tolerant multi-agent AI workflows with AWS Lambda durable functions is useful because it focuses on that messy middle. The healthcare prior authorization example is domain-specific, but the orchestration problem is common.

What changed

Lambda durable functions extend the Lambda programming model with checkpoint and replay operations. The source article highlights patterns such as:

context.step() for checkpointed work,
callback waits for human review or external completion signals,
condition waits for polling long-running systems,
replay behavior that skips completed durable operations,
execution metrics and status events for operational visibility.

That means a long-running workflow can pause without paying compute charges during the wait, then resume from the right point when a callback arrives or a condition changes.
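To make the skip-on-replay behavior concrete, here is a minimal in-memory sketch in plain Python. It is not the Lambda durable functions SDK: `ToyContext`, its `step` method, and the dictionary used as a journal are hypothetical stand-ins, and the agent functions are placeholders. It only illustrates the idea that a completed step returns its recorded result on replay instead of running again.

```python
from typing import Any, Callable, Dict


class ToyContext:
    """Hypothetical stand-in for a durable context; not the real SDK."""

    def __init__(self, journal: Dict[str, Any]) -> None:
        # The journal plays the role of durable checkpoint storage.
        self.journal = journal

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # On replay, a completed step returns its recorded result
        # instead of running again.
        if name in self.journal:
            return self.journal[name]
        result = fn()
        self.journal[name] = result
        return result


calls = {"extract": 0, "reason": 0}
fail_once = {"armed": True}


def extract() -> str:
    calls["extract"] += 1
    return "extracted-fields"


def reason() -> str:
    calls["reason"] += 1
    if fail_once["armed"]:
        fail_once["armed"] = False
        raise RuntimeError("transient model error")
    return "approved"


def workflow(ctx: ToyContext) -> str:
    fields = ctx.step("extract", extract)
    decision = ctx.step("reason", reason)
    return f"{fields}:{decision}"


journal: Dict[str, Any] = {}
try:
    workflow(ToyContext(journal))       # first attempt fails in "reason"
except RuntimeError:
    pass
result = workflow(ToyContext(journal))  # replay skips "extract"
```

In this toy run the extraction function executes once even though the workflow runs twice, which is the property that keeps an expensive agent call from repeating after a later transient failure.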

Why it matters for AI systems

Multi-agent workflows are expensive and non-deterministic. If an extraction agent, reasoning agent, or synthesis agent has already completed, you usually do not want a transient failure later in the flow to rerun everything from the start.

Durable checkpointing directly addresses that. It also makes human-in-the-loop patterns less awkward. A review step can suspend for hours or days without holding a running process open or building separate queue-and-database plumbing for every workflow.

This is not only an AI feature. It is orchestration for any process where work is valuable, waits are long, and retries must be controlled.

The architectural trade-off

The biggest benefit is also the thing to design carefully: workflow logic lives in code.

That can be cleaner than stitching together many small infrastructure pieces, but it requires discipline. Builders need deterministic step boundaries, idempotent external calls, clear timeout behavior, and replay-aware logging. If a step charges a credit card, submits a claim, sends an email, or updates a ticket, every retry must reuse the same idempotency key so the downstream system can recognize the duplicate.
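One way to keep an idempotency key stable across retries is to derive it from values that do not change between attempts. The sketch below assumes an execution identifier and a step name are available; `idempotency_key`, `FakeClaimsApi`, and the identifiers used are hypothetical, and the fake service only mimics a downstream system that deduplicates by key.

```python
import hashlib
from typing import Dict


def idempotency_key(execution_id: str, step_name: str) -> str:
    """Derive a key that is stable across retries and replays."""
    raw = f"{execution_id}:{step_name}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


class FakeClaimsApi:
    """Hypothetical downstream service that deduplicates by key."""

    def __init__(self) -> None:
        self.submissions: Dict[str, str] = {}

    def submit(self, key: str, payload: str) -> str:
        # A repeated key returns the first result and has no new side effect.
        if key not in self.submissions:
            self.submissions[key] = f"claim-{len(self.submissions) + 1}"
        return self.submissions[key]


api = FakeClaimsApi()
key = idempotency_key("exec-123", "submit-claim")
first = api.submit(key, "payload")
retry = api.submit(key, "payload")  # e.g. the first response was lost
```

A random key generated inside the step would defeat this, because each retry would look like a new request to the downstream system.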

You also need to decide when Step Functions is still a better fit. If the team benefits from visual state machines, service integrations, explicit workflow definitions, or non-developer operators inspecting the flow, Step Functions remains strong. Durable functions are attractive when the orchestration is code-heavy, developer-owned, and benefits from staying close to the Lambda handler.

What builders should do next

Do not start by converting every workflow. Start with one painful process that has three properties:

expensive completed steps that should not repeat,
long waits for humans or external systems,
retry behavior that is currently implemented with custom glue.

Then design the durable function around failure cases first. What happens if an agent times out? If the human rejects the result? If the external API accepts a submission but the response is lost? If the workflow exceeds the business deadline?

The source example uses prior authorization, but the pattern applies to code review agents, document processing, procurement approvals, incident remediation, and migration assessment pipelines.

The practical takeaway: durable functions are not just about making Lambda run longer. They are about making long-running workflows resumable, observable, and cheaper to wait on.

S3 Files makes Lambda file workflows simpler, but not automatically better

Emiliano Montesdeoca — Wed, 24 Jun 2026 00:00:00 +0000

A lot of Lambda code that works with S3 is not complicated because the business logic is complicated. It is complicated because the function has to download an object, manage /tmp, process the file, upload the result, and clean up after itself.

The AWS Compute Blog post on modernizing Lambda and S3 workloads with Amazon S3 Files shows a different model: mount an S3-backed file system and let the function use normal file paths.

That sounds small. For many workloads, it changes the shape of the code.

What changed

S3 Files lets a Lambda function mount an S3 bucket as a file system at a path such as /mnt/data. Code can open, read, write, list, and organize files using filesystem operations while S3 remains the durable backing store.

The source article uses examples such as image processing, ETL pipelines, and multi-agent shared workspaces. In each case, the function moves away from explicit S3 transfer code and toward direct file I/O.

That removes a surprising amount of glue:

no manual download before processing,
no upload step after writing output,
less /tmp capacity management,
fewer cleanup paths for failed invocations,
simpler handoffs between functions that share a workspace.
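The difference in code shape can be sketched side by side. This is an illustration under assumptions, not code from the source article: `to_upper` stands in for real business logic, `FakeS3` is a test double that only mirrors the `download_file` and `upload_file` method shapes of a boto3 S3 client, and a temporary directory stands in for both the bucket and the mount path.

```python
import os
import shutil
import tempfile
from pathlib import Path


def to_upper(src: Path, dst: Path) -> None:
    """Stand-in for the real business logic: a file-in, file-out transform."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(src.read_text().upper())


def handle_with_transfers(s3, bucket: str, key: str) -> str:
    """Object API flow: download, process in scratch space, upload, clean up."""
    out_key = f"out/{os.path.basename(key)}"
    with tempfile.TemporaryDirectory() as scratch:
        local_in = Path(scratch) / "in.txt"
        local_out = Path(scratch) / "out.txt"
        s3.download_file(bucket, key, str(local_in))
        to_upper(local_in, local_out)
        s3.upload_file(str(local_out), bucket, out_key)
    return out_key


def handle_with_mount(mount_root: Path, key: str) -> str:
    """Mounted flow: the same transform, addressed by path only."""
    out_key = f"out/{os.path.basename(key)}"
    to_upper(mount_root / key, mount_root / out_key)
    return out_key


class FakeS3:
    """Test double with the method shapes of a boto3 S3 client."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def download_file(self, bucket: str, key: str, filename: str) -> None:
        shutil.copy(self.root / key, filename)

    def upload_file(self, filename: str, bucket: str, key: str) -> None:
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(filename, target)


with tempfile.TemporaryDirectory() as root_dir:
    root = Path(root_dir)  # stands in for both the bucket and /mnt/data
    (root / "in").mkdir()
    (root / "in" / "a.txt").write_text("hello")
    key_a = handle_with_transfers(FakeS3(root), "demo-bucket", "in/a.txt")
    via_api = (root / key_a).read_text()
    (root / key_a).unlink()
    key_b = handle_with_mount(root, "in/a.txt")
    via_mount = (root / key_b).read_text()
```

Both handlers produce the same output. The mounted version simply has no scratch directory, no transfer calls, and no cleanup path to get wrong.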

Why builders should care

The strongest use case is modernization of existing file-oriented libraries. Many image, document, ML, and data-processing libraries expect file paths. Without S3 Files, Lambda code often has to adapt object storage into temporary local files before real work can start.

S3 Files lets builders keep file-based code and still use S3 as the storage layer. That can make Lambda more attractive for workloads that previously moved to containers only because file handling became awkward.

The shared workspace pattern is also interesting for AI agents. If multiple Lambda functions collaborate on a session, a directory tree can be easier to reason about than a collection of object keys and serialized state blobs.

The trade-offs

I would not replace every S3 API call with file I/O.

Object APIs are still excellent when you want explicit object boundaries, event-driven processing, presigned URLs, lifecycle policies, replication, and direct control over metadata. File systems are easier for some code, but they can also hide transfer and consistency behavior behind familiar syntax.

Builders should validate:

throughput for large files,
behavior under high concurrency,
consistency expectations between producers and consumers,
VPC and mount target requirements,
IAM permissions for mount and write operations,
failure behavior when a function exits while writing.

Also remember that simpler code is not the same as simpler operations. If many functions share a writable workspace, you need naming conventions, cleanup policies, and safeguards against accidental overwrite.
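A naming convention and an overwrite safeguard can be quite small. The sketch below assumes a layout of `sessions/<session_id>/...` under the mount root, which is an invented convention for illustration, and the helper names are hypothetical. It also assumes exclusive-create mode behaves on the mounted file system as it does on a local disk, which should be verified under concurrency before relying on it.

```python
import tempfile
from pathlib import Path


def workspace_path(root: Path, session_id: str, name: str) -> Path:
    """Resolve a session-scoped path and reject anything that escapes it."""
    session_dir = (root / "sessions" / session_id).resolve()
    candidate = (session_dir / name).resolve()
    if session_dir not in candidate.parents:
        raise ValueError(f"path escapes session workspace: {name}")
    return candidate


def write_new(path: Path, data: str) -> None:
    """Create a file, failing instead of overwriting an existing one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "x") as handle:  # "x" raises FileExistsError if present
        handle.write(data)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)  # stands in for the mount path
    target = workspace_path(root, "session-42", "agent-a/notes.txt")
    write_new(target, "first draft")
    try:
        write_new(target, "second writer")
        overwritten = True
    except FileExistsError:
        overwritten = False
    try:
        workspace_path(root, "session-42", "../session-43/notes.txt")
        escaped = True
    except ValueError:
        escaped = False
    content = target.read_text()
```

The second writer fails loudly instead of silently replacing the first draft, and a relative path that climbs out of the session directory is rejected before any file is touched.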

What to do next

Look for Lambda functions where more code is spent on S3 transfers than on the actual business logic. Image resizing, report generation, data conversion, document processing, and agent workspaces are good candidates.

For each candidate, compare two versions:

current S3 API flow with /tmp,
S3 Files flow with mounted paths.

Measure duration, memory, error handling complexity, and cost. The winning design may differ by workload.

The useful takeaway is not that file systems are better than object APIs. It is that Lambda now has a cleaner option when the workload is naturally file-shaped.

Lambda MicroVMs make isolated sandboxes a serverless design choice

Emiliano Montesdeoca — Mon, 22 Jun 2026 00:00:00 +0000

AWS Lambda MicroVMs are interesting because they do not try to replace normal Lambda functions. They fill a different gap: workloads where the unit of isolation is not an event, but a user session, coding environment, agent run, scanner job, or other stateful sandbox.

The AWS announcement frames this around isolated sandboxes with full lifecycle control. That is the right framing. The practical value is not only that Firecracker provides VM-level isolation. It is that AWS is exposing a managed way to create, pause, resume, and retire those environments without asking every product team to become a virtualization platform team.

What changed

Lambda MicroVMs add a serverless compute primitive inside the Lambda family for running code in isolated, stateful execution environments. The source article describes several important properties:

each session can run in its own Firecracker-backed MicroVM,
environments can launch and resume from pre-initialized snapshots,
memory, disk, and running process state can survive during the session,
idle environments can be suspended by policy,
applications get a dedicated endpoint and short-lived request authentication.

That combination matters for applications that cannot fit cleanly into stateless request-response functions. A code interpreter, browser automation sandbox, vulnerability scanner, AI coding assistant, data notebook, or game scripting environment often needs process state and filesystem state between interactions.

Why builders should care

The old decision tree was uncomfortable. Virtual machines gave strong isolation but slow startup and more operational work. Containers started quickly but shared a kernel, which raised the bar for safely running untrusted code. Lambda functions were operationally simple but not designed for long-running interactive state.

Lambda MicroVMs create a new middle path. For builders, the design conversation becomes more precise:

Use Lambda functions for event handlers and short stateless tasks.
Use containers when you need packaging flexibility and can manage isolation risk.
Use Lambda MicroVMs when each tenant, user, or agent run needs a dedicated stateful sandbox.

This is especially relevant for AI systems. As more applications let agents write code, execute tools, inspect repositories, or process customer files, isolation becomes part of the product boundary. A prompt injection bug should not become a cross-tenant file access bug.

The trade-offs are still real

MicroVMs reduce a lot of infrastructure work, but they do not remove architecture responsibility.

First, lifecycle policy becomes a cost control. If idle sessions stay warm too long, the bill can drift. If they suspend too aggressively, users feel resume latency. Teams should treat idle duration as a product setting, not a default copied from a sample.
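The tension between idle cost and resume latency can be estimated from real interaction logs before picking a timeout. The sketch below is a simplified model, not AWS pricing: it assumes a session stays warm for at most the idle timeout within each gap between interactions and then suspends, and that the next interaction after a suspension costs one resume. The gap values are made up.

```python
from typing import List, Tuple


def idle_profile(gaps_s: List[float], idle_timeout_s: float) -> Tuple[float, int]:
    """Return (warm idle seconds, number of resumes) for one session.

    Each gap is the time between two user interactions. The session stays
    warm for at most idle_timeout_s of a gap, then suspends; the next
    interaction after a suspension costs one resume.
    """
    warm_idle = 0.0
    resumes = 0
    for gap in gaps_s:
        if gap > idle_timeout_s:
            warm_idle += idle_timeout_s
            resumes += 1
        else:
            warm_idle += gap
    return warm_idle, resumes


gaps = [20, 45, 600, 30, 1800]  # hypothetical gaps in seconds
short_timeout = idle_profile(gaps, idle_timeout_s=60)   # (215.0, 2)
long_timeout = idle_profile(gaps, idle_timeout_s=900)   # (1595.0, 1)
```

With these sample gaps, a 60-second timeout keeps warm idle time to 215 seconds at the cost of two resumes, while a 900-second timeout saves one resume but accumulates 1,595 warm idle seconds. Running the same calculation over real session traces gives a grounded starting point for the product setting.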

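Second, snapshot-based startup changes how applications initialize. A snapshot captures initialized state once, and every environment launched from it starts with the same copy. Code that generates unique state, opens long-lived external connections, or assumes initialization happens once per user action needs careful review.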
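One common way to keep unique state out of a shared snapshot is to create it lazily and reset it after restore. The sketch below is generic Python, not a Lambda MicroVMs API: `SessionState` and its `on_restore` method are hypothetical, and how a restore notification is actually delivered is an assumption to check against the service documentation. The pattern targets environments launched from a shared pre-initialized snapshot, not a single session pausing and resuming its own state.

```python
import uuid
from typing import Optional


class SessionState:
    """Holds state that must be unique per restored environment."""

    def __init__(self) -> None:
        # Deliberately nothing unique here: this may run before a snapshot
        # is taken, and every restored copy would share the value.
        self._session_token: Optional[str] = None

    def on_restore(self) -> None:
        """Hypothetical hook: call after launching from a shared snapshot."""
        self._session_token = None

    @property
    def session_token(self) -> str:
        # Generated lazily, so it is created after restore, not baked in.
        if self._session_token is None:
            self._session_token = uuid.uuid4().hex
        return self._session_token


state = SessionState()
before = state.session_token
same = state.session_token
state.on_restore()  # simulate a launch from snapshot
after = state.session_token
```

The same lazy approach applies to external connections: open them on first use after restore and reconnect on failure, instead of assuming a connection created during initialization is still valid.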

Third, stateful sandboxes need cleanup rules. Temporary files, credentials, downloaded packages, generated artifacts, and logs can accumulate. Builders should define what survives during a session, what is exported, and what is destroyed.

Finally, security does not stop at VM boundaries. The execution role, outbound network policy, source artifact pipeline, token handling, and tenant mapping are still part of the isolation story.

What to do next

I would start with workloads where the current workaround is obviously expensive: per-user EC2 sandboxes, over-hardened container runners, or Lambda workflows full of awkward /tmp and rehydration logic.

For a proof of concept, validate four things before celebrating:

Cold launch and resume behavior with your real image size and dependencies.
Idle cost profile for normal user behavior, not a synthetic benchmark.
Tenant boundary tests for filesystem, process, network, and IAM access.
Failure cleanup when a session crashes, times out, or is abandoned.

Lambda MicroVMs are not just another Lambda feature. They are AWS acknowledging that the next wave of serverless workloads includes interactive, stateful, sometimes untrusted execution. That is a useful primitive, as long as teams treat it as an isolation architecture rather than a shortcut around security design.

Lambda runtime upgrades need campaigns, not reminders

Emiliano Montesdeoca — Sat, 20 Jun 2026 00:00:00 +0000

Lambda runtime upgrades are easy to postpone because the function still works today. Then a deprecation date becomes a support risk, a security exception, or a rushed platform campaign.

The AWS Compute Blog post on upgrading Lambda function runtimes at scale with AWS Transform custom is useful because it treats runtime upgrades as a repeatable modernization workflow, not a calendar reminder.

What changed

AWS Transform custom can analyze codebases, identify runtime upgrade risk, apply transformations, update dependencies or configuration, and validate against exit criteria. The source article focuses on Lambda runtime upgrades, including cases where code changes are required, such as moving callback-style Node.js handlers toward async patterns.

For platform teams, the more important piece is campaign management. A large organization rarely has one repository and one owner. It has hundreds of functions across many accounts, teams, IaC stacks, and CI/CD systems.

Why builders should care

Runtime upgrades fail when ownership is unclear.

An application team may not know a function exists. A platform team may see the deprecated runtime but not understand the deployment path. Security may see the exposure but not own the code. The result is a spreadsheet-driven migration with too many manual exceptions.

A transformation tool can help, but only if it plugs into an operating model:

inventory functions and repositories,
map functions to owners,
assess code and dependency risk,
create pull requests or transformation branches,
validate with tests and alarms,
deploy with safe rollout and rollback,
track completion centrally.

That is a campaign, not a one-off task.

The trade-offs

Automated transformation is not a substitute for test coverage. It raises the value of tests, because it produces changes faster than humans can review every edge case by hand.

Before trusting a runtime upgrade, teams should check:

unit and integration tests for handler behavior,
dependency compatibility with the target runtime,
infrastructure definitions that set the runtime,
packaging and build pipeline changes,
observability after deployment,
alias or version-based rollback strategy.
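For the alias-based rollback item, the sketch below builds the arguments for the Lambda `UpdateAlias` API, which accepts a `RoutingConfig` with `AdditionalVersionWeights` for weighted traffic shifting. The helper names, the function name, the alias, and the version numbers are hypothetical. The helpers only construct dictionaries, so they can be reviewed and tested without AWS credentials.

```python
from typing import Any, Dict


def canary_alias_update(function_name: str, alias: str, stable_version: str,
                        new_version: str, new_weight: float) -> Dict[str, Any]:
    """Build arguments that send a fraction of traffic to new_version."""
    if not 0.0 <= new_weight <= 1.0:
        raise ValueError("new_weight must be between 0 and 1")
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": stable_version,
        "RoutingConfig": {"AdditionalVersionWeights": {new_version: new_weight}},
    }


def rollback_alias_update(function_name: str, alias: str,
                          stable_version: str) -> Dict[str, Any]:
    """Build arguments that return all traffic to stable_version."""
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": stable_version,
        "RoutingConfig": {"AdditionalVersionWeights": {}},
    }


# Hypothetical function: version 7 runs the old runtime, version 8 the new one.
canary = canary_alias_update("orders-handler", "live", "7", "8", 0.1)
rollback = rollback_alias_update("orders-handler", "live", "7")
```

With boto3 installed and credentials configured, either dictionary could be passed as `boto3.client("lambda").update_alias(**canary)`. Keeping the old version published until alarms stay quiet is what makes the rollback a single call.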

For platform teams, the other trade-off is centralization. Pushing changes across many repositories can be efficient, but service owners still need to understand and approve behavior changes. The best model is usually centralized orchestration with distributed ownership.

What to do next

Start with inventory. Use AWS Config, Trusted Advisor, CLI scripts, tags, or deployment metadata to find functions approaching deprecation. Then group them by owner, runtime, business criticality, and test readiness.

For low-risk functions with good tests, an automated transform can move quickly. For high-risk functions with poor tests, the first transformation should probably add documentation, tests, or alarms before changing the runtime.
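The grouping and triage described above can be prototyped on an exported inventory before any tooling is wired in. In the sketch below, the record fields, the action labels, the sample functions, and the set of deprecated runtimes are all illustrative assumptions. The real inventory would come from the sources listed above, and the deprecated set should be taken from the current Lambda runtime deprecation schedule.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class FunctionRecord:
    name: str
    owner: str
    runtime: str
    critical: bool
    has_tests: bool


def first_action(fn: FunctionRecord, deprecated: Set[str]) -> str:
    """Pick the first campaign step for one function."""
    if fn.runtime not in deprecated:
        return "no-action"
    if not fn.has_tests:
        return "add-tests-first"
    return "review-then-transform" if fn.critical else "auto-transform"


def plan_by_owner(records: Iterable[FunctionRecord],
                  deprecated: Set[str]) -> Dict[str, List[str]]:
    """Group the required actions by owning team."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for fn in records:
        action = first_action(fn, deprecated)
        if action != "no-action":
            plan[fn.owner].append(f"{fn.name}: {action}")
    return dict(plan)


# Hypothetical inventory rows.
records = [
    FunctionRecord("resize-image", "media", "nodejs16.x", False, True),
    FunctionRecord("submit-claim", "claims", "python3.8", True, True),
    FunctionRecord("legacy-export", "claims", "nodejs16.x", False, False),
    FunctionRecord("notify", "media", "python3.12", False, True),
]
plan = plan_by_owner(records, deprecated={"nodejs16.x", "python3.8"})
```

The output is a per-owner work list, which is usually a better campaign artifact than a flat spreadsheet because each team sees only its own functions and the first step expected of it.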

The lesson from this AWS post is broader than Lambda. Modernization work becomes manageable when it is continuous, visible, and validated. If runtime upgrades are still handled as emergency cleanup, the tooling can help, but the process needs upgrading too.