Serverless | The AWS Blog

Lambda durable functions fit the messy middle of agent workflows

Emiliano Montesdeoca — Mon, 29 Jun 2026 00:00:00 +0000

Agent workflows are rarely a neat chain of fast API calls. They wait for humans, retry model calls, poll external systems, compensate for failed steps, and burn money when the same expensive operation runs twice.

The AWS Compute Blog post on building fault-tolerant multi-agent AI workflows with AWS Lambda durable functions is useful because it focuses on that messy middle. The healthcare prior authorization example is domain-specific, but the orchestration problem is common.

What changed

Lambda durable functions extend the Lambda programming model with checkpoint and replay operations. The source article highlights patterns such as:

context.step() for checkpointed work,
callback waits for human review or external completion signals,
condition waits for polling long-running systems,
replay behavior that skips completed durable operations,
execution metrics and status events for operational visibility.

That means a long-running workflow can pause without incurring compute charges during the wait, then resume from the last completed checkpoint when a callback arrives or a condition changes.
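To make checkpoint and replay concrete, here is a toy model in plain Python. It is not the Lambda durable functions SDK: ToyDurableContext, its step() signature, and the in-memory checkpoint dictionary are all assumptions made for illustration. The only point is to show how a replay can skip work that has already completed.

```python
from typing import Any, Callable, Dict


class ToyDurableContext:
    """Toy stand-in for a durable execution context (not the Lambda SDK)."""

    def __init__(self, checkpoints: Dict[str, Any]) -> None:
        # A real platform stores the checkpoint log durably; here it is a dict.
        self.checkpoints = checkpoints

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # On replay, a completed step returns its recorded result without re-running.
        if name in self.checkpoints:
            return self.checkpoints[name]
        result = fn()
        self.checkpoints[name] = result
        return result


calls = {"extract": 0, "reason": 0}


def extract() -> str:
    calls["extract"] += 1
    return "extracted-fields"


def reason(fail: bool) -> str:
    calls["reason"] += 1
    if fail:
        raise RuntimeError("transient model error")
    return "decision"


def workflow(ctx: ToyDurableContext, fail_reasoning: bool) -> str:
    fields = ctx.step("extract", extract)
    decision = ctx.step("reason", lambda: reason(fail_reasoning))
    return f"{fields}:{decision}"


log: Dict[str, Any] = {}
try:
    workflow(ToyDurableContext(log), fail_reasoning=True)
except RuntimeError:
    pass  # the first attempt fails after the extract step was checkpointed

# The second attempt replays the same log, so extract() is not called again.
result = workflow(ToyDurableContext(log), fail_reasoning=False)
```

In this sketch the expensive extract step runs once even though the workflow runs twice. The failed reasoning step is the only work that repeats.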

Why it matters for AI systems

Multi-agent workflows are expensive and non-deterministic. If an extraction agent, reasoning agent, or synthesis agent has already completed, you usually do not want a transient failure later in the flow to rerun everything from the start.

Durable checkpointing directly addresses that. It also makes human-in-the-loop patterns less awkward. A review step can suspend for hours or days without holding a running process open or building separate queue-and-database plumbing for every workflow.
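The human-in-the-loop case can be modeled the same way. The sketch below is a toy, not the real callback API: Suspended, ToyCallbackContext, and wait_for_callback() are hypothetical names. It assumes the platform ends the run when a callback is pending and replays the handler once the result has been delivered.

```python
from typing import Any, Callable, Dict


class Suspended(Exception):
    """Signals that the toy workflow is parked until a callback arrives."""


class ToyCallbackContext:
    """Toy model of a durable context with callback waits (not the Lambda SDK)."""

    def __init__(self, steps: Dict[str, Any], callbacks: Dict[str, Any]) -> None:
        self.steps = steps          # results of completed steps
        self.callbacks = callbacks  # results delivered by external parties

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name not in self.steps:
            self.steps[name] = fn()
        return self.steps[name]

    def wait_for_callback(self, callback_id: str) -> Any:
        # No process stays alive while waiting: the run ends here and a later
        # replay picks up the delivered result.
        if callback_id not in self.callbacks:
            raise Suspended(callback_id)
        return self.callbacks[callback_id]


runs = {"draft": 0}


def draft_summary() -> str:
    runs["draft"] += 1
    return "draft-summary"


def workflow(ctx: ToyCallbackContext) -> str:
    draft = ctx.step("draft", draft_summary)
    verdict = ctx.wait_for_callback("human-review")
    return f"{draft}:{verdict}"


steps: Dict[str, Any] = {}
callbacks: Dict[str, Any] = {}

try:
    workflow(ToyCallbackContext(steps, callbacks))
    suspended = False
except Suspended:
    suspended = True  # parked, waiting for the reviewer

callbacks["human-review"] = "approved"  # the reviewer responds hours later
final = workflow(ToyCallbackContext(steps, callbacks))
```

The draft is produced once, the wait holds no running process, and the second run finishes as soon as the review result exists.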

This is not only an AI feature. It is orchestration for any process where work is valuable, waits are long, and retries must be controlled.

The architectural trade-off

The biggest benefit is also the thing to design carefully: workflow logic lives in code.

That can be cleaner than stitching together many small infrastructure pieces, but it requires discipline. Builders need deterministic step boundaries, idempotent external calls, clear timeout behavior, and replay-aware logging. If a step charges a credit card, submits a claim, sends an email, or updates a ticket, every retry must reuse the same idempotency key so the side effect is applied only once.
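One way to get stable idempotency keys is to derive them from values that do not change between attempts. The sketch below assumes an execution ID and a step name are available. FakeClaimsApi is a hypothetical downstream system that deduplicates on the key, and real APIs differ in how they accept one.

```python
import hashlib
from typing import Any, Dict


def idempotency_key(execution_id: str, step_name: str) -> str:
    # Derived only from stable inputs, so a retry of the same step gets the same key.
    return hashlib.sha256(f"{execution_id}:{step_name}".encode()).hexdigest()


class FakeClaimsApi:
    """Hypothetical external system that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.submissions: Dict[str, Dict[str, Any]] = {}

    def submit(self, key: str, payload: Dict[str, Any]) -> str:
        # A repeated key returns the original receipt instead of a new submission.
        if key not in self.submissions:
            receipt = f"receipt-{len(self.submissions) + 1}"
            self.submissions[key] = {"receipt": receipt, "payload": payload}
        return self.submissions[key]["receipt"]


api = FakeClaimsApi()
key = idempotency_key("exec-123", "submit_claim")

first = api.submit(key, {"claim": "A"})
# Simulate a retry after the first response was lost in transit.
retry = api.submit(key, {"claim": "A"})
```

A key built from a timestamp or a random value at call time would defeat the purpose, because each retry would look like a new request.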

You also need to decide when Step Functions is still a better fit. If the team benefits from visual state machines, service integrations, explicit workflow definitions, or non-developer operators inspecting the flow, Step Functions remains strong. Durable functions are attractive when the orchestration is code-heavy, developer-owned, and benefits from staying close to the Lambda handler.

What builders should do next

Do not start by converting every workflow. Start with one painful process that has three properties:

expensive completed steps that should not repeat,
long waits for humans or external systems,
retry behavior that is currently implemented with custom glue.

Then design the durable function around failure cases first. What happens if an agent times out? If the human rejects the result? If the external API accepts a submission but the response is lost? If the workflow exceeds the business deadline?

The source example uses prior authorization, but the pattern applies to code review agents, document processing, procurement approvals, incident remediation, and migration assessment pipelines.

The practical takeaway: durable functions are not just about making Lambda run longer. They are about making long-running workflows resumable, observable, and cheaper to wait on.

OpenSearch Serverless next generation changes the economics of tenant isolation

Emiliano Montesdeoca — Wed, 24 Jun 2026 00:00:00 +0000

Multi-tenant search has always forced an uncomfortable trade-off: isolate tenants cleanly and pay for too much infrastructure, or pool tenants together and accept more operational and security complexity.

The AWS Big Data Blog post on multi-tenant search with Amazon OpenSearch Serverless next generation is important because it changes that cost model. Scale-to-zero compute and regional endpoint routing make collection-per-tenant designs more realistic.

What changed

The OpenSearch Serverless next-generation architecture lets collection groups scale compute to zero, although storage charges still apply. AWS also added a per-account regional endpoint that can route requests to collections using headers such as x-amz-aoss-collection-name.

That means a SaaS application can keep a cleaner collection-per-tenant model without managing a separate endpoint and connection pool for every tenant.

Why builders should care

Tenant isolation is easier to discuss in architecture reviews than to pay for in production.

A collection-per-tenant design gives strong isolation boundaries for data, workload behavior, encryption, and lifecycle, and it limits noisy-neighbor risk. But if each tenant carries a minimum always-on compute cost, the design breaks down for long-tail tenants.

Scale-to-zero compute makes the model more practical for SaaS platforms with many tenants that search occasionally. The regional endpoint also simplifies application routing. Instead of maintaining many endpoints, the application can target one endpoint and route by collection header.

The trade-offs

Collection-per-tenant is not automatically the best design.

For very small tenants with similar access patterns, pooled indexes may still be simpler. For very large tenants, dedicated collections or even separate domains may be appropriate. For regulated tenants, encryption, access policy, and audit boundaries may matter more than cost.

Builders should also design the tenant mapping layer carefully. A regional endpoint reduces endpoint sprawl, but the application still needs a reliable mapping from tenant ID to collection name or ID. That mapping becomes part of the security boundary.
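A minimal sketch of that mapping layer follows, under a few assumptions. The tenant IDs and collection names are invented, the mapping is an in-memory dictionary in place of a real store, and the endpoint URL and request signing are omitted. Only the header name comes from the article.

```python
from typing import Dict, Mapping

COLLECTION_HEADER = "x-amz-aoss-collection-name"  # header name cited in the article

# Hypothetical mapping; in practice this would live in a database or config store.
TENANT_COLLECTIONS: Dict[str, str] = {
    "tenant-a": "search-tenant-a",
    "tenant-b": "search-tenant-b",
}


class UnknownTenantError(Exception):
    """Raised when a tenant has no registered collection."""


def routing_headers(
    tenant_id: str, mapping: Mapping[str, str] = TENANT_COLLECTIONS
) -> Dict[str, str]:
    # The tenant ID should come from the authenticated session, not from a
    # value the caller can set freely.
    collection = mapping.get(tenant_id)
    if collection is None:
        # Fail closed: never fall back to a default or shared collection.
        raise UnknownTenantError(tenant_id)
    return {COLLECTION_HEADER: collection}


headers = routing_headers("tenant-a")
```

The important design choice is the failure mode. An unknown tenant raises an error, so a bug in the mapping cannot silently route one tenant's query to another tenant's collection.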

Operational questions remain:

How are collections created and deleted?
What happens when a dormant tenant becomes active?
How are per-tenant quotas enforced?
How are index templates and mappings rolled out safely?
How are tenant-level costs reported?

What to do next

For SaaS search workloads, revisit the tenancy model. Compare pooled, collection-per-tenant, and hybrid approaches using real tenant distribution, not an average tenant.

A practical path is:

Put high-volume tenants in dedicated collections.
Keep small tenants pooled or grouped if isolation requirements allow it.
Use collection-per-tenant where security, compliance, or noisy-neighbor risk justifies the boundary.
Automate collection lifecycle and mapping changes from the start.
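The placement rules above can be written down as a small policy function and run against the real tenant list. This is a sketch only: the Tenant fields, the query threshold, and the sample tenants are hypothetical, and a real policy would likely weigh more signals.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

HIGH_VOLUME_THRESHOLD = 1_000_000  # hypothetical monthly query cut-off


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    monthly_queries: int
    requires_isolation: bool  # security, compliance, or noisy-neighbor concern


def placement(tenant: Tenant) -> str:
    # High-volume tenants get their own collection for workload reasons.
    if tenant.monthly_queries >= HIGH_VOLUME_THRESHOLD:
        return "dedicated-collection"
    # Smaller tenants get their own collection only when the boundary is justified.
    if tenant.requires_isolation:
        return "collection-per-tenant"
    return "pooled"


tenants: List[Tenant] = [
    Tenant("big", 5_000_000, False),
    Tenant("bank", 20_000, True),
    Tenant("small-1", 500, False),
    Tenant("small-2", 1_200, False),
]

summary = Counter(placement(t) for t in tenants)
```

Running the policy over the whole distribution shows how many collections each rule creates, which is more useful for cost planning than reasoning about an average tenant.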

The takeaway is that OpenSearch Serverless is making stronger isolation less expensive. That does not remove design work, but it gives builders more room to choose isolation for good reasons instead of avoiding it because of minimum compute cost.

S3 Files makes Lambda file workflows simpler, but not automatically better

Emiliano Montesdeoca — Wed, 24 Jun 2026 00:00:00 +0000

A lot of Lambda code that works with S3 is complicated, and not because of the business logic. It is complicated because the function has to download an object, manage /tmp, process the file, upload the result, and clean up after itself.

The AWS Compute Blog post on modernizing Lambda and S3 workloads with Amazon S3 Files shows a different model: mount an S3-backed file system and let the function use normal file paths.

That sounds small. For many workloads, it changes the shape of the code.

What changed

S3 Files lets a Lambda function mount an S3 bucket as a file system at a path such as /mnt/data. Code can open, read, write, list, and organize files using filesystem operations while S3 remains the durable backing store.

The source article uses examples such as image processing, ETL pipelines, and multi-agent shared workspaces. In each case, the function moves away from explicit S3 transfer code and toward direct file I/O.

That removes a surprising amount of glue:

no manual download before processing,
no upload step after writing output,
less /tmp capacity management,
fewer cleanup paths for failed invocations,
simpler handoffs between functions that share a workspace.
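As a rough illustration of what the code looks like once the glue is gone, the sketch below processes a file through ordinary paths. A temporary directory stands in for the mount so the example runs anywhere. In Lambda the workspace would be a mounted path such as /mnt/data, and the function and file names here are hypothetical.

```python
import tempfile
from pathlib import Path


def uppercase_report(workspace: Path, src: str, dst: str) -> Path:
    """Read one file and write a transformed copy using ordinary file I/O."""
    out_path = workspace / dst
    out_path.parent.mkdir(parents=True, exist_ok=True)
    text = (workspace / src).read_text(encoding="utf-8")
    out_path.write_text(text.upper(), encoding="utf-8")
    return out_path


# Local stand-in for the mount; no download, upload, or /tmp handling appears.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "in").mkdir()
    (workspace / "in" / "report.txt").write_text("quarterly totals", encoding="utf-8")

    result_path = uppercase_report(workspace, "in/report.txt", "out/report.txt")
    output_text = result_path.read_text(encoding="utf-8")
```

The function body contains only the transformation. Everything that used to surround it, such as transfer calls and temporary file cleanup, has no equivalent here.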

Why builders should care

The strongest use case is modernization of existing file-oriented libraries. Many image, document, ML, and data-processing libraries expect file paths. Without S3 Files, Lambda code often has to adapt object storage into temporary local files before real work can start.

S3 Files lets builders keep file-based code and still use S3 as the storage layer. That can make Lambda more attractive for workloads that previously moved to containers only because file handling became awkward.

The shared workspace pattern is also interesting for AI agents. If multiple Lambda functions collaborate on a session, a directory tree can be easier to reason about than a collection of object keys and serialized state blobs.

The trade-offs

I would not replace every S3 API call with file I/O.

Object APIs are still excellent when you want explicit object boundaries, event-driven processing, presigned URLs, lifecycle policies, replication, and direct control over metadata. File systems are easier for some code, but they can also hide transfer and consistency behavior behind familiar syntax.

Builders should validate:

throughput for large files,
behavior under high concurrency,
consistency expectations between producers and consumers,
VPC and mount target requirements,
IAM permissions for mount and write operations,
failure behavior when a function exits while writing.

Also remember that simpler code is not the same as simpler operations. If many functions share a writable workspace, you need naming conventions, cleanup policies, and safeguards against accidental overwrite.
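Two of those safeguards can be sketched with standard file operations: a naming convention that namespaces paths by session and producer, and an exclusive create that refuses to overwrite. The path layout is hypothetical, and whether a given mounted file system honors exclusive-create semantics under concurrency is something to validate rather than assume.

```python
import os
import tempfile
from pathlib import Path


def artifact_path(workspace: Path, session_id: str, producer: str, filename: str) -> Path:
    # Namespacing by session and producer keeps writers out of each other's way.
    return workspace / "sessions" / session_id / producer / filename


def write_once(path: Path, data: bytes) -> bool:
    """Create the file only if it does not exist; return False instead of overwriting."""
    try:
        # O_EXCL makes creation fail if another writer already produced this path.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return True


# A temporary directory stands in for the shared workspace.
with tempfile.TemporaryDirectory() as tmp:
    target = artifact_path(Path(tmp), "sess-1", "resizer", "thumb.bin")
    target.parent.mkdir(parents=True)

    first_write = write_once(target, b"v1")
    second_write = write_once(target, b"v2")  # a second producer using the same path
    stored = target.read_bytes()
```

The second writer gets an explicit False instead of silently replacing the first result, which turns an accidental overwrite into something the caller has to handle.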

What to do next

Look for Lambda functions where more code is spent on S3 transfers than on the actual business logic. Image resizing, report generation, data conversion, document processing, and agent workspaces are good candidates.

For each candidate, compare two versions:

current S3 API flow with /tmp,
S3 Files flow with mounted paths.

Measure duration, memory, error handling complexity, and cost. The winning design may be different per workload.
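For the duration and memory part of that comparison, a small harness using only the standard library can be enough to start. The sketch below assumes each variant can be wrapped in a zero-argument callable. tracemalloc only sees allocations made through Python, so treat the memory figure as a rough signal and rely on the function's reported metrics for billing-relevant numbers.

```python
import time
import tracemalloc
from typing import Callable, Dict


def measure(fn: Callable[[], object], runs: int = 5) -> Dict[str, float]:
    """Time several runs, then trace one extra run for peak Python memory."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    durations.sort()

    # Memory is traced in a separate run so tracing overhead does not skew timings.
    tracemalloc.start()
    fn()
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "median_s": durations[len(durations) // 2],
        "max_s": durations[-1],
        "peak_python_bytes": float(peak_bytes),
    }


# Placeholder workload; swap in the S3 API flow and the mounted-path flow.
stats = measure(lambda: sum(range(100_000)), runs=3)
```

Running the same harness against both versions with representative file sizes gives a like-for-like baseline before looking at cost.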

The useful takeaway is not that file systems are better than object APIs. It is that Lambda now has a cleaner option when the workload is naturally file-shaped.

Lambda MicroVMs make isolated sandboxes a serverless design choice

Emiliano Montesdeoca — Mon, 22 Jun 2026 00:00:00 +0000

AWS Lambda MicroVMs are interesting because they do not try to replace normal Lambda functions. They fill a different gap: workloads where the unit of isolation is not an event, but a user session, coding environment, agent run, scanner job, or other stateful sandbox.

The AWS announcement frames this around isolated sandboxes with full lifecycle control. That is the right framing. The practical value is not only that Firecracker provides VM-level isolation. It is that AWS is exposing a managed way to create, pause, resume, and retire those environments without asking every product team to become a virtualization platform team.

What changed

Lambda MicroVMs add a serverless compute primitive inside the Lambda family for running code in isolated, stateful execution environments. The source article describes several important properties:

each session can run in its own Firecracker-backed MicroVM,
environments can launch and resume from pre-initialized snapshots,
memory, disk, and running process state can survive during the session,
idle environments can be suspended by policy,
applications get a dedicated endpoint and short-lived request authentication.

That combination matters for applications that cannot fit cleanly into stateless request-response functions. A code interpreter, browser automation sandbox, vulnerability scanner, AI coding assistant, data notebook, or game scripting environment often needs process state and filesystem state between interactions.

Why builders should care

The old decision tree was uncomfortable. Virtual machines gave strong isolation but came with slow startup and more operational work. Containers started quickly but shared a kernel, which raised the bar for safely running untrusted code. Lambda functions were operationally simple but not designed for long-running interactive state.

Lambda MicroVMs create a new middle path. For builders, the design conversation becomes more precise:

Use Lambda functions for event handlers and short stateless tasks.
Use containers when you need packaging flexibility and can manage isolation risk.
Use Lambda MicroVMs when each tenant, user, or agent run needs a dedicated stateful sandbox.

This is especially relevant for AI systems. As more applications let agents write code, execute tools, inspect repositories, or process customer files, isolation becomes part of the product boundary. A prompt injection bug should not become a cross-tenant file access bug.

The trade-offs are still real

MicroVMs reduce a lot of infrastructure work, but they do not remove architecture responsibility.

First, lifecycle policy becomes a cost control. If idle sessions stay warm too long, the bill can drift. If they suspend too aggressively, users feel resume latency. Teams should treat idle duration as a product setting, not a default copied from a sample.
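The idle trade-off can be explored with a little arithmetic over a real interaction trace. The sketch below assumes a simple model in which a session stays warm until the idle timeout elapses, then suspends and pays one resume on the next interaction. The timestamps are made up, and actual pricing and resume behavior are not modeled.

```python
from typing import List, Tuple


def idle_profile(interaction_times_s: List[float], idle_timeout_s: float) -> Tuple[float, int]:
    """Return (warm idle seconds, number of resumes) for one session trace."""
    warm_idle = 0.0
    resumes = 0
    for prev, cur in zip(interaction_times_s, interaction_times_s[1:]):
        gap = cur - prev
        if gap > idle_timeout_s:
            # The session stayed warm for the timeout, then suspended until `cur`.
            warm_idle += idle_timeout_s
            resumes += 1
        else:
            warm_idle += gap
    return warm_idle, resumes


# Hypothetical trace: seconds since session start for each user interaction.
trace = [0.0, 30.0, 90.0, 700.0, 760.0]

generous = idle_profile(trace, idle_timeout_s=120.0)   # (270.0, 1)
aggressive = idle_profile(trace, idle_timeout_s=45.0)  # (165.0, 3)
```

With this trace, the shorter timeout cuts warm idle time from 270 to 165 seconds but triples the number of resumes the user sits through. That is the product decision the timeout encodes.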

Second, snapshot-based startup changes how applications initialize. Code that generates unique state, opens long-lived external connections, or assumes initialization happens once per user action needs careful review.
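The unique-state problem is easy to reproduce in miniature. In the sketch below, copy.deepcopy stands in for restoring an environment from a snapshot. SandboxRuntime and after_restore() are hypothetical names, and whether the platform offers a hook at that point is an assumption to check against the documentation.

```python
import copy
import uuid


class SandboxRuntime:
    """Toy runtime whose initialization state gets captured in a snapshot."""

    def __init__(self) -> None:
        # Generated once at initialization, so it is baked into any snapshot.
        self.instance_id = uuid.uuid4().hex

    def after_restore(self) -> None:
        # Hypothetical hook: regenerate anything that must be unique per environment.
        self.instance_id = uuid.uuid4().hex


snapshot = SandboxRuntime()  # stands in for the pre-initialized snapshot

# Naive restore: both environments inherit the same "unique" ID.
naive_a = copy.deepcopy(snapshot)
naive_b = copy.deepcopy(snapshot)

# Restore followed by regeneration: each environment gets its own ID.
fixed_a = copy.deepcopy(snapshot)
fixed_a.after_restore()
fixed_b = copy.deepcopy(snapshot)
fixed_b.after_restore()
```

The same reasoning applies to random seeds, cached credentials, and open connections: anything created before the snapshot is shared by every environment launched from it unless it is refreshed afterward.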

Third, stateful sandboxes need cleanup rules. Temporary files, credentials, downloaded packages, generated artifacts, and logs can accumulate. Builders should define what survives during a session, what is exported, and what is destroyed.
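One way to make those rules explicit is to encode them in the code that owns the workspace. The sketch below uses a context manager that exports only named artifacts and destroys everything else. The function name, file names, and local directories are hypothetical stand-ins for whatever storage a real sandbox uses.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterable, Iterator


@contextmanager
def session_workspace(export_dir: Path, keep: Iterable[str]) -> Iterator[Path]:
    """Yield a scratch directory, export named files on success, always destroy it."""
    root = Path(tempfile.mkdtemp(prefix="session-"))
    try:
        yield root
        # Reached only if the session body finished without raising.
        for name in keep:
            src = root / name
            if src.is_file():
                shutil.copy2(src, export_dir / name)
    finally:
        # Runs on success and on failure, so credentials and temp files never linger.
        shutil.rmtree(root, ignore_errors=True)


with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp)
    with session_workspace(export_dir, keep=["result.json"]) as ws:
        (ws / "result.json").write_text("{}", encoding="utf-8")
        (ws / "token.txt").write_text("secret", encoding="utf-8")
        workspace_root = ws

    exported = sorted(p.name for p in export_dir.iterdir())
    workspace_exists = workspace_root.exists()
```

Only result.json leaves the session. The token file is destroyed with the workspace, and a session that crashes exports nothing but is still cleaned up.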

Finally, security does not stop at VM boundaries. The execution role, outbound network policy, source artifact pipeline, token handling, and tenant mapping are still part of the isolation story.

What to do next

I would start with workloads where the current workaround is obviously expensive: per-user EC2 sandboxes, over-hardened container runners, or Lambda workflows full of awkward /tmp and rehydration logic.

For a proof of concept, validate four things before celebrating:

Cold launch and resume behavior with your real image size and dependencies.
Idle cost profile for normal user behavior, not a synthetic benchmark.
Tenant boundary tests for filesystem, process, network, and IAM access.
Failure cleanup when a session crashes, times out, or is abandoned.

Lambda MicroVMs are not just another Lambda feature. They are AWS acknowledging that the next wave of serverless workloads includes interactive, stateful, sometimes untrusted execution. That is a useful primitive, as long as teams treat it as an isolation architecture rather than a shortcut around security design.