The AWS Blog

ACM ACME support turns certificate automation into a governance problem

Emiliano Montesdeoca — Tue, 30 Jun 2026 00:00:00 +0000

Certificate automation is becoming less optional. Validity periods are getting shorter, and manual renewal processes are exactly the kind of quiet operational risk that eventually becomes an outage.

AWS has now added ACME support for public certificates in AWS Certificate Manager. The headline is compatibility with ACMEv2 clients such as Certbot, cert-manager, and acme.sh. The bigger impact for builders is that certificate issuance can be standardized without moving governance outside AWS.

What changed

ACM now provides a managed ACME server endpoint for public certificates issued by Amazon Trust Services. Teams can point existing ACME clients at ACM, request certificates through the protocol they already know, and still have ACM visibility, CloudTrail logging, CloudWatch metrics, and certificate expiry notifications.

The source article also highlights two controls that matter in real organizations:

domain scopes at the endpoint level,
External Account Binding credentials for client registration.

That means a central PKI or platform team can validate domains once, restrict what clients can request, and avoid distributing DNS credentials to every application team.

Why this matters

Most certificate incidents are not caused by cryptography being hard. They are caused by ownership being unclear.

One team owns DNS. Another owns the ingress controller. Another owns the application. Someone installed a certificate manually two years ago. Nobody remembers the renewal path until browsers start showing errors.

ACME support in ACM gives builders a cleaner operating model:

Kubernetes teams can keep using cert-manager.
VM and container teams can keep using common ACME clients.
Platform teams can centralize issuance policy and audit.
Security teams can see certificates in ACM instead of chasing external CA dashboards.

That is a good trade: keep the open protocol at the edge, centralize governance at the account or organization boundary.

The architecture trade-offs

This does not remove the need for certificate lifecycle design.

First, decide who owns ACME endpoints. If every workload account creates its own endpoint with broad wildcard permissions, you have recreated the sprawl problem. A better pattern is to define endpoints per domain boundary, environment, or platform domain, then scope allowed names carefully.

Second, treat External Account Binding credentials like bootstrap secrets. They should be short-lived where possible, stored securely, and rotated when registration flows change.

Third, think about wildcard issuance. Wildcards are convenient, but they expand blast radius. Many production environments should prefer exact domains and subdomains unless there is a strong operational reason.
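To make the scoping discussion concrete, here is a minimal sketch of a pre-flight check a platform team might run before handing a name to an ACME client. It is not how ACM evaluates endpoint scopes; the function name, the scope list, and the rule that a wildcard covers exactly one extra label are assumptions made for illustration.

```python
def is_name_allowed(requested: str, allowed_scopes: list[str]) -> bool:
    """Return True if `requested` matches an exact scope or a one-label wildcard scope."""
    name = requested.lower().rstrip(".")
    for raw_scope in allowed_scopes:
        scope = raw_scope.lower().rstrip(".")
        if name == scope:
            # Exact match, including a wildcard request against an identical wildcard scope.
            return True
        if scope.startswith("*."):
            parent = scope[2:]
            if name.endswith("." + parent):
                label = name[: -(len(parent) + 1)]
                # Assumption: a wildcard covers exactly one extra label, never another wildcard.
                if label and "." not in label and "*" not in label:
                    return True
    return False


# Hypothetical scope list for one endpoint.
scopes = ["shop.example.com", "*.internal.example.com"]
print(is_name_allowed("api.internal.example.com", scopes))  # one label under the wildcard
print(is_name_allowed("a.b.internal.example.com", scopes))  # two labels under the wildcard
```

Keeping a check like this next to the endpoint definition makes the blast radius of each wildcard visible in code review.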

Finally, pricing and issuance volume matter. Moving high-churn ACME issuance into ACM should be costed before migration, especially for systems that create many short-lived certificates across tenants.

What builders should do next

If I were running a platform team, I would start with an inventory:

Which public certificates are already in ACM?
Which are issued by external ACME providers?
Which workloads use cert-manager or custom renewal scripts?
Which teams still depend on manual DNS validation?

Then I would pilot ACM ACME with one non-critical domain and one existing ACME client. The goal is not only to issue a certificate. The goal is to prove the operating model: endpoint ownership, scope policy, audit trail, renewal alarms, and incident response.

The useful part of this launch is not that Certbot can talk to AWS. It is that certificate automation can finally be treated as platform governance instead of a collection of one-off scripts.

Bedrock managed entitlements make model access a platform control

Emiliano Montesdeoca — Tue, 30 Jun 2026 00:00:00 +0000

AI model access becomes messy as soon as an organization moves beyond one account and one team. Some models are available directly. Others require AWS Marketplace subscriptions. Workload accounts need access, but broad Marketplace permissions are rarely what security teams want.

The AWS Machine Learning Blog post on managed entitlements for Amazon Bedrock models is important because it turns model access into a platform-governance problem instead of an account-by-account chore.

What changed

Managed entitlements let a central account subscribe to supported third-party Bedrock models distributed through AWS Marketplace and share access with member accounts using AWS License Manager. Workload accounts can use the model access without needing direct AWS Marketplace subscription permissions.

This is especially useful for models from providers such as Anthropic, Cohere, AI21 Labs, or Stability AI when they are distributed through Marketplace and used across many accounts.

Why builders should care

A healthy multi-account AI platform needs two things at the same time:

teams can access approved models quickly,
the organization can govern subscriptions, pricing, visibility, and permissions centrally.

Without a central entitlement pattern, every workload account becomes a small procurement and governance island. That slows adoption and creates inconsistency. With managed entitlements, a platform team can subscribe once, distribute access intentionally, and keep workload accounts away from broad Marketplace permissions.

This also helps with private offers. If pricing and terms are negotiated centrally, model access should follow that central agreement rather than being recreated account by account.

The trade-offs

Managed entitlements are not needed for every Bedrock model. Amazon models and some partner models may already be available without Marketplace subscription overhead. Single-account teams may not need this complexity.

For larger organizations, the main design work is governance:

Who approves model subscriptions?
Which accounts receive grants?
How are Regions handled?
How are private offers tracked?
How is model use monitored against budget and policy?
What is the offboarding process when a model is no longer approved?

Access distribution is only one layer. Teams still need IAM permissions, guardrails, logging, evaluation, and cost controls around actual model invocation.

What to do next

Inventory current Bedrock model usage by account. Identify which models require Marketplace subscriptions and which accounts have Marketplace permissions only because they needed model access.

Then pilot managed entitlements with one approved third-party model and a small set of workload accounts. Validate the subscription flow, grant distribution, regional behavior, billing visibility, and access removal.

The practical takeaway is that AI platforms need the same governance maturity as any other shared platform capability. Managed entitlements give AWS organizations a cleaner control point for model access.

CloudFormation Express mode is about feedback loops, not just faster deploys

Emiliano Montesdeoca — Tue, 30 Jun 2026 00:00:00 +0000

CloudFormation Express mode is easy to read as a speed announcement. The more useful way to read it is as a feedback-loop announcement.

In the AWS News Blog post, AWS explains that Express mode lets a CloudFormation operation complete once resource configuration has been applied, instead of waiting for extended stabilization checks. That can make stack operations much faster, especially during iterative infrastructure work.

That is valuable, but only if teams understand what changed: CloudFormation is not making every resource ready sooner. It is changing when the deployment returns control to you.

What changed

Standard CloudFormation waits for stabilization checks after applying resource configuration. Those checks are useful when the next action depends on the resource being operational.

Express mode skips that waiting period and lets resources continue stabilizing in the background. AWS says this can make deployments up to four times faster and highlights examples where long wait periods around resources such as Lambda network interfaces become much shorter.

No template changes are required. Builders can use Express mode through the console, AWS CLI, SDKs, CDK, and other IaC workflows.

Where it helps most

This is strongest in development and platform engineering loops where the cost of waiting is high and the risk of background stabilization is low.

Good fits include:

iterative CDK or CloudFormation template development,
AI-assisted infrastructure generation where an agent needs quick validation cycles,
sandbox environments where failed or incomplete resources can be cleaned up,
inner-loop testing of isolated infrastructure components,
deletes that currently block workflows while AWS finishes cleanup.

The bigger win is not shaving seconds from one deployment. It is reducing the friction that makes engineers avoid small infrastructure changes. Faster feedback encourages smaller changes, and smaller changes are easier to review and recover.

Where I would be careful

Express mode should not become the default answer for every production deployment.

If your pipeline shifts traffic, starts integration tests, runs database migrations, or declares a release successful immediately after CloudFormation returns, stabilization still matters. A resource can be configured but not yet ready for the next dependency in your delivery process.

The practical guardrail is simple: separate “configuration applied” from “service ready.” Express mode can own the first signal. Your pipeline still needs health checks, smoke tests, alarms, and rollback logic for the second.
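A readiness gate for that second signal can be very small. The sketch below is one possible shape, not part of CloudFormation or any AWS SDK: `wait_until_ready` and its parameters are hypothetical names, and `probe` stands for whatever health check your pipeline already trusts (an HTTP check, a test invocation, a queue depth).

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Assumption: a failing probe means "not ready yet", not a fatal error.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In a pipeline, this step would run after the stack operation returns and before traffic shifting, migrations, or integration tests. The `sleep` and `clock` parameters exist only so the gate can be tested without waiting in real time.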

Also note the rollback behavior. The source article says Express mode disables rollback by default for the fastest iteration experience, with options to re-enable rollback. That is sensible for local iteration. For production, it should be an explicit decision with cleanup and monitoring in place.

Builder checklist

Before adopting Express mode broadly, I would define three deployment profiles:

Inner loop: Express mode allowed, rollback optional, cleanup automated.
Pre-production: Express mode allowed only when smoke tests verify readiness after stack completion.
Production: Express mode allowed only for stacks where downstream steps do not assume full stabilization, or where readiness gates replace CloudFormation waiting.
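These profiles are easier to enforce when they live in code rather than in a wiki page. The following is a rough sketch under stated assumptions: the class, the profile names, and the flags are hypothetical, production is simplified to "a readiness gate must exist", and requiring rollback in production is my own choice rather than something the source article prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentProfile:
    name: str
    express_allowed: bool
    requires_readiness_gate: bool
    rollback_required: bool


# Hypothetical encoding of the three profiles described above.
PROFILES = {
    "inner-loop": DeploymentProfile("inner-loop", True, False, False),
    "pre-production": DeploymentProfile("pre-production", True, True, False),
    "production": DeploymentProfile("production", True, True, True),
}


def may_use_express(profile_name: str, has_readiness_gate: bool, rollback_enabled: bool) -> bool:
    """Decide whether a pipeline run may use Express mode under the given profile."""
    profile = PROFILES[profile_name]
    if not profile.express_allowed:
        return False
    if profile.requires_readiness_gate and not has_readiness_gate:
        return False
    if profile.rollback_required and not rollback_enabled:
        return False
    return True
```

A pipeline could call `may_use_express` before choosing deployment options and fall back to standard mode when it returns False.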

Also update runbooks. If a deployment returns faster but a resource continues stabilizing, on-call engineers need to know where to look: CloudFormation events, service-specific logs, metrics, and alarms.

The practical takeaway

CloudFormation Express mode is not a license to ignore readiness. It is a way to stop paying stabilization latency when your workflow does not need CloudFormation to wait.

Used deliberately, it can make infrastructure development feel less heavy and make AI-assisted IaC workflows more practical. Used casually, it can move waiting from CloudFormation into your users’ first request. The difference is whether your pipeline has an explicit readiness gate after the stack operation completes.

CloudFormation pre-deployment validation makes IaC failures cheaper

Emiliano Montesdeoca — Tue, 30 Jun 2026 00:00:00 +0000

The best deployment failure is the one that fails before it starts changing infrastructure.

AWS has announced that CloudFormation pre-deployment validation now runs automatically on every stack create and update operation. The same post also introduces new checks and a cdk validate workflow.

This is not as visually exciting as a new service. It is more useful than many new services.

What changed

Pre-deployment validation now runs on CloudFormation stack operations without requiring a separate setup. The source article mentions new checks for service quota limits, AWS Config Recorder conflicts, and ECR repository delete readiness. It also describes operation-level control and CDK integration through cdk validate.

For builders, this shifts some failure detection earlier in the pipeline. Instead of discovering quota or environment conflicts halfway through deployment, teams can catch them before the operation commits to a path.

Why it matters

Infrastructure failures are expensive because they consume time, create partial state, and distract teams from the real change they were trying to ship.

Quotas are a classic example. A template can be valid and still fail because the target account or Region does not have enough capacity for a resource type. A pre-deployment check that catches this earlier improves both developer experience and operational safety.

The CDK angle matters too. CDK users often think in application constructs while CloudFormation owns the final deployment. Bringing validation closer to the CDK workflow helps developers discover platform constraints before the change reaches a shared environment.

The trade-offs

Validation is not a proof of success. It can catch known classes of problems, but it cannot guarantee the application will work, the dependency will be healthy, or the rollout will be safe.

Builders still need:

template linting and policy checks,
least-privilege review,
drift detection,
integration tests,
deployment alarms,
rollback or remediation plans,
post-deploy smoke tests.

There is also a balance between strictness and velocity. If validation blocks legitimate emergency changes, teams will look for bypasses. The DisableValidation option should be treated like any other break-glass control: rare, logged, and reviewed.

What to do next

Add cdk validate or equivalent CloudFormation validation to pull requests and pre-merge pipelines where it makes sense. Then track validation failures as a signal. If many teams hit the same quota or Config issue, the platform needs a baseline fix, not repeated ticket handling.
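Tracking failures as a signal can start as a simple tally. This sketch assumes you already collect validation failures as records with a team and a check name; the record shape and the check names in the example are hypothetical, not CloudFormation output.

```python
from collections import defaultdict


def shared_validation_failures(failures: list[dict], min_teams: int = 2) -> list[tuple[str, int]]:
    """Return (check, number_of_teams) for checks that failed for at least `min_teams` teams."""
    teams_by_check = defaultdict(set)
    for failure in failures:
        teams_by_check[failure["check"]].add(failure["team"])
    shared = [(check, len(teams)) for check, teams in teams_by_check.items() if len(teams) >= min_teams]
    # Most widely shared problems first, then alphabetical for stable output.
    return sorted(shared, key=lambda item: (-item[1], item[0]))


# Hypothetical failure records gathered from pipeline logs.
failures = [
    {"team": "payments", "check": "service-quota"},
    {"team": "search", "check": "service-quota"},
    {"team": "search", "check": "config-recorder-conflict"},
]
print(shared_validation_failures(failures))
```

Anything that shows up for several teams is a candidate for a baseline fix in the platform rather than another ticket.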

For production, make validation one gate in a broader release process. It should happen before deployment, but readiness should still be proven after deployment.

The practical takeaway: pre-deployment validation makes IaC failures cheaper and faster to understand. It does not remove the need for good deployment engineering, but it moves a useful class of mistakes to the left.

Replicating S3 bucket configuration needs workflow discipline

Emiliano Montesdeoca — Tue, 30 Jun 2026 00:00:00 +0000

Replicating S3 data is only part of a multi-Region storage strategy. The bucket configuration around that data is often where drift hides.

The AWS Storage Blog post on replicating Amazon S3 bucket configurations across AWS Regions with AWS Step Functions shows an automation pattern for replaying bucket configuration into a target Region with an auditable workflow.

That is useful, but it also raises an important architecture question: should configuration replication be a workflow, or should it be infrastructure as code?

What changed

The source article describes a Step Functions and Lambda solution that creates a bucket in a target Region and applies configuration from a source bucket. It logs runs to DynamoDB and CloudWatch, which gives operators an audit trail.

This kind of workflow can help when teams need to replicate settings such as encryption, lifecycle, versioning, event notifications, tags, or other operational configuration across Regions.

Why builders should care

Disaster recovery plans often assume that a bucket in another Region is ready because replication is configured. But during a real failover, missing configuration can break applications or weaken controls.

Examples:

lifecycle rules are missing and costs grow,
event notifications do not trigger downstream processing,
encryption or bucket policy differs from the primary Region,
observability tags are absent,
access points or integration settings are inconsistent.

A repeatable replication workflow can turn those assumptions into something testable.

The trade-off with IaC

For stable environments, infrastructure as code should usually be the source of truth. If the bucket configuration is defined in CDK, CloudFormation, Terraform, or Pulumi, the cleanest replication path is often to deploy the same intent to another Region.

A workflow-based replication tool is valuable when:

buckets already exist and need operational synchronization,
configuration is discovered from a source environment,
teams need an emergency or transitional DR path,
there are many legacy buckets not yet under IaC,
operators need a controlled copy action with audit logs.

The risk is creating a second source of truth. If IaC says one thing and the replication workflow copies another, drift becomes harder to reason about.

What to do next

Before using this pattern, classify buckets into two groups:

IaC-owned buckets where configuration should be generated from code.
Operationally managed buckets where a replication workflow can reduce drift until IaC ownership exists.

Then run regular DR validation. Do not only check that the target bucket exists. Check whether the target bucket has the policies, notifications, lifecycle rules, encryption settings, tags, and observability hooks needed for the application to run.
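A DR check along these lines can be sketched as a plain comparison of two configuration snapshots. The snapshot dictionaries and key names below are assumptions; in practice they would be assembled from S3 API responses and normalized first, because values such as notification targets or policy ARNs can legitimately differ between Regions.

```python
def config_drift(source: dict, target: dict, keys: list[str]) -> dict:
    """Return the settings that differ between source and target bucket snapshots."""
    drift = {}
    for key in keys:
        source_value = source.get(key)
        target_value = target.get(key)
        if source_value != target_value:
            drift[key] = {"source": source_value, "target": target_value}
    return drift


# Hypothetical, already-normalized snapshots of two buckets.
CHECKED_KEYS = ["versioning", "encryption", "lifecycle_rules", "notifications", "tags"]
primary = {
    "versioning": "Enabled",
    "encryption": "aws:kms",
    "lifecycle_rules": ["expire-logs-90d"],
    "notifications": ["object-created"],
    "tags": {"app": "billing"},
}
secondary = {
    "versioning": "Enabled",
    "encryption": "aws:kms",
    "lifecycle_rules": [],
    "tags": {"app": "billing"},
}
print(config_drift(primary, secondary, CHECKED_KEYS))
```

An empty result is the pass condition for a DR drill; anything else is a concrete item to fix before a failover depends on it.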

The useful takeaway is that S3 resilience is not just object replication. It is configuration repeatability. Step Functions can provide a controlled workflow for that, as long as builders are clear about the source of truth.

Faster S3 access log queries make storage security more usable

Emiliano Montesdeoca — Mon, 29 Jun 2026 00:00:00 +0000

S3 access logs are valuable, but value delayed often becomes value ignored. If logs are difficult to query, teams use them only after something has already gone wrong.

The AWS Storage Blog post on querying Amazon S3 access logs instantly with CloudWatch and S3 Tables is interesting because it focuses on making access data easier to use in everyday operations.

What changed

The article walks through delivering S3 access log data to CloudWatch Logs and S3 Tables, then using those destinations for dashboards, alarms, and queries. The practical shift is from “logs exist somewhere” to “logs are queryable enough to support investigation and monitoring.”

For builders, that means S3 access patterns can become part of normal observability instead of a separate forensic workflow.

Why it matters

S3 is often where the most important data lives. Application logs, analytics exports, customer files, model artifacts, backups, and operational reports all end up in buckets. Yet many teams still monitor bucket configuration more carefully than bucket access behavior.

Queryable access logs help answer questions that matter:

Which principals are reading sensitive prefixes?
Did access spike after a deployment?
Are clients getting unexpected 403 or 404 responses?
Which workloads are driving request cost?
Did a suspicious IP enumerate objects?
Are lifecycle or replication assumptions visible in traffic?

When these questions are cheap to answer, teams ask them earlier.

The trade-offs

More logging is not free. S3 access log delivery, CloudWatch ingestion, query volume, table storage, retention, and dashboards all have cost implications. The right design depends on the bucket’s risk and traffic profile.

I would not send every low-value development bucket into a high-retention analytics pipeline. I would prioritize buckets that contain customer data, security logs, production artifacts, financial exports, backups, or AI training and retrieval data.

Also, access logs are only one layer. They should complement CloudTrail data events, IAM Access Analyzer, Macie, GuardDuty, S3 Storage Lens, and application-level audit logs. Different signals answer different questions.

What to do next

Pick a production bucket that matters and define three operational queries before building anything. For example:

top readers by prefix over the last hour,
denied requests by principal and source IP,
unusual data transfer volume compared with baseline.

Then build the logging path and dashboard around those questions. Add alarms only where the signal is actionable; noisy storage alarms quickly become invisible.
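Before committing to a query engine, it can help to prototype the questions against a handful of parsed log records. The sketch below covers the first two example queries. It assumes records have already been parsed into dictionaries; the field names are my own, and the only value taken from the S3 access log format is the `REST.GET.OBJECT` operation name.

```python
from collections import Counter


def denied_by_principal_and_ip(records: list[dict]) -> list[tuple[tuple[str, str], int]]:
    """Count 403 responses per (requester, remote_ip), most frequent first."""
    counts = Counter(
        (r["requester"], r["remote_ip"]) for r in records if r["http_status"] == 403
    )
    return counts.most_common()


def top_readers_by_prefix(records: list[dict], limit: int = 5) -> list[tuple[tuple[str, str], int]]:
    """Count successful object reads per (first-level prefix, requester)."""
    counts = Counter()
    for r in records:
        if r["operation"] != "REST.GET.OBJECT" or r["http_status"] >= 400:
            continue
        key = r["key"]
        prefix = key.split("/", 1)[0] + "/" if "/" in key else "(root)"
        counts[(prefix, r["requester"])] += 1
    return counts.most_common(limit)
```

If the prototype answers are not actionable on sample data, a dashboard built on the same questions will not be either.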

The main takeaway is that storage observability should be designed for regular use, not just incident response. Faster S3 access log queries make that much more realistic.

Lambda durable functions fit the messy middle of agent workflows

Emiliano Montesdeoca — Mon, 29 Jun 2026 00:00:00 +0000

Agent workflows are rarely a neat chain of fast API calls. They wait for humans, retry model calls, poll external systems, compensate for failed steps, and burn money when the same expensive operation runs twice.

The AWS Compute Blog post on building fault-tolerant multi-agent AI workflows with AWS Lambda durable functions is useful because it focuses on that messy middle. The healthcare prior authorization example is domain-specific, but the orchestration problem is common.

What changed

Lambda durable functions extend the Lambda programming model with checkpoint and replay operations. The source article highlights patterns such as:

context.step() for checkpointed work,
callback waits for human review or external completion signals,
condition waits for polling long-running systems,
replay behavior that skips completed durable operations,
execution metrics and status events for operational visibility.

That means a long-running workflow can pause without paying compute charges during the wait, then resume from the right point when a callback arrives or a condition changes.
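The checkpoint and replay idea is easier to see in a toy model. The class below is not the Lambda durable functions SDK, and its `step(name, fn)` signature is an assumption made for illustration; it only shows why a replay after a failure does not repeat work that already completed.

```python
class ToyDurableContext:
    """Toy stand-in for a durable execution context. NOT the Lambda SDK."""

    def __init__(self, checkpoints: dict):
        # In a real system this would be durable storage, not an in-memory dict.
        self.checkpoints = checkpoints

    def step(self, name, fn):
        if name in self.checkpoints:
            # Replay: the step already completed, so return the saved result.
            return self.checkpoints[name]
        result = fn()
        self.checkpoints[name] = result
        return result


def demo():
    calls = {"extract": 0, "review": 0}

    def extract():
        calls["extract"] += 1
        return {"fields": 3}

    def flaky_review():
        calls["review"] += 1
        if calls["review"] == 1:
            raise TimeoutError("transient failure")
        return "approved"

    def workflow(ctx):
        data = ctx.step("extract", extract)
        decision = ctx.step("review", flaky_review)
        return data, decision

    store = {}
    try:
        workflow(ToyDurableContext(store))  # first attempt fails in the review step
    except TimeoutError:
        pass
    result = workflow(ToyDurableContext(store))  # replay skips the completed extract step
    return result, calls
```

In the demo, the extract step runs once even though the workflow function runs twice, which is the property that protects expensive agent calls from being repeated.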

Why it matters for AI systems

Multi-agent workflows are expensive and non-deterministic. If an extraction agent, reasoning agent, or synthesis agent has already completed, you usually do not want a transient failure later in the flow to rerun everything from the start.

Durable checkpointing directly addresses that. It also makes human-in-the-loop patterns less awkward. A review step can suspend for hours or days without holding a running process open or building separate queue-and-database plumbing for every workflow.

This is not only an AI feature. It is orchestration for any process where work is valuable, waits are long, and retries must be controlled.

The architectural trade-off

The biggest benefit is also the thing to design carefully: workflow logic lives in code.

That can be cleaner than stitching together many small infrastructure pieces, but it requires discipline. Builders need deterministic step boundaries, idempotent external calls, clear timeout behavior, and replay-aware logging. If a step charges a credit card, submits a claim, sends an email, or updates a ticket, retries must reuse idempotency keys.
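One way to make retries reuse the same idempotency key is to derive it from values that do not change between attempts. This is a minimal sketch, assuming the workflow has a stable execution identifier and step name; the function name and the key format are hypothetical, and the downstream API must actually honor idempotency keys for this to help.

```python
import hashlib
import json


def idempotency_key(execution_id: str, step_name: str, payload: dict) -> str:
    """Derive a stable key from the execution, the step, and the request payload."""
    # Canonical JSON so that dict ordering does not change the key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    material = f"{execution_id}|{step_name}|{canonical}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()[:32]


# Hypothetical claim submission retried after a lost response.
first = idempotency_key("exec-123", "submit-claim", {"claim_id": "c-9", "amount": 120})
retry = idempotency_key("exec-123", "submit-claim", {"amount": 120, "claim_id": "c-9"})
print(first == retry)
```

The key must not include timestamps, attempt counters, or random values, because any of those would make a retry look like a new request.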

You also need to decide when Step Functions is still a better fit. If the team benefits from visual state machines, service integrations, explicit workflow definitions, or non-developer operators inspecting the flow, Step Functions remains strong. Durable functions are attractive when the orchestration is code-heavy, developer-owned, and benefits from staying close to the Lambda handler.

What builders should do next

Do not start by converting every workflow. Start with one painful process that has three properties:

expensive completed steps that should not repeat,
long waits for humans or external systems,
retry behavior that is currently implemented with custom glue.

Then design the durable function around failure cases first. What happens if an agent times out? If the human rejects the result? If the external API accepts a submission but the response is lost? If the workflow exceeds the business deadline?

The source example uses prior authorization, but the pattern applies to code review agents, document processing, procurement approvals, incident remediation, and migration assessment pipelines.

The practical takeaway: durable functions are not just about making Lambda run longer. They are about making long-running workflows resumable, observable, and cheaper to wait on.

Redshift multi-warehouse improvements reduce the analytics freshness trade-off

Emiliano Montesdeoca — Mon, 29 Jun 2026 00:00:00 +0000

Analytics platforms often fail at the boundary between ingestion and consumption. The business wants fresh data, analysts want fast dashboards, and data engineers do not want one workload starving the other.

The AWS Big Data Blog post on Amazon Redshift multi-warehouse enhancements is useful because it targets that exact tension. Redshift is adding capabilities around remote materialized views, remote table DDL, and concurrency scaling for ingestion paths.

What changed

The source article highlights several improvements:

materialized view operations can benefit from concurrency scaling,
materialized views can work more effectively across data sharing boundaries,
remote table DDL operations improve distributed warehouse management,
zero-ETL, auto-copy, and COPY workloads can use concurrency scaling.

The practical effect is that data loading, transformation, and dashboard workloads can be separated more cleanly without forcing every team into a single overloaded warehouse.

Why builders should care

Modern analytics environments are rarely one warehouse and one team. They include raw ingestion zones, curated datasets, consumer warehouses, BI dashboards, near-real-time application data, data science workspaces, and governance boundaries.

Data sharing helped separate producers and consumers. These enhancements make the separation more usable because operational work such as materialized view refreshes and ingestion can scale without stealing capacity from interactive analytics.

For builders, this reduces a common trade-off: choose fresher data and risk dashboard performance, or protect dashboards and accept slower ingestion.

The trade-offs

More warehouses and more scaling options can also mean more cost and governance complexity.

Teams should define workload classes clearly:

ingestion and copy workloads,
transformation and materialized view refresh,
interactive BI,
ad hoc analyst queries,
data science exploration,
governed consumer access.

Then assign cost ownership and performance targets to each class. Concurrency scaling is valuable, but it should not hide inefficient query design, poor distribution choices, or runaway workloads.

Remote operations also require careful permissions and change management. A consumer warehouse refreshing or building on shared data is convenient, but data contracts and ownership should remain explicit.

What to do next

Review Redshift environments where ingestion freshness and query performance are in conflict. Those are the best candidates for multi-warehouse design improvements.

Ask four questions:

Which workloads need isolation from each other?
Which data products should be shared rather than copied?
Which materialized views are critical to dashboard performance?
Which ingestion paths need concurrency scaling to avoid freshness delays?

The practical takeaway is that Redshift is making decentralized analytics architectures easier to operate. Builders should use that to create clearer workload boundaries, not just to add more capacity to the same messy warehouse.

Secure ML environments need productivity and exfiltration controls together

Emiliano Montesdeoca — Mon, 29 Jun 2026 00:00:00 +0000

Machine learning teams need access to sensitive data, but the old answer of isolated desktops and manual monitoring does not scale well. It is expensive, slow, and frustrating for the people trying to build models.

The AWS Architecture Blog post on preventing data exfiltration in machine learning environments with Amazon SageMaker AI is useful because it balances two goals that are often treated as opposites: data scientist productivity and strict exfiltration control.

What changed

The source article describes a three-layer architecture using Amazon WorkSpaces Secure Browser, SageMaker AI, VPC endpoints, DNS controls, endpoint policies, and account restrictions.

The pattern is straightforward:

data scientists access the environment through a controlled browser,
browser capabilities such as upload, download, clipboard, and printing are restricted,
access is limited to approved AWS and SageMaker domains,
SageMaker runs without direct internet egress,
VPC endpoint policies restrict access to organization-owned resources,
DNS Firewall and private endpoints reduce bypass paths.

This is not a single magic control. It is layered friction against data leaving through the browser, the network, AWS APIs, or cross-account paths.
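As an illustration of the endpoint policy layer, the sketch below builds a policy document that allows requests only to resources owned by one AWS organization, using the `aws:ResourceOrgID` condition key. It is not the policy from the source article; the function name and the organization ID are placeholders, and real environments may need extra statements for AWS-owned resources that managed services depend on.

```python
import json


def org_only_endpoint_policy(org_id: str) -> dict:
    """Build an endpoint policy that only allows access to resources in `org_id`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowOnlyOrgOwnedResources",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "*",
                "Resource": "*",
                "Condition": {"StringEquals": {"aws:ResourceOrgID": org_id}},
            }
        ],
    }


# "o-exampleorgid" is a placeholder, not a real organization ID.
print(json.dumps(org_only_endpoint_policy("o-exampleorgid"), indent=2))
```

A policy like this should be tested with real notebook and training workflows before rollout, since an overly tight condition tends to fail in places users do not expect.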

Why builders should care

ML environments are uniquely risky because they combine sensitive data, powerful compute, notebooks, terminals, package installation, model artifacts, and curious users. Blocking everything makes teams ineffective. Allowing everything creates obvious exfiltration paths.

A layered architecture gives teams a middle path. Data scientists can use managed ML tooling while the platform enforces where data can flow.

This is especially relevant for fintech, healthcare, insurance, public sector, and any organization fine-tuning or evaluating models on sensitive datasets.

The trade-offs

Every restriction has a productivity cost.

No internet access means dependency management must be planned. Package mirrors, approved repositories, model artifact flows, and update processes become platform responsibilities. URL allowlists must be maintained. Endpoint policies must be tested. Break-glass workflows must exist for legitimate exceptions.

There is also a risk of building controls that look strong but are not complete. If S3 endpoint policies block external buckets but another service path remains open, data can still move. If DNS Firewall covers only part of the environment, bypasses may remain.

What to do next

Start by mapping data egress paths, not by choosing tools. For an ML workspace, list how data could leave:

browser download,
clipboard or print,
external website upload,
S3 copy to another account,
package manager or shell command,
notebook output,
model artifact export,
DNS or HTTPS tunnel.

Then put controls around each path and test them with realistic user workflows. Involve data scientists early; controls that break normal work will be bypassed or abandoned.
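The egress map can be kept as data so that gaps are visible. In this sketch the path names come from the list above, while the function and the control names in the example mapping are hypothetical labels for whatever controls your platform actually deploys.

```python
EGRESS_PATHS = [
    "browser download",
    "clipboard or print",
    "external website upload",
    "S3 copy to another account",
    "package manager or shell command",
    "notebook output",
    "model artifact export",
    "DNS or HTTPS tunnel",
]


def uncovered_paths(controls: dict[str, list[str]]) -> list[str]:
    """Return egress paths that have no control mapped to them."""
    return [path for path in EGRESS_PATHS if not controls.get(path)]


# Hypothetical, partial mapping of paths to controls.
controls = {
    "browser download": ["secure browser restrictions"],
    "clipboard or print": ["secure browser restrictions"],
    "external website upload": ["URL allowlist"],
    "S3 copy to another account": ["VPC endpoint policy"],
    "DNS or HTTPS tunnel": ["DNS Firewall", "no internet egress"],
}
print(uncovered_paths(controls))
```

A mapped control is only a claim until it has been tested, so the same list works as the outline for an exfiltration test plan.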

The practical takeaway: secure ML platforms should not be air-gapped by default or open by convenience. They should make the safe path productive and the unsafe path difficult.

S3 Storage Lens groups make storage cost conversations less generic

Emiliano Montesdeoca — Fri, 26 Jun 2026 00:00:00 +0000

Storage optimization advice often starts too broadly: reduce old data, review access patterns, apply lifecycle policies. That is true, but it is not specific enough for a team to act.

The AWS Storage Blog post on S3 Storage Lens groups is useful because it shows how to group storage by workload-specific criteria instead of looking only at account or bucket totals.

What changed

S3 Storage Lens groups let builders define custom groupings and analyze metrics for targeted slices of S3 data. The source article uses examples such as older application logs and aging image files across multiple buckets and accounts.

That matters because the useful unit of storage ownership is rarely just a bucket. It may be an application, dataset, tenant, media type, compliance class, or product feature.

Why builders should care

S3 cost and hygiene problems are usually ownership problems.

A central cloud team can see that storage is growing. They may not know which logs are safe to expire, which images are user-generated content, which datasets are legally retained, or which prefixes belong to an old migration. Application teams know the context, but they often lack cross-account visibility.

Storage Lens groups can bridge that gap. They help create a view that says, “this workload has 40 TB of logs older than 180 days,” not just “this account has 400 TB in S3.”

That turns a generic optimization request into a practical backlog item.
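Storage Lens groups do this aggregation as a managed feature, but the underlying filter is simple enough to sketch. The example below assumes a hypothetical list of object records (key, size, last-modified time), such as rows from an inventory export, and sums the bytes under a prefix that are older than a threshold.

```python
from datetime import datetime, timedelta, timezone


def bytes_older_than(objects: list[dict], prefix: str, min_age_days: int, now: datetime) -> int:
    """Sum the sizes of objects under `prefix` last modified more than `min_age_days` ago."""
    cutoff = now - timedelta(days=min_age_days)
    return sum(
        obj["size"]
        for obj in objects
        if obj["key"].startswith(prefix) and obj["last_modified"] < cutoff
    )


# Hypothetical object records for one workload.
now = datetime(2026, 6, 26, tzinfo=timezone.utc)
objects = [
    {"key": "logs/app/a.gz", "size": 100, "last_modified": now - timedelta(days=200)},
    {"key": "logs/app/b.gz", "size": 50, "last_modified": now - timedelta(days=10)},
    {"key": "images/c.png", "size": 900, "last_modified": now - timedelta(days=400)},
]
print(bytes_older_than(objects, "logs/", 180, now))
```

The interesting design work is not the sum. It is agreeing with the owning team on which prefix and which age threshold describe data they are willing to act on.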

The trade-offs

Custom groups are only useful if the grouping logic reflects real ownership and lifecycle rules. If tags, prefixes, naming conventions, or account boundaries are inconsistent, the dashboard can give false confidence.

Teams should avoid building too many groups at once. Start with questions that lead to action:

Which data can move to a colder tier?
Which prefixes have no recent access?
Which workloads are growing faster than expected?
Which buckets contain small-object patterns that inflate request costs?
Which teams own the largest retained datasets?

Also remember that visibility is not enforcement. Storage Lens can show the opportunity, but lifecycle rules, retention policies, and application changes implement the fix.

What to do next

Pick one high-cost storage area and define a group around the way the business thinks about it. For example: production application logs older than 90 days, media originals by product, or export datasets by tenant.

Then review the metrics with the owning team and decide on one action: lifecycle transition, deletion after retention, object compaction, prefix redesign, or access pattern review.
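As a minimal sketch of one of the example groups above (production application logs older than 90 days), the helper below builds a group definition as a plain dictionary. The field names (`Filter`, `And`, `MatchAnyPrefix`, `MatchObjectAge`, `DaysGreaterThan`) follow my understanding of the S3 Control `CreateStorageLensGroup` request shape and should be checked against the current API reference. The group name and prefixes are hypothetical.

```python
from typing import Any


def build_log_age_group(name: str, prefixes: list[str], older_than_days: int) -> dict[str, Any]:
    """Build a Storage Lens group definition for objects under the given
    prefixes that are older than a threshold.

    The dictionary layout is an assumption based on the S3 Control API;
    nothing here calls AWS.
    """
    if older_than_days <= 0:
        raise ValueError("older_than_days must be positive")
    if not prefixes:
        raise ValueError("at least one prefix is required")
    return {
        "Name": name,
        "Filter": {
            # "And" means an object must satisfy every nested condition.
            "And": {
                "MatchAnyPrefix": list(prefixes),
                "MatchObjectAge": {"DaysGreaterThan": older_than_days},
            }
        },
    }


# Hypothetical name and prefixes, for illustration only.
group = build_log_age_group(
    "prod-app-logs-older-than-90d", ["logs/app/", "logs/audit/"], 90
)
print(group)
```

If the shape is right, the dictionary could be passed as the `StorageLensGroup` argument of the boto3 `s3control` client's `create_storage_lens_group` call. Keeping the definition in code also makes the grouping logic reviewable by the team that owns the data.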

The practical value of Storage Lens groups is not better charts. It is better conversations between platform, finance, security, and application teams about the specific data that is driving cost and risk.

Running pgvector on Aurora is a production operations decision

Emiliano Montesdeoca — Thu, 25 Jun 2026 00:00:00 +0000

It is easy to prototype vector search. It is harder to operate it after users, documents, embeddings, and retrieval patterns start changing every day.

The AWS Database Blog post on running pgvector in production on Amazon Aurora PostgreSQL is useful because it moves the conversation away from “can I store embeddings?” and toward “can I keep this retrieval system healthy?”

What changed

The source article covers operational practices for pgvector workloads on Aurora PostgreSQL: choosing index types and distance functions, managing HNSW behavior, using quantization and partitioning, sizing memory, and monitoring the signals that show when the vector store is drifting out of shape.

That is the right level of discussion for production RAG systems. The database is not just a place to put vectors. It is part of the user-facing latency, relevance, and cost profile.

Why builders should care

Aurora PostgreSQL with pgvector is attractive because many teams already understand PostgreSQL. They can keep relational data, metadata filters, access patterns, and embeddings close together. That reduces architecture sprawl for early and mid-sized AI applications.

But familiarity can hide risk. Vector indexes have different maintenance behavior than normal B-tree indexes. Embedding dimensions affect memory. Update and delete patterns can degrade index quality. Query filters can change recall and latency. The database may need to serve both transactional and retrieval traffic.

If you treat pgvector like a small column type, production will teach you otherwise.

The trade-offs

The main decision is managed abstraction versus self-managed control.

Aurora PostgreSQL with pgvector gives control over schema, SQL, transactions, and tuning. That is valuable when retrieval is tightly coupled to application data. Amazon Bedrock Knowledge Bases or other managed retrieval systems reduce operational burden, which can be better when the team does not need direct database-level control.

There is no universal winner. Choose pgvector on Aurora when PostgreSQL integration is a real product advantage. Choose a more managed path when the team mostly wants ingestion, embedding, retrieval, and scaling handled for them.

What to do next

Before putting pgvector-backed retrieval into production, define operational checks:

index type and distance metric per use case,
expected vector count and growth rate,
memory needed to keep hot indexes healthy,
update and deletion behavior,
query latency percentiles under realistic filters,
recall evaluation for representative prompts,
vacuum and maintenance expectations,
fallback behavior when retrieval fails or gets slow.
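The recall check in the list above can be sketched without any database at all. The idea is to compare what the approximate index returned against the exact nearest neighbors for the same query (for example, from a scan that bypasses the index). The function names and sample IDs below are illustrative, not taken from the source article.

```python
def recall_at_k(approx_ids: list, exact_ids: list, k: int) -> float:
    """Fraction of the exact top-k neighbors that the approximate search returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids must contain at least one result")
    hits = len(truth.intersection(approx_ids[:k]))
    return hits / len(truth)


def mean_recall(pairs: list, k: int) -> float:
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per query."""
    if not pairs:
        raise ValueError("at least one query is required")
    scores = [recall_at_k(approx, exact, k) for approx, exact in pairs]
    return sum(scores) / len(scores)


# Hypothetical document IDs for two representative queries.
queries = [
    ([11, 12, 13, 99], [11, 12, 13, 14]),  # index missed one true neighbor
    ([21, 22, 23, 24], [21, 22, 23, 24]),  # index matched the exact result
]
print(mean_recall(queries, k=4))
```

Tracking this number per use case, and again after large update or delete batches, gives an early signal that index quality is drifting.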

Also separate prototype metrics from production metrics. A demo with 10,000 documents says little about a system with millions of vectors, concurrent users, and evolving embeddings.
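A rough sizing calculation shows why the jump from demo to production matters. The sketch below assumes single-precision vectors at 4 bytes per dimension plus a small per-vector header, which matches my reading of the pgvector documentation for the `vector` type. The 1536-dimension figure is only an example, and index structures such as HNSW graphs add overhead that is not modeled here.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int,
                     bytes_per_dim: int = 4, header_bytes: int = 8) -> int:
    """Estimate raw storage for the vector column only.

    Assumes 4 bytes per dimension plus an 8-byte header per vector.
    Index overhead, row headers, and other columns are excluded.
    """
    if num_vectors < 0 or dimensions <= 0:
        raise ValueError("num_vectors must be >= 0 and dimensions must be > 0")
    return num_vectors * (dimensions * bytes_per_dim + header_bytes)


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes."""
    return num_bytes / 2**30


# Example sizes: a demo corpus versus a hypothetical production corpus.
for count in (10_000, 5_000_000):
    size = raw_vector_bytes(count, dimensions=1536)
    print(f"{count:>10,} vectors x 1536 dims ~ {to_gib(size):.2f} GiB raw")
```

Under these assumptions the demo fits in tens of megabytes, while the larger corpus needs tens of gibibytes before any index is built, which is the point at which instance memory sizing becomes a real decision.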

The practical takeaway is simple: pgvector on Aurora can be a strong architecture choice, but only if the team is ready to operate vector search as a database workload, not as a model configuration checkbox.

AWS Transform makes migration assessments more conversational, but data quality still wins

Emiliano Montesdeoca — Wed, 24 Jun 2026 00:00:00 +0000

Migration assessments are rarely wrong because the spreadsheet formula is hard. They are wrong because the inventory is incomplete, assumptions are stale, and the business case changes every time a stakeholder adds context.

The AWS Migration & Modernization Blog post on accelerating migration assessments and planning with AWS Transform is interesting because it treats assessment as an interactive workflow instead of a static report.

What changed

AWS Transform assessments can ingest inventory data from common discovery sources, generate a TCO business case, recommend AWS targets, and then let teams refine assumptions through chat. The source article shows examples such as adding missing servers, adjusting on-premises costs, modeling non-EC2 services, and exploring what-if scenarios.

That changes the assessment loop. Instead of recollecting data and regenerating a report for every new question, teams can refine the model conversationally.

Why builders should care

Migration planning is full of partial information. A CMDB misses servers. RVTools exports do not capture every cost. Licensing assumptions change. Some workloads are heading to managed services, not EC2. Finance wants a different depreciation model. Application owners remember a dependency late in the process.

An interactive assessment tool can keep momentum when those changes happen. That is valuable during early planning, executive conversations, and portfolio prioritization.

But the output is only as good as the inputs and assumptions.

The trade-offs

Conversational planning can make weak assumptions feel more authoritative than they are. A polished business case is still a model.

Builders and migration leads should keep three things explicit:

Data source quality: where inventory came from, when it was collected, and what is missing.
Assumptions: utilization, licensing, storage growth, network costs, support model, and target services.
Decision status: directional estimate, planning baseline, or approved migration wave.

AWS Transform can help generate and adjust the assessment, but it should not replace stakeholder validation. Application owners, finance, security, and operations still need to review the model.

What to do next

Use AWS Transform early, but label the results honestly. Start with directional estimates, then improve confidence as discovery data gets better.

A practical workflow:

Import existing inventory instead of waiting for perfect data.
Generate a first-pass business case.
Use chat to add known missing costs and workloads.
Review assumptions with finance and platform teams.
Convert high-confidence groups into migration waves.
Reassess after each wave to improve the next one.

The useful takeaway is that migration assessments can become living documents. That is a big improvement over static PDFs, as long as teams keep data quality and assumption ownership visible.

OpenSearch Serverless next generation changes the economics of tenant isolation

Emiliano Montesdeoca — Wed, 24 Jun 2026 00:00:00 +0000

Multi-tenant search has always forced an uncomfortable trade-off: isolate tenants cleanly and pay for too much infrastructure, or pool tenants together and accept more operational and security complexity.

The AWS Big Data Blog post on multi-tenant search with Amazon OpenSearch Serverless next generation is important because it changes that cost model. Scale-to-zero compute and regional endpoint routing make collection-per-tenant designs more realistic.

What changed

The OpenSearch Serverless next-generation architecture lets collection groups scale compute to zero, while storage charges still apply. AWS also added a per-account, regional endpoint that can route requests to collections using headers such as x-amz-aoss-collection-name.

That means a SaaS application can keep a cleaner collection-per-tenant model without managing a separate endpoint and connection pool for every tenant.

Why builders should care

Tenant isolation is easier to discuss in architecture reviews than to pay for in production.

A collection-per-tenant design gives strong isolation boundaries for data, workload behavior, encryption, lifecycle, and noisy-neighbor risk. But if each tenant carries a minimum always-on compute cost, the design breaks down for long-tail tenants.

Scale-to-zero compute makes the model more practical for SaaS platforms with many tenants that search occasionally. The regional endpoint also simplifies application routing. Instead of maintaining many endpoints, the application can target one endpoint and route by collection header.

The trade-offs

Collection-per-tenant is not automatically the best design.

For very small tenants with similar access patterns, pooled indexes may still be simpler. For very large tenants, dedicated collections or even separate domains may be appropriate. For regulated tenants, encryption, access policy, and audit boundaries may matter more than cost.

Builders should also design the tenant mapping layer carefully. A regional endpoint reduces endpoint sprawl, but the application still needs a reliable mapping from tenant ID to collection name or ID. That mapping becomes part of the security boundary.
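A minimal sketch of such a mapping layer is shown below. It resolves a tenant ID to a collection name and produces the routing header mentioned in the article. The class, exception, and tenant names are hypothetical, and a real implementation would load the mapping from a trusted store and take the tenant ID from the authenticated session, never from request input.

```python
# Header name as quoted in the source article.
COLLECTION_HEADER = "x-amz-aoss-collection-name"


class UnknownTenantError(LookupError):
    """Raised when a tenant has no collection mapping."""


class TenantRouter:
    """Maps tenant IDs to collection names and fails closed on unknown tenants."""

    def __init__(self, mapping: dict[str, str]) -> None:
        # Copy so later changes to the caller's dict do not alter routing.
        self._mapping = dict(mapping)

    def headers_for(self, tenant_id: str) -> dict[str, str]:
        """Return the routing header for a tenant, or raise if unmapped."""
        try:
            collection = self._mapping[tenant_id]
        except KeyError:
            # No default collection: an unmapped tenant must never be
            # routed to another tenant's data.
            raise UnknownTenantError(tenant_id) from None
        return {COLLECTION_HEADER: collection}


# Hypothetical tenants and collection names.
router = TenantRouter({"tenant-a": "search-tenant-a", "tenant-b": "search-tenant-b"})
print(router.headers_for("tenant-a"))
```

The important design choice is the absence of a fallback. Because the mapping is part of the security boundary, a lookup miss should be an error, not a route to a shared collection.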

Operational questions remain:

How are collections created and deleted?
What happens when a dormant tenant becomes active?
How are per-tenant quotas enforced?
How are index templates and mappings rolled out safely?
How are tenant-level costs reported?

What to do next

For SaaS search workloads, revisit the tenancy model. Compare pooled, collection-per-tenant, and hybrid approaches using real tenant distribution, not an average tenant.

A practical path is:

Put high-volume tenants in dedicated collections.
Keep small tenants pooled or grouped if isolation requirements allow it.
Use collection-per-tenant where security, compliance, or noisy-neighbor risk justifies the boundary.
Automate collection lifecycle and mapping changes from the start.

The takeaway is that OpenSearch Serverless is making stronger isolation less expensive. That does not remove design work, but it gives builders more room to choose isolation for good reasons instead of avoiding it because of minimum compute cost.

Restricting AWS Console access by network is a useful perimeter, not a complete identity strategy

Emiliano Montesdeoca — Wed, 24 Jun 2026 00:00:00 +0000

Identity controls usually answer who can sign in. Network-aware console controls help answer where that sign-in is allowed to happen.

The AWS Security Blog post on restricting AWS Management Console access to expected networks with sign-in resource-based policies and RCPs is important for organizations building data perimeters. It gives security teams another way to reduce the usefulness of stolen credentials.

What changed

AWS now supports patterns where Management Console sign-in can be restricted based on expected networks using sign-in resource-based policies and AWS Organizations resource control policies. The source article walks through a financial services scenario, CloudTrail verification, Console Private Access integration, and data perimeter alignment.

The builder impact is straightforward: console access can be constrained closer to the network boundary, not only by identity and MFA.

Why it matters

A compromised credential should not work equally well from anywhere on the internet. Network restrictions are not perfect, but they raise the bar and reduce the blast radius of many common incidents.

This is especially relevant for:

regulated environments,
privileged administration accounts,
break-glass access patterns,
organizations using centralized corporate egress,
teams implementing broader AWS data perimeter controls.

When combined with IAM Identity Center, MFA, least privilege, CloudTrail, and anomaly detection, expected-network policies add another useful checkpoint.

The trade-offs

Network conditions change. Employees travel, VPNs fail, disaster recovery procedures happen, and incident responders may need access from unusual locations. If console network restrictions are too rigid, they can become an availability risk during exactly the moment when access matters most.

I would treat this as a tiered control:

strict policies for highly privileged production administration,
documented exceptions for break-glass roles,
tested access paths for incident response,
monitoring for denied attempts,
clear ownership of allowed network ranges.

Also, do not confuse console restrictions with API restrictions. Many automation paths use AWS APIs, not the web console. A complete perimeter strategy must cover both human console access and programmatic access patterns.

What to do next

Start with privileged accounts and roles, not every user at once. Define the expected networks, verify them with CloudTrail, and test both allowed and denied sign-in paths.

Then rehearse failure cases:

corporate VPN unavailable,
emergency access required,
identity provider outage,
support engineer working from a different location,
network range changed but policy not updated.
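Before enforcing anything, it helps to check what the expected-network list would have blocked. The sketch below scans CloudTrail-style records for `ConsoleLogin` events whose `sourceIPAddress` falls outside a set of CIDR ranges. The record fields follow the CloudTrail event format as I understand it. The CIDR ranges and user names are placeholders from documentation address space, and loading real events is left out.

```python
import ipaddress


def is_expected(source_ip: str, networks: list) -> bool:
    """Return True if source_ip parses as an address inside any expected network."""
    try:
        address = ipaddress.ip_address(source_ip)
    except ValueError:
        # Some records carry a service name instead of an IP address.
        return False
    return any(address in network for network in networks)


def unexpected_console_logins(events: list[dict], expected_cidrs: list[str]) -> list[dict]:
    """Filter ConsoleLogin events that originate outside the expected networks."""
    networks = [ipaddress.ip_network(cidr) for cidr in expected_cidrs]
    return [
        event for event in events
        if event.get("eventName") == "ConsoleLogin"
        and not is_expected(event.get("sourceIPAddress", ""), networks)
    ]


# Placeholder ranges and events, for illustration only.
EXPECTED = ["203.0.113.0/24", "198.51.100.0/25"]
sample_events = [
    {"eventName": "ConsoleLogin", "sourceIPAddress": "203.0.113.17", "user": "alice"},
    {"eventName": "ConsoleLogin", "sourceIPAddress": "192.0.2.44", "user": "bob"},
    {"eventName": "AssumeRole", "sourceIPAddress": "192.0.2.44", "user": "ci"},
]
for event in unexpected_console_logins(sample_events, EXPECTED):
    print(event["user"], event["sourceIPAddress"])
```

Running a check like this over a few weeks of history is a cheap way to find legitimate access paths that the allowed ranges would miss, including the stale-range case from the list above.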

The practical takeaway is that network-aware console access is a strong additional guardrail. It should complement identity, not replace it. Used carefully, it makes stolen credentials less useful without making legitimate operations fragile.

S3 Files makes Lambda file workflows simpler, but not automatically better

Emiliano Montesdeoca — Wed, 24 Jun 2026 00:00:00 +0000

A lot of Lambda code that works with S3 is not complicated because the business logic is complicated. It is complicated because the function has to download an object, manage /tmp, process the file, upload the result, and clean up after itself.

The AWS Compute Blog post on modernizing Lambda and S3 workloads with Amazon S3 Files shows a different model: mount an S3-backed file system and let the function use normal file paths.

That sounds small. For many workloads, it changes the shape of the code.

What changed

S3 Files lets a Lambda function mount an S3 bucket as a file system at a path such as /mnt/data. Code can open, read, write, list, and organize files using filesystem operations while S3 remains the durable backing store.

The source article uses examples such as image processing, ETL pipelines, and multi-agent shared workspaces. In each case, the function moves away from explicit S3 transfer code and toward direct file I/O.

That removes a surprising amount of glue:

no manual download before processing,
no upload step after writing output,
less /tmp capacity management,
fewer cleanup paths for failed invocations,
simpler handoffs between functions that share a workspace.

Why builders should care

The strongest use case is modernization of existing file-oriented libraries. Many image, document, ML, and data-processing libraries expect file paths. Without S3 Files, Lambda code often has to adapt object storage into temporary local files before real work can start.

S3 Files lets builders keep file-based code and still use S3 as the storage layer. That can make Lambda more attractive for workloads that previously moved to containers only because file handling became awkward.

The shared workspace pattern is also interesting for AI agents. If multiple Lambda functions collaborate on a session, a directory tree can be easier to reason about than a collection of object keys and serialized state blobs.

The trade-offs

I would not replace every S3 API call with file I/O.

Object APIs are still excellent when you want explicit object boundaries, event-driven processing, presigned URLs, lifecycle policies, replication, and direct control over metadata. File systems are easier for some code, but they can also hide transfer and consistency behavior behind familiar syntax.

Builders should validate:

throughput for large files,
behavior under high concurrency,
consistency expectations between producers and consumers,
VPC and mount target requirements,
IAM permissions for mount and write operations,
failure behavior when a function exits while writing.

Also remember that simpler code is not the same as simpler operations. If many functions share a writable workspace, you need naming conventions, cleanup policies, and safeguards against accidental overwrite.

What to do next

Look for Lambda functions where more code is spent on S3 transfers than on the actual business logic. Image resizing, report generation, data conversion, document processing, and agent workspaces are good candidates.

For each candidate, compare two versions:

current S3 API flow with /tmp,
S3 Files flow with mounted paths.

Measure duration, memory, error handling complexity, and cost. The winning design may be different per workload.
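To make the comparison concrete, here is a minimal sketch of both versions around the same stand-in business logic. The `s3` argument is assumed to behave like a boto3 S3 client (`download_file` and `upload_file`), and `mount_root` stands for wherever the S3-backed file system is mounted, such as the /mnt/data path mentioned earlier. The `transform` function is a placeholder, not code from the source article.

```python
import tempfile
from pathlib import Path


def transform(src: Path, dst: Path) -> None:
    """Placeholder for file-based business logic: uppercase a text file."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(src.read_text().upper())


def handle_with_s3_api(s3, bucket: str, key: str, out_key: str) -> None:
    """Version 1: download to scratch space, process, upload, clean up."""
    # TemporaryDirectory uses the platform temp dir (typically /tmp on Lambda)
    # and removes the scratch files even if processing raises.
    with tempfile.TemporaryDirectory() as scratch:
        src = Path(scratch) / "input"
        dst = Path(scratch) / "output"
        s3.download_file(bucket, key, str(src))
        transform(src, dst)
        s3.upload_file(str(dst), bucket, out_key)


def handle_with_mount(mount_root: Path, key: str, out_key: str) -> None:
    """Version 2: read and write directly through the mounted path."""
    transform(mount_root / key, mount_root / out_key)
```

The second version is shorter, but the transfer work has not disappeared. It has moved below the file system interface, which is why measuring both versions matters more than counting lines.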

The useful takeaway is not that file systems are better than object APIs. It is that Lambda now has a cleaner option when the workload is naturally file-shaped.

EKS Auto Mode improvements show why managed Kubernetes is becoming operational engineering

Emiliano Montesdeoca — Tue, 23 Jun 2026 00:00:00 +0000

The latest EKS Auto Mode update is not one big feature. It is a collection of operational improvements: faster node startup, better Karpenter behavior, node-local DNS, smoother EBS migration, and more networking options.

That is exactly why it matters.

In the AWS Containers Blog post, AWS describes improvements across runtime, compute, storage, and networking. For builders, the lesson is that managed Kubernetes value increasingly comes from reducing the number of sharp edges you have to own yourself.

What changed

The source article lists several practical changes:

node ready latency reduced through faster startup detection,
Karpenter scale-out and consolidation improvements,
zram protection for transient system memory pressure,
faster image pulls for large ML and GPU images,
node-local DNS by default in Auto Mode,
support for separate pod subnets and pod security groups,
improved EBS migration and topology-aware volume scheduling.

None of these changes remove the need to understand Kubernetes. They reduce the tax of operating the infrastructure layer below your workloads.

Why builders should care

Most Kubernetes incidents are not about Kubernetes being unavailable. They are about the cluster being technically alive while applications wait for capacity, DNS, storage, or networking to catch up.

Faster node startup helps bursty workloads. Better consolidation helps cost. Node-local DNS reduces a common hidden bottleneck. Separate pod subnets and security groups make enterprise network patterns easier to express without abandoning Auto Mode defaults.

For platform teams, this changes the migration conversation. Auto Mode is not only about “less to manage.” It is about whether AWS-managed operational improvements arrive faster than your team could safely build and maintain them.

The trade-offs

There are still boundaries.

Auto Mode can improve node lifecycle and system components, but it cannot fix bad pod requests, missing disruption budgets, slow application startup, or chatty service dependencies. If workloads request too much CPU, do not define readiness probes, or depend on large cold images, managed infrastructure can only help so much.

Cost also needs attention. Faster scale-out is useful, but it can surface inefficient autoscaling policies. Faster consolidation is useful, but only if workloads tolerate disruption and your budgets are modeled correctly.

Networking improvements should also be treated as architecture choices. Separate pod subnets and security groups can improve segmentation, but they introduce more routes, policies, IP planning, and troubleshooting paths.

What to do next

If you run EKS Auto Mode today, measure before and after. Look at:

pending pod seconds during scale events,
node ready time,
image pull latency for large containers,
DNS latency and CoreDNS saturation,
consolidation events and interruption impact,
cross-AZ and NAT gateway traffic.
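The first metric in that list is easy to compute once you have two timestamps per pod: when it was created and when it was scheduled. In a real cluster those would come from the pod's creation timestamp and the transition time of its PodScheduled condition. The sketch below uses simplified, hypothetical records with `created` and `scheduled` keys.

```python
from datetime import datetime


def parse_k8s_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as 2026-06-23T10:00:00Z."""
    # Older Python versions reject a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pending_seconds(pod: dict) -> float:
    """Seconds a pod waited between creation and scheduling."""
    created = parse_k8s_time(pod["created"])
    scheduled = parse_k8s_time(pod["scheduled"])
    return max(0.0, (scheduled - created).total_seconds())


def summarize(pods: list[dict]) -> dict:
    """Total and worst-case pending time for one scale event."""
    waits = [pending_seconds(pod) for pod in pods]
    return {
        "total_pending_seconds": sum(waits),
        "worst_pod_seconds": max(waits, default=0.0),
        "pods": len(waits),
    }


# Hypothetical scale event with two pods.
sample_pods = [
    {"created": "2026-06-23T10:00:00Z", "scheduled": "2026-06-23T10:00:45Z"},
    {"created": "2026-06-23T10:00:10Z", "scheduled": "2026-06-23T10:01:40Z"},
]
print(summarize(sample_pods))
```

Comparing the same summary for similar scale events before and after a change gives a more honest picture than node counts alone.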

If you are migrating to Auto Mode, do it workload by workload. Start with stateless services that have clean readiness probes, pod disruption budgets, and known resource requests. Then move stateful or network-sensitive workloads after validating storage topology and security group behavior.

The best outcome is not simply fewer Kubernetes knobs. It is a platform where the knobs that remain are closer to application reliability, cost, and security decisions. That is where builders should spend their time.

EKS control plane egress through your VPC closes a real private-cluster gap

Emiliano Montesdeoca — Mon, 22 Jun 2026 00:00:00 +0000

Private EKS clusters have always had two sides: how clients and nodes reach the API server, and how the API server reaches things on behalf of Kubernetes. The first side was easier to reason about. The second side had awkward edges.

AWS has announced customer-routed control plane egress for Amazon EKS, which routes customer-controllable Kubernetes API server outbound traffic through your VPC. That includes admission webhook callbacks, OIDC discovery, and aggregated API server requests.

This is a practical feature for teams that need private webhooks, private identity providers, and auditable network paths.

What changed

With the new controlPlaneEgressMode set to CUSTOMER_ROUTED, the Kubernetes API server uses an elastic network interface in your VPC for specific outbound flows. Those flows can then follow your routing, security groups, VPC endpoints, DNS, Network Firewall, PrivateLink, Direct Connect, and logging patterns.

EKS-managed service traffic still uses the EKS-managed path. The feature is scoped to customer-controllable API server egress, not every packet from the control plane.

One important detail: after a cluster uses CUSTOMER_ROUTED, the setting is immutable for the life of the cluster. That makes planning more important than experimentation on a random production cluster.

Why it matters

Admission webhooks are often part of the security boundary. They validate images, enforce labels, inject sidecars, block risky configurations, and integrate with policy engines. If the API server can only call a public endpoint, teams end up exposing services that they would rather keep private.

The same issue appears with external OIDC identity providers. If the control plane must fetch discovery documents and JWKS over an internet-reachable path, the cluster is not as private as the architecture diagram suggests.

Customer-routed egress makes a cleaner design possible:

webhook services can live behind internal load balancers,
private DNS can resolve API server dependencies,
VPC Flow Logs can show the path,
SCPs can enforce the required egress mode across accounts,
network teams can apply existing inspection and routing controls.

For regulated environments, the value is not only connectivity. It is evidence.

Design considerations

This feature moves responsibility to your network design. If DNS, routes, endpoint policies, or security groups are wrong, API server calls can fail. That can break admission control, identity association, or aggregated APIs.

I would pay attention to four areas:

DNS resolution. The API server now depends on the DNS path available from your VPC configuration for those customer-controllable names.
Webhook availability. A private webhook outage can become a cluster-wide admission outage if failure policies are strict.
Certificate trust. Private does not always mean privately trusted. The source article notes that OIDC issuer certificates still need a publicly trusted chain.
Cost and routing. NAT gateways, cross-AZ paths, inspection appliances, and endpoints can add cost or latency if the path is not designed deliberately.

What to do next

Start by identifying clusters that already use admission webhooks, external OIDC providers, or aggregated API servers. Those clusters have the most to gain.

For new clusters, decide whether CUSTOMER_ROUTED should be part of the baseline. For existing clusters, test in a non-production environment with the same webhook and identity dependencies before updating anything important.

Then build a failure test. Block the webhook endpoint, break DNS resolution, and confirm your cluster behavior matches your expectations. Network privacy is useful only if the failure modes are understood.
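A useful input to that failure test is a list of webhooks that will reject requests when their endpoint is unreachable. The sketch below walks a webhook configuration, represented as a dictionary shaped like a Kubernetes ValidatingWebhookConfiguration or MutatingWebhookConfiguration, and reports entries whose failure policy is Fail. The defaults used (Fail, 10-second timeout) reflect my understanding of the admissionregistration.k8s.io/v1 API. The webhook names and URL are hypothetical.

```python
def strict_webhooks(config: dict) -> list[dict]:
    """List webhooks that block admission when their endpoint cannot be reached."""
    findings = []
    for hook in config.get("webhooks", []):
        # In the v1 API an omitted failurePolicy is treated as Fail.
        if hook.get("failurePolicy", "Fail") != "Fail":
            continue
        client = hook.get("clientConfig", {})
        service = client.get("service")
        if client.get("url"):
            target = client["url"]
        elif service:
            target = f"service {service.get('namespace')}/{service.get('name')}"
        else:
            target = "unknown"
        findings.append({
            "name": hook.get("name"),
            "target": target,
            "timeoutSeconds": hook.get("timeoutSeconds", 10),
        })
    return findings


# Hypothetical configuration with three webhooks.
sample_config = {
    "webhooks": [
        {"name": "images.policy.example", "failurePolicy": "Fail", "timeoutSeconds": 5,
         "clientConfig": {"url": "https://policy.internal.example/validate"}},
        {"name": "labels.policy.example", "failurePolicy": "Ignore",
         "clientConfig": {"service": {"namespace": "policy", "name": "labels"}}},
        {"name": "sidecar.mesh.example",
         "clientConfig": {"service": {"namespace": "mesh", "name": "injector"}}},
    ]
}
for finding in strict_webhooks(sample_config):
    print(finding)
```

Each reported target is a dependency whose DNS name, route, and security group now sit on the admission path, so those are the endpoints to block first in a rehearsal.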

This EKS change does not make private Kubernetes automatic, but it removes a real architectural compromise. Builders now have a better way to align the control plane’s outbound path with the same network rules they already apply to workloads.

Lambda MicroVMs make isolated sandboxes a serverless design choice

Emiliano Montesdeoca — Mon, 22 Jun 2026 00:00:00 +0000

AWS Lambda MicroVMs are interesting because they do not try to replace normal Lambda functions. They fill a different gap: workloads where the unit of isolation is not an event, but a user session, coding environment, agent run, scanner job, or other stateful sandbox.

The AWS announcement frames this around isolated sandboxes with full lifecycle control. That is the right framing. The practical value is not only that Firecracker provides VM-level isolation. It is that AWS is exposing a managed way to create, pause, resume, and retire those environments without asking every product team to become a virtualization platform team.

What changed

Lambda MicroVMs add a serverless compute primitive inside the Lambda family for running code in isolated, stateful execution environments. The source article describes several important properties:

each session can run in its own Firecracker-backed MicroVM,
environments can launch and resume from pre-initialized snapshots,
memory, disk, and running process state can survive during the session,
idle environments can be suspended by policy,
applications get a dedicated endpoint and short-lived request authentication.

That combination matters for applications that cannot fit cleanly into stateless request-response functions. A code interpreter, browser automation sandbox, vulnerability scanner, AI coding assistant, data notebook, or game scripting environment often needs process state and filesystem state between interactions.

Why builders should care

The old decision tree was uncomfortable. Virtual machines gave strong isolation but came with slow startup and more operational work. Containers started quickly but shared a kernel, which made it harder to run untrusted code safely. Lambda functions were operationally simple but not designed for long-running interactive state.

Lambda MicroVMs create a new middle path. For builders, the design conversation becomes more precise:

Use Lambda functions for event handlers and short stateless tasks.
Use containers when you need packaging flexibility and can manage isolation risk.
Use Lambda MicroVMs when each tenant, user, or agent run needs a dedicated stateful sandbox.

This is especially relevant for AI systems. As more applications let agents write code, execute tools, inspect repositories, or process customer files, isolation becomes part of the product boundary. A prompt injection bug should not become a cross-tenant file access bug.

The trade-offs are still real

MicroVMs reduce a lot of infrastructure work, but they do not remove architecture responsibility.

First, lifecycle policy becomes a cost control. If idle sessions stay warm too long, the bill can drift. If they suspend too aggressively, users feel resume latency. Teams should treat idle duration as a product setting, not a default copied from a sample.

Second, snapshot-based startup changes how applications initialize. Code that generates unique state, opens long-lived external connections, or assumes initialization happens once per user action needs careful review.

Third, stateful sandboxes need cleanup rules. Temporary files, credentials, downloaded packages, generated artifacts, and logs can accumulate. Builders should define what survives during a session, what is exported, and what is destroyed.

Finally, security does not stop at VM boundaries. The execution role, outbound network policy, source artifact pipeline, token handling, and tenant mapping are still part of the isolation story.
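The idle-duration trade-off from the first point can be explored with a toy model before any real pricing is involved. The sketch below assumes a simple policy: an environment stays warm while idle until a timeout, then suspends, and the next interaction triggers a resume. That policy, the function name, and the sample gaps are my own assumptions for illustration and do not describe how the service actually bills or suspends.

```python
def simulate_idle_policy(gaps_seconds: list[float], idle_timeout_seconds: float) -> tuple[float, int]:
    """Return (warm idle seconds, resume count) for one session.

    gaps_seconds holds the pauses between consecutive user interactions.
    """
    if idle_timeout_seconds < 0:
        raise ValueError("idle_timeout_seconds must be >= 0")
    warm_idle = 0.0
    resumes = 0
    for gap in gaps_seconds:
        if gap > idle_timeout_seconds:
            # Warm until the timeout fires, then suspended until the user returns.
            warm_idle += idle_timeout_seconds
            resumes += 1
        else:
            # The user came back before the timeout, so the whole gap stayed warm.
            warm_idle += gap
    return warm_idle, resumes


# Hypothetical session: pauses of 30 s, 10 min, 2 min, and 30 min.
session_gaps = [30, 600, 120, 1800]
for timeout in (60, 300, 900):
    warm, resumes = simulate_idle_policy(session_gaps, timeout)
    print(f"timeout={timeout:>4}s warm_idle={warm:>6.0f}s resumes={resumes}")
```

Feeding real session gap data into a model like this turns the timeout into a measurable choice between warm seconds paid for and resumes users have to wait through.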

What to do next

I would start with workloads where the current workaround is obviously expensive: per-user EC2 sandboxes, over-hardened container runners, or Lambda workflows full of awkward /tmp and rehydration logic.

For a proof of concept, validate four things before celebrating:

Cold launch and resume behavior with your real image size and dependencies.
Idle cost profile for normal user behavior, not a synthetic benchmark.
Tenant boundary tests for filesystem, process, network, and IAM access.
Failure cleanup when a session crashes, times out, or is abandoned.

Lambda MicroVMs are not just another Lambda feature. They are AWS acknowledging that the next wave of serverless workloads includes interactive, stateful, sometimes untrusted execution. That is a useful primitive, as long as teams treat it as an isolation architecture rather than a shortcut around security design.

Lambda runtime upgrades need campaigns, not reminders

Emiliano Montesdeoca — Sat, 20 Jun 2026 00:00:00 +0000

Lambda runtime upgrades are easy to postpone because the function still works today. Then a deprecation date becomes a support risk, a security exception, or a rushed platform campaign.

The AWS Compute Blog post on upgrading Lambda function runtimes at scale with AWS Transform custom is useful because it treats runtime upgrades as a repeatable modernization workflow, not a calendar reminder.

What changed

AWS Transform custom can analyze codebases, identify runtime upgrade risk, apply transformations, update dependencies or configuration, and validate against exit criteria. The source article focuses on Lambda runtime upgrades, including cases where code changes are required, such as moving callback-style Node.js handlers toward async patterns.

For platform teams, the more important piece is campaign management. A large organization rarely has one repository and one owner. It has hundreds of functions across many accounts, teams, IaC stacks, and CI/CD systems.

Why builders should care

Runtime upgrades fail when ownership is unclear.

An application team may not know a function exists. A platform team may see the deprecated runtime but not understand the deployment path. Security may see the exposure but not own the code. The result is a spreadsheet-driven migration with too many manual exceptions.

A transformation tool can help, but only if it plugs into an operating model:

inventory functions and repositories,
map functions to owners,
assess code and dependency risk,
create pull requests or transformation branches,
validate with tests and alarms,
deploy with safe rollout and rollback,
track completion centrally.

That is a campaign, not a one-off task.

The trade-offs

Automated transformation is not a substitute for test coverage. It increases the value of tests because it can make changes faster than humans can manually review every edge case.

Before trusting a runtime upgrade, teams should check:

unit and integration tests for handler behavior,
dependency compatibility with the target runtime,
infrastructure definitions that set the runtime,
packaging and build pipeline changes,
observability after deployment,
alias or version-based rollback strategy.

For platform teams, the other trade-off is centralization. Pushing changes across many repositories can be efficient, but service owners still need to understand and approve behavior changes. The best model is usually centralized orchestration with distributed ownership.

What to do next

Start with inventory. Use AWS Config, Trusted Advisor, CLI scripts, tags, or deployment metadata to find functions approaching deprecation. Then group them by owner, runtime, business criticality, and test readiness.
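The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not the workflow from the AWS post: the record fields (name, runtime, owner) and the set of retiring runtimes are assumptions, and in practice the records would be assembled from whatever inventory source you already trust.

```python
from collections import defaultdict

# Hypothetical inventory records; the field names are illustrative,
# not an AWS schema.
INVENTORY = [
    {"name": "orders-api", "runtime": "nodejs16.x", "owner": "payments"},
    {"name": "thumbnailer", "runtime": "python3.8", "owner": "media"},
    {"name": "legacy-export", "runtime": "nodejs16.x", "owner": None},
    {"name": "audit-writer", "runtime": "python3.12", "owner": "security"},
]


def group_upgrade_candidates(functions, retiring_runtimes):
    """Return {(owner, runtime): [function names]} for retiring runtimes."""
    groups = defaultdict(list)
    for fn in functions:
        if fn["runtime"] not in retiring_runtimes:
            continue
        # Functions without an owner are surfaced explicitly, not dropped.
        owner = fn.get("owner") or "UNOWNED"
        groups[(owner, fn["runtime"])].append(fn["name"])
    return dict(groups)


# The retiring set is caller-supplied example data; confirm real dates
# against the Lambda runtime deprecation schedule.
campaign = group_upgrade_candidates(INVENTORY, {"nodejs16.x", "python3.8"})
for (owner, runtime), names in sorted(campaign.items()):
    print(owner, runtime, names)
```

Keeping unowned functions in their own bucket is deliberate: they are usually the ones that stall a campaign.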

For low-risk functions with good tests, an automated transform can move quickly. For high-risk functions with poor tests, the first transformation should probably add documentation, tests, or alarms before changing the runtime.
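One way to make that triage explicit is a small rule function. The sketch below is an assumption about how a team might encode it: the has_tests and critical fields are hypothetical, and the middle case (critical but well tested) is a judgment call that the source article does not prescribe.

```python
def first_step(fn):
    """Pick the first campaign action for one function record.

    `fn` is a hypothetical dict with boolean `has_tests` and `critical` keys.
    """
    if not fn["has_tests"]:
        # Poor test coverage: build a safety net before touching the runtime.
        return "add-tests-and-alarms"
    if fn["critical"]:
        # Assumed middle ground: transform, but require owner review.
        return "transform-with-owner-review"
    return "automated-transform"


print(first_step({"has_tests": True, "critical": False}))
print(first_step({"has_tests": False, "critical": True}))
```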

The lesson from this AWS post is broader than Lambda. Modernization work becomes manageable when it is continuous, visible, and validated. If runtime upgrades are still handled as emergency cleanup, the tooling can help, but the process needs upgrading too.

Before downsizing EC2, simulate the EBS burst budget

Emiliano Montesdeoca — Wed, 17 Jun 2026 00:00:00 +0000

Rightsizing EC2 usually starts with CPU and memory. That is necessary, but it is not enough.

The AWS Compute Blog post on simulating Amazon EC2 EBS burst credits before downsizing an instance is a good reminder that storage performance can be the hidden reason a “safe” downsize becomes a production incident.

What changed

The article walks through a simulation approach for burstable EBS-optimized instance performance. Instead of looking only at average utilization, it pulls EBS read and write metrics from CloudWatch, compares them with the baseline and burst ceiling of a target instance type, and simulates whether IOPS or throughput credits would run out after downsizing.
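The core idea can be sketched as a simple token-bucket model. This is not the code from the AWS post, and it is not AWS's exact credit accounting. It assumes a bucket sized to sustain the burst ceiling for 30 minutes, refill at the unused share of the baseline, and demand expressed as one rate per five-minute period. The names Limits and simulate_credits, and all numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    baseline: float  # sustained rate of the target instance (IOPS or MiB/s)
    burst: float     # burst ceiling, in the same unit


def simulate_credits(demand, limits, period_s=300, burst_window_s=1800):
    """Replay per-period demand rates against a simplified credit bucket.

    Returns (lowest balance as a percentage, number of throttled periods).
    """
    # Assumption: a full bucket sustains the burst ceiling for burst_window_s.
    capacity = max(0.0, (limits.burst - limits.baseline) * burst_window_s)
    balance = capacity
    lowest = capacity
    throttled = 0
    for rate in demand:
        served = min(rate, limits.burst)  # the ceiling caps what can be served
        needed = (served - limits.baseline) * period_s  # negative means refill
        if rate > limits.burst or needed > balance:
            throttled += 1
        balance = min(capacity, max(0.0, balance - needed))
        lowest = min(lowest, balance)
    lowest_pct = 100.0 * lowest / capacity if capacity else 0.0
    return lowest_pct, throttled


target = Limits(baseline=1000, burst=3000)        # hypothetical target shape
quiet_day = [500] * 12                            # always under the baseline
long_burst = [3000] * 7                           # 35 minutes at the ceiling
print(simulate_credits(quiet_day, target))
print(simulate_credits(long_burst, target))
```

Run the same replay once with IOPS and once with throughput, since either bucket can be the one that empties first.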

The important metrics include:

EBS read and write bytes,
EBS read and write operations,
EBS I/O and byte balance percentages,
instance EBS IOPS and throughput exceeded checks.

The outcome is more useful than a simple utilization chart: it tells you whether the target instance can absorb the actual I/O pattern without draining credits and throttling.

Why builders should care

Cost optimization often fails when it optimizes one dimension in isolation. A smaller instance can look perfect from a CPU perspective but have lower EBS baseline performance. If the workload depends on burst credits during business hours, the downsize may save compute cost while increasing latency, queue depth, or database wait time.

This matters especially for databases, analytics workers, search nodes, and file-heavy applications. It also matters when downsizing reduces memory. Less memory can mean less cache, which can increase disk I/O after the move.

The practical lesson: do not approve a downsize until the storage path has been modeled.

The trade-offs

The simulation described by AWS is intentionally conservative when it uses maximum statistics over five-minute periods, because the highest sample in each period is treated as if it held for the whole period. That is useful for safety, but it can overstate credit drain. If the conservative simulation passes, confidence is high. If it fails, the answer is not automatically “do nothing.” It may mean you need higher-resolution data, a different target instance, gp3 tuning, or application-level changes.
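A small worked example shows how large the overstatement can be. The numbers are invented for illustration: a ten-second spike at 3,000 IOPS inside an otherwise quiet five-minute period, against an assumed baseline of 1,000 IOPS.

```python
def credit_drain(rate, baseline, seconds):
    """Credits consumed above the baseline over a span of seconds."""
    return max(0.0, rate - baseline) * seconds


baseline = 1000                   # assumed target baseline, in IOPS
spike_rate, spike_s = 3000, 10    # short spike inside the period
quiet_rate, quiet_s = 500, 290    # remainder of the five minutes

# Conservative view: the maximum is assumed to hold for all 300 seconds.
assumed = credit_drain(spike_rate, baseline, spike_s + quiet_s)

# Higher-resolution view: only the spike itself drains credits.
actual = (credit_drain(spike_rate, baseline, spike_s)
          + credit_drain(quiet_rate, baseline, quiet_s))

print(assumed, actual, assumed / actual)
```

In this made-up case the conservative view assumes thirty times more drain than the spike caused, which is why a failed conservative run is a prompt to get finer-grained data, not a verdict.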

Also, burst credits are not a performance strategy. They are a buffer. If normal business traffic depends on sustained bursting, the workload is under-provisioned for its real behavior.

What to do next

For each EC2 rightsizing candidate, add a storage check to the decision process:

Pull at least two weeks of CloudWatch EBS metrics, or four weeks if the workload has monthly cycles.
Check whether the current instance already hits EBS exceeded metrics.
Compare peak and sustained I/O against the target instance baseline and burst ceiling.
Simulate credit balance for both IOPS and throughput.
Monitor EBSByteBalance%, EBSIOBalance%, and exceeded checks after the change.
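The comparison step in the checklist above can be expressed as a small classifier. This is a sketch under stated assumptions: classify_io is a hypothetical helper, "sustained" is whatever high-percentile or business-hours rate you choose to represent normal load, and the labels are illustrative rather than AWS terminology.

```python
def classify_io(peak, sustained, baseline, burst):
    """Compare observed I/O rates with a target instance's EBS limits.

    All four arguments share one unit (IOPS or MiB/s).
    """
    if peak > burst:
        return "exceeds-burst-ceiling"      # throttled even with full credits
    if sustained > baseline:
        return "depends-on-sustained-burst"  # credits are hiding under-provisioning
    if peak > baseline:
        return "occasional-burst"           # simulate the credit balance next
    return "fits-baseline"


# Hypothetical target limits: baseline 1,000 and burst 3,000 IOPS.
print(classify_io(peak=2500, sustained=800, baseline=1000, burst=3000))
print(classify_io(peak=2500, sustained=1400, baseline=1000, burst=3000))
```

Only the "occasional-burst" case needs the full credit simulation. The two worse cases already argue against the target shape, and the last one is safe on the storage dimension.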

This is the kind of small operational habit that prevents cost work from damaging reliability. A good rightsizing recommendation should say not only “CPU is low,” but also “storage credits survive the target shape.”
