Security | The AWS Blog

ACM ACME support turns certificate automation into a governance problem

Emiliano Montesdeoca — Tue, 30 Jun 2026 00:00:00 +0000

Certificate automation is becoming less optional. Validity periods are getting shorter, and manual renewal processes are exactly the kind of quiet operational risk that eventually becomes an outage.

AWS has now added ACME support for public certificates in AWS Certificate Manager. The headline is compatibility with ACMEv2 clients such as Certbot, cert-manager, and acme.sh. The bigger builder impact is that certificate issuance can become standardized without moving governance outside AWS.

What changed

ACM now provides a managed ACME server endpoint for public certificates issued by Amazon Trust Services. Teams can point existing ACME clients at ACM, request certificates through the protocol they already know, and still have ACM visibility, CloudTrail logging, CloudWatch metrics, and certificate expiry notifications.

The source article also highlights two controls that matter in real organizations:

domain scopes at the endpoint level,
External Account Binding credentials for client registration.

That means a central PKI or platform team can validate domains once, restrict what clients can request, and avoid distributing DNS credentials to every application team.
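To make the idea of endpoint-level domain scopes concrete, here is a minimal Python sketch of the kind of check a scope implies. The matching semantics (a scope covers its apex and all subdomains, wildcards are rejected unless explicitly allowed) and the function names are my own assumptions for illustration, not ACM's documented behavior.

```python
def name_in_scope(requested: str, scope: str) -> bool:
    """Return True if a requested DNS name falls under an allowed scope.

    Assumed semantics (hypothetical, not ACM's documented rules):
    a scope of "apps.example.com" covers that name and any subdomain.
    """
    requested = requested.lower().rstrip(".")
    scope = scope.lower().rstrip(".")
    # A wildcard request is judged by the domain it would cover.
    if requested.startswith("*."):
        requested = requested[2:]
    # Require a label boundary so "evil-example.com" never matches "example.com".
    return requested == scope or requested.endswith("." + scope)


def check_request(names, scopes, allow_wildcards=False):
    """Split requested names into (allowed, rejected) lists."""
    allowed, rejected = [], []
    for name in names:
        is_wildcard = name.startswith("*.")
        in_scope = any(name_in_scope(name, scope) for scope in scopes)
        if in_scope and (allow_wildcards or not is_wildcard):
            allowed.append(name)
        else:
            rejected.append(name)
    return allowed, rejected


if __name__ == "__main__":
    requested = [
        "api.apps.example.com",
        "*.apps.example.com",
        "apps.example.com.attacker.net",
    ]
    print(check_request(requested, ["apps.example.com"]))
```

The point of the sketch is the label-boundary check and the separate wildcard switch: both are decisions a platform team has to make explicitly, whatever the real scope syntax turns out to be.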

Why this matters

Most certificate incidents are not caused by cryptography being hard. They are caused by ownership being unclear.

One team owns DNS. Another owns the ingress controller. Another owns the application. Someone installed a certificate manually two years ago. Nobody remembers the renewal path until browsers start showing errors.

ACME support in ACM gives builders a cleaner operating model:

Kubernetes teams can keep using cert-manager.
VM and container teams can keep using common ACME clients.
Platform teams can centralize issuance policy and audit.
Security teams can see certificates in ACM instead of chasing external CA dashboards.

That is a good trade: keep the open protocol at the edge and centralize governance at the account or organization boundary.

The architecture trade-offs

This does not remove the need for certificate lifecycle design.

First, decide who owns ACME endpoints. If every workload account creates its own endpoint with broad wildcard permissions, you have recreated the sprawl problem. A better pattern is to define endpoints per domain boundary, environment, or platform domain, then scope allowed names carefully.

Second, treat External Account Binding credentials like bootstrap secrets. They should be short-lived where possible, stored securely, and rotated when registration flows change.

Third, think about wildcard issuance. Wildcards are convenient, but they expand blast radius. Many production environments should prefer exact domains and subdomains unless there is a strong operational reason.

Finally, pricing and issuance volume matter. Moving high-churn ACME issuance into ACM should be costed, especially for systems that create many short-lived certificates across tenants.

What builders should do next

If I were running a platform team, I would start with an inventory:

Which public certificates are already in ACM?
Which are issued by external ACME providers?
Which workloads use cert-manager or custom renewal scripts?
Which teams still depend on manual DNS validation?

Then I would pilot ACM ACME with one non-critical domain and one existing ACME client. The goal is not only to issue a certificate. The goal is to prove the operating model: endpoint ownership, scope policy, audit trail, renewal alarms, and incident response.

The useful part of this launch is not that Certbot can talk to AWS. It is that certificate automation can finally be treated as platform governance instead of a collection of one-off scripts.

Faster S3 access log queries make storage security more usable

Emiliano Montesdeoca — Mon, 29 Jun 2026 00:00:00 +0000

S3 access logs are valuable, but value delayed often becomes value ignored. If logs are difficult to query, teams use them only after something has already gone wrong.

The AWS Storage Blog post on querying Amazon S3 access logs instantly with CloudWatch and S3 Tables is interesting because it focuses on making access data easier to use in everyday operations.

What changed

The article walks through delivering S3 access log data to CloudWatch Logs and S3 Tables, then using those destinations for dashboards, alarms, and queries. The practical shift is from “logs exist somewhere” to “logs are queryable enough to support investigation and monitoring.”

For builders, that means S3 access patterns can become part of normal observability instead of a separate forensic workflow.

Why it matters

S3 is often where the most important data lives. Application logs, analytics exports, customer files, model artifacts, backups, and operational reports all end up in buckets. Yet many teams still monitor bucket configuration more carefully than bucket access behavior.

Queryable access logs help answer questions that matter:

Which principals are reading sensitive prefixes?
Did access spike after a deployment?
Are clients getting unexpected 403 or 404 responses?
Which workloads are driving request cost?
Did a suspicious IP enumerate objects?
Are lifecycle or replication assumptions visible in traffic?

When these questions are cheap to answer, teams ask them earlier.

The trade-offs

More logging is not free. S3 access log delivery, CloudWatch ingestion, query volume, table storage, retention, and dashboards all have cost implications. The right design depends on the bucket’s risk and traffic profile.

I would not send every low-value development bucket into a high-retention analytics pipeline. I would prioritize buckets that contain customer data, security logs, production artifacts, financial exports, backups, or AI training and retrieval data.

Also, access logs are only one layer. They should complement CloudTrail data events, IAM Access Analyzer, Macie, GuardDuty, S3 Storage Lens, and application-level audit logs. Different signals answer different questions.

What to do next

Pick a production bucket that matters and define three operational queries before building anything. For example:

top readers by prefix over the last hour,
denied requests by principal and source IP,
unusual data transfer volume compared with baseline.
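Before committing to a pipeline, the first two queries can be prototyped locally against a handful of raw log lines. The Python sketch below parses only the leading fields of the S3 server access log format and ignores the rest. The sample records, account ID, and principals are made up, and the "first path segment" definition of a prefix is my assumption.

```python
import re
from collections import Counter

# Leading fields of an S3 server access log record. Trailing fields
# (object size, timings, referer, user agent, ...) are ignored here.
LOG_PATTERN = re.compile(
    r'(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'(?:"[^"]*"|-) (?P<status>\d{3}|-) (?P<error_code>\S+) (?P<bytes_sent>\S+)'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def key_prefix(key):
    """First path segment of a key, e.g. 'reports/' for 'reports/q1.csv'."""
    return key.split("/", 1)[0] + "/" if "/" in key else "(root)"


def top_readers_by_prefix(records, limit=3):
    """Count successful object reads per (prefix, requester)."""
    counts = Counter(
        (key_prefix(r["key"]), r["requester"])
        for r in records
        if r["operation"] == "REST.GET.OBJECT" and r["status"].startswith("2")
    )
    return counts.most_common(limit)


def denied_requests(records):
    """Count 403 responses per (requester, source IP)."""
    return Counter((r["requester"], r["ip"]) for r in records if r["status"] == "403")


def sample_line(ip, requester, key, status, error="-"):
    """Build a made-up log line shaped like the documented format."""
    return (
        f'ownerid demo-bucket [29/Jun/2026:10:00:01 +0000] {ip} {requester} REQID '
        f'REST.GET.OBJECT {key} "GET /{key} HTTP/1.1" {status} {error} '
        f'2048 2048 12 10 "-" "aws-cli" -'
    )


# Hypothetical principals and addresses, for illustration only.
ALICE = "arn:aws:iam::111122223333:user/alice"
BOB = "arn:aws:iam::111122223333:user/bob"
SAMPLE_LINES = [
    sample_line("192.0.2.3", ALICE, "reports/q1.csv", 200),
    sample_line("192.0.2.3", ALICE, "reports/q2.csv", 200),
    sample_line("198.51.100.7", BOB, "backups/db.dump", 403, "AccessDenied"),
]
records = [r for r in map(parse_line, SAMPLE_LINES) if r is not None]

if __name__ == "__main__":
    print(top_readers_by_prefix(records))
    print(denied_requests(records))
```

If the questions cannot be answered on a sample like this, a dashboard will not answer them either. The third query (volume against baseline) needs history, so it is better left to whichever query engine ends up holding the logs.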

Then build the logging path and dashboard around those questions. Add alarms only where the signal is actionable; noisy storage alarms quickly become invisible.

The main takeaway is that storage observability should be designed for regular use, not just incident response. Faster S3 access log queries make that much more realistic.

Secure ML environments need productivity and exfiltration controls together

Emiliano Montesdeoca — Mon, 29 Jun 2026 00:00:00 +0000

Machine learning teams need access to sensitive data, but the old answer of isolated desktops and manual monitoring does not scale well. It is expensive, slow, and frustrating for the people trying to build models.

The AWS Architecture Blog post on preventing data exfiltration in machine learning environments with Amazon SageMaker AI is useful because it balances two goals that are often treated as opposites: data scientist productivity and strict exfiltration control.

What changed

The source article describes a three-layer architecture using Amazon WorkSpaces Secure Browser, SageMaker AI, VPC endpoints, DNS controls, endpoint policies, and account restrictions.

The pattern is straightforward:

data scientists access the environment through a controlled browser,
browser capabilities such as upload, download, clipboard, and printing are restricted,
access is limited to approved AWS and SageMaker domains,
SageMaker runs without direct internet egress,
VPC endpoint policies restrict access to organization-owned resources,
DNS Firewall and private endpoints reduce bypass paths.

This is not a single magic control. It is layered friction against data leaving through the browser, the network, AWS APIs, or cross-account paths.
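As one illustration of the endpoint-policy layer, the Python sketch below builds an S3 VPC endpoint policy that only allows organization principals to reach organization-owned resources, using the aws:PrincipalOrgID and aws:ResourceOrgID condition keys. This is my own minimal example, not the policy from the source article, and the organization ID is a placeholder.

```python
import json


def org_only_endpoint_policy(org_id: str) -> dict:
    """Build an S3 VPC endpoint policy limited to one AWS Organization.

    Assumption: a real environment usually needs extra statements for
    AWS-owned buckets it depends on; those are deliberately left out here.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowOrgPrincipalsToOrgResources",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        # The caller must belong to the organization ...
                        "aws:PrincipalOrgID": org_id,
                        # ... and so must the bucket being accessed.
                        "aws:ResourceOrgID": org_id,
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # "o-exampleorgid" is a placeholder, not a real organization ID.
    print(json.dumps(org_only_endpoint_policy("o-exampleorgid"), indent=2))
```

Generating the document from code rather than hand-editing JSON makes it easier to keep the same condition block consistent across every endpoint in the environment.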

Why builders should care

ML environments are particularly risky because they combine sensitive data, powerful compute, notebooks, terminals, package installation, model artifacts, and curious users. Blocking everything makes teams ineffective. Allowing everything creates obvious exfiltration paths.

A layered architecture gives teams a middle path. Data scientists can use managed ML tooling while the platform enforces where data can flow.

This is especially relevant for fintech, healthcare, insurance, public sector, and any organization fine-tuning or evaluating models on sensitive datasets.

The trade-offs

Every restriction has a productivity cost.

No internet access means dependency management must be planned. Package mirrors, approved repositories, model artifact flows, and update processes become platform responsibilities. URL allowlists must be maintained. Endpoint policies must be tested. Break-glass workflows must exist for legitimate exceptions.

There is also a risk of building controls that look strong but are not complete. If S3 endpoint policies block external buckets but another service path remains open, data can still move. If DNS Firewall covers only part of the environment, bypasses may remain.

What to do next

Start by mapping data egress paths, not by choosing tools. For an ML workspace, list how data could leave:

browser download,
clipboard or print,
external website upload,
S3 copy to another account,
package manager or shell command,
notebook output,
model artifact export,
DNS or HTTPS tunnel.

Then put controls around each path and test them with realistic user workflows. Involve data scientists early; controls that break normal work will be bypassed or abandoned.
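A simple way to keep that mapping honest is to track it as data and flag any path that has no tested control. The Python sketch below is a hypothetical bookkeeping aid: the path names mirror the list above, and the control names are examples I chose, not a recommended set.

```python
EGRESS_PATHS = [
    "browser download",
    "clipboard or print",
    "external website upload",
    "S3 copy to another account",
    "package manager or shell command",
    "notebook output",
    "model artifact export",
    "DNS or HTTPS tunnel",
]


def untested_paths(controls):
    """Return egress paths that have no control marked as tested.

    `controls` maps a path name to a list of (control_name, tested) pairs.
    """
    gaps = []
    for path in EGRESS_PATHS:
        entries = controls.get(path, [])
        if not any(tested for _name, tested in entries):
            gaps.append(path)
    return gaps


if __name__ == "__main__":
    # Example state of a pilot environment (made up for illustration).
    controls = {
        "browser download": [("secure browser policy", True)],
        "clipboard or print": [("secure browser policy", True)],
        "S3 copy to another account": [("VPC endpoint policy", False)],
        "DNS or HTTPS tunnel": [("DNS Firewall", True), ("no internet egress", True)],
    }
    for path in untested_paths(controls):
        print("needs a tested control:", path)
```

A path with a control that was deployed but never tested is reported the same way as a path with no control at all, which matches how an attacker would see it.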

The practical takeaway: secure ML platforms should not be air-gapped by default or open by convenience. They should make the safe path productive and the unsafe path difficult.

Restricting AWS Console access by network is a useful perimeter, not a complete identity strategy

Emiliano Montesdeoca — Wed, 24 Jun 2026 00:00:00 +0000

Identity controls usually answer who can sign in. Network-aware console controls help answer where that sign-in is allowed to happen.

The AWS Security Blog post on restricting AWS Management Console access to expected networks with sign-in resource-based policies and RCPs is important for organizations building data perimeters. It gives security teams another way to reduce the usefulness of stolen credentials.

What changed

AWS now supports patterns where Management Console sign-in can be restricted based on expected networks using sign-in resource-based policies and AWS Organizations resource control policies. The source article walks through a financial services scenario, CloudTrail verification, Console Private Access integration, and data perimeter alignment.

The builder impact is straightforward: console access can be constrained closer to the network boundary, not only by identity and MFA.

Why it matters

A compromised credential should not work equally well from anywhere on the internet. Network restrictions are not perfect, but they raise the bar and reduce the blast radius of many common incidents.

This is especially relevant for:

regulated environments,
privileged administration accounts,
break-glass access patterns,
organizations using centralized corporate egress,
teams implementing broader AWS data perimeter controls.

When combined with IAM Identity Center, MFA, least privilege, CloudTrail, and anomaly detection, expected-network policies add another useful checkpoint.

The trade-offs

Network conditions change. Employees travel, VPNs fail, disaster recovery procedures happen, and incident responders may need access from unusual locations. If console network restrictions are too rigid, they can become an availability risk during exactly the moment when access matters most.

I would treat this as a tiered control:

strict policies for highly privileged production administration,
documented exceptions for break-glass roles,
tested access paths for incident response,
monitoring for denied attempts,
clear ownership of allowed network ranges.

Also, do not confuse console restrictions with API restrictions. Many automation paths use AWS APIs, not the web console. A complete perimeter strategy must cover both human console access and programmatic access patterns.

What to do next

Start with privileged accounts and roles, not every user at once. Define the expected networks, verify them with CloudTrail, and test both allowed and denied sign-in paths.
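The verification step can start as a small script over exported CloudTrail records. The Python sketch below flags ConsoleLogin events whose sourceIPAddress falls outside a set of expected CIDR ranges. The ranges, ARNs, and events are invented for the example, and it does not use or model the new sign-in policy syntax.

```python
import ipaddress


def is_expected(source_ip, networks):
    """True if source_ip parses as an IP address inside one of the networks."""
    try:
        address = ipaddress.ip_address(source_ip)
    except ValueError:
        # CloudTrail sometimes records a service name instead of an IP.
        return False
    return any(address in network for network in networks)


def unexpected_console_logins(events, expected_cidrs):
    """Return ConsoleLogin events that came from outside the expected ranges."""
    networks = [ipaddress.ip_network(cidr) for cidr in expected_cidrs]
    return [
        event
        for event in events
        if event.get("eventName") == "ConsoleLogin"
        and not is_expected(event.get("sourceIPAddress", ""), networks)
    ]


if __name__ == "__main__":
    # Hypothetical corporate egress ranges (documentation address space).
    expected = ["203.0.113.0/24", "198.51.100.0/25"]
    events = [
        {"eventName": "ConsoleLogin", "sourceIPAddress": "203.0.113.10",
         "userIdentity": {"arn": "arn:aws:iam::111122223333:user/admin"}},
        {"eventName": "ConsoleLogin", "sourceIPAddress": "192.0.2.55",
         "userIdentity": {"arn": "arn:aws:iam::111122223333:user/admin"}},
        {"eventName": "GetObject", "sourceIPAddress": "192.0.2.55"},
    ]
    for event in unexpected_console_logins(events, expected):
        print(event["sourceIPAddress"], event["userIdentity"]["arn"])
```

Running something like this over a few weeks of history before enforcing anything shows which legitimate sign-ins would have been denied, which is cheaper to learn from a report than from a locked-out administrator.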

Then rehearse failure cases:

corporate VPN unavailable,
emergency access required,
identity provider outage,
support engineer working from a different location,
network range changed but policy not updated.

The practical takeaway is that network-aware console access is a strong additional guardrail. It should complement identity, not replace it. Used carefully, it makes stolen credentials less useful without making legitimate operations fragile.

EKS control plane egress through your VPC closes a real private-cluster gap

Emiliano Montesdeoca — Mon, 22 Jun 2026 00:00:00 +0000

Private EKS clusters have always had two sides: how clients and nodes reach the API server, and how the API server reaches things on behalf of Kubernetes. The first side was easier to reason about. The second side had awkward edges.

AWS has announced customer-routed control plane egress for Amazon EKS, which routes customer-controllable Kubernetes API server outbound traffic through your VPC. That includes admission webhook callbacks, OIDC discovery, and aggregated API server requests.

This is a practical feature for teams that need private webhooks, private identity providers, and auditable network paths.

What changed

With the new controlPlaneEgressMode set to CUSTOMER_ROUTED, the Kubernetes API server uses an elastic network interface in your VPC for specific outbound flows. Those flows can then follow your routing, security groups, VPC endpoints, DNS, Network Firewall, PrivateLink, Direct Connect, and logging patterns.

EKS-managed service traffic still uses the EKS-managed path. The feature is scoped to customer-controllable API server egress, not every packet from the control plane.

One important detail: after a cluster uses CUSTOMER_ROUTED, the setting is immutable for the life of the cluster. That makes planning more important than experimentation on a random production cluster.

Why it matters

Admission webhooks are often part of the security boundary. They validate images, enforce labels, inject sidecars, block risky configurations, and integrate with policy engines. If the API server can only call a public endpoint, teams end up exposing services that they would rather keep private.

The same issue appears with external OIDC identity providers. If the control plane must fetch discovery documents and JWKS over an internet-reachable path, the cluster is not as private as the architecture diagram suggests.

Customer-routed egress makes a cleaner design possible:

webhook services can live behind internal load balancers,
private DNS can resolve API server dependencies,
VPC Flow Logs can show the path,
SCPs can enforce the required egress mode across accounts,
network teams can apply existing inspection and routing controls.

For regulated environments, the value is not only connectivity. It is evidence.

Design considerations

This feature moves responsibility to your network design. If DNS, routes, endpoint policies, or security groups are wrong, API server calls can fail. That can break admissions, identity association, or aggregated APIs.

I would pay attention to four areas:

DNS resolution. The API server now depends on the DNS path available from your VPC configuration for those customer-controllable names.
Webhook availability. A private webhook outage can become a cluster-wide admission outage if failure policies are strict.
Certificate trust. Private does not always mean privately trusted. The source article notes that OIDC issuer certificates still need a publicly trusted chain.
Cost and routing. NAT gateways, cross-AZ paths, inspection appliances, and endpoints can add cost or latency if the path is not designed deliberately.
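The webhook availability point is easiest to reason about as a truth table. The Python sketch below models how the Kubernetes failurePolicy field (Fail or Ignore) turns an unreachable webhook into either a rejected request or a skipped check. It is a simplified model for discussion, not a test against a real cluster, and it ignores timeouts and multiple webhooks.

```python
from itertools import product


def admission_outcome(webhook_reachable, webhook_allows, failure_policy):
    """Model the result of one admission request passing through one webhook."""
    if failure_policy not in ("Fail", "Ignore"):
        raise ValueError(f"unknown failurePolicy: {failure_policy!r}")
    if webhook_reachable:
        return "admitted" if webhook_allows else "denied by webhook"
    if failure_policy == "Fail":
        # The API server treats the call error as a rejection.
        return "rejected (webhook unavailable)"
    # With Ignore, the call error is skipped and the request proceeds unchecked.
    return "admitted (webhook skipped)"


if __name__ == "__main__":
    for reachable, policy in product((True, False), ("Fail", "Ignore")):
        outcome = admission_outcome(reachable, True, policy)
        print(f"reachable={reachable!s:5} failurePolicy={policy:6} -> {outcome}")
```

The two unreachable rows are the ones to decide on deliberately: Fail trades availability for enforcement, and Ignore trades enforcement for availability. Once the webhook sits behind your own routing and DNS, that choice is exercised every time the network path breaks.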

What to do next

Start by identifying clusters that already use admission webhooks, external OIDC providers, or aggregated API servers. Those clusters have the most to gain.

For new clusters, decide whether CUSTOMER_ROUTED should be part of the baseline. For existing clusters, test in a non-production environment with the same webhook and identity dependencies before updating anything important.

Then build a failure test. Block the webhook endpoint, break DNS resolution, and confirm your cluster behavior matches your expectations. Network privacy is useful only if the failure modes are understood.

This EKS change does not make private Kubernetes automatic, but it removes a real architectural compromise. Builders now have a better way to align the control plane’s outbound path with the same network rules they already apply to workloads.

Lambda MicroVMs make isolated sandboxes a serverless design choice

Emiliano Montesdeoca — Mon, 22 Jun 2026 00:00:00 +0000

AWS Lambda MicroVMs are interesting because they do not try to replace normal Lambda functions. They fill a different gap: workloads where the unit of isolation is not an event, but a user session, coding environment, agent run, scanner job, or other stateful sandbox.

The AWS announcement frames this around isolated sandboxes with full lifecycle control. That is the right framing. The practical value is not only that Firecracker provides VM-level isolation. It is that AWS is exposing a managed way to create, pause, resume, and retire those environments without asking every product team to become a virtualization platform team.

What changed

Lambda MicroVMs add a serverless compute primitive inside the Lambda family for running code in isolated, stateful execution environments. The source article describes several important properties:

each session can run in its own Firecracker-backed MicroVM,
environments can launch and resume from pre-initialized snapshots,
memory, disk, and running process state can survive during the session,
idle environments can be suspended by policy,
applications get a dedicated endpoint and short-lived request authentication.

That combination matters for applications that cannot fit cleanly into stateless request-response functions. A code interpreter, browser automation sandbox, vulnerability scanner, AI coding assistant, data notebook, or game scripting environment often needs process state and filesystem state between interactions.

Why builders should care

The old decision tree was uncomfortable. Virtual machines gave strong isolation but slow startup and more operations. Containers started quickly but shared a kernel, which made it harder to run untrusted code safely. Lambda functions were operationally simple but not designed for long-running interactive state.

Lambda MicroVMs create a new middle path. For builders, the design conversation becomes more precise:

Use Lambda functions for event handlers and short stateless tasks.
Use containers when you need packaging flexibility and can manage isolation risk.
Use Lambda MicroVMs when each tenant, user, or agent run needs a dedicated stateful sandbox.

This is especially relevant for AI systems. As more applications let agents write code, execute tools, inspect repositories, or process customer files, isolation becomes part of the product boundary. A prompt injection bug should not become a cross-tenant file access bug.

The trade-offs are still real

MicroVMs reduce a lot of infrastructure work, but they do not remove architecture responsibility.

First, lifecycle policy becomes a cost control. If idle sessions stay warm too long, the bill can drift. If they suspend too aggressively, users feel resume latency. Teams should treat idle duration as a product setting, not a default copied from a sample.

Second, snapshot-based startup changes how applications initialize. Code that generates unique state, opens long-lived external connections, or assumes initialization happens once per user action needs careful review.

Third, stateful sandboxes need cleanup rules. Temporary files, credentials, downloaded packages, generated artifacts, and logs can accumulate. Builders should define what survives during a session, what is exported, and what is destroyed.

Finally, security does not stop at VM boundaries. The execution role, outbound network policy, source artifact pipeline, token handling, and tenant mapping are still part of the isolation story.
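The idle-duration trade-off from the first point can be explored with a few lines of Python before any pricing is known. The sketch below takes the idle gaps between a user's interactions and an idle timeout, then counts how many seconds the sandbox stays warm while idle and how many resumes the user experiences. The suspend-after-timeout behavior and the sample gaps are assumptions for illustration, not documented Lambda MicroVM semantics.

```python
def simulate_session(gaps_seconds, idle_timeout_seconds):
    """Return (warm_idle_seconds, resumes) for one session.

    gaps_seconds: idle time between consecutive user interactions.
    Assumed behavior: the sandbox stays warm until the timeout elapses,
    is then suspended, and resumes on the next interaction.
    """
    warm_idle = 0
    resumes = 0
    for gap in gaps_seconds:
        if gap > idle_timeout_seconds:
            warm_idle += idle_timeout_seconds  # warm until suspended
            resumes += 1                       # next interaction pays resume latency
        else:
            warm_idle += gap                   # never suspended during this gap
    return warm_idle, resumes


if __name__ == "__main__":
    # Made-up gaps for one user: quick edits, a meeting, a lunch break.
    gaps = [30, 600, 45, 1800, 120]
    for timeout in (60, 900):
        warm, resumes = simulate_session(gaps, timeout)
        print(f"timeout={timeout}s warm_idle={warm}s resumes={resumes}")
```

Feeding real interaction gaps from product telemetry into a model like this turns the timeout into a measurable choice between warm seconds paid for and resumes felt by users.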

What to do next

I would start with workloads where the current workaround is obviously expensive: per-user EC2 sandboxes, over-hardened container runners, or Lambda workflows full of awkward /tmp and rehydration logic.

For a proof of concept, validate four things before celebrating:

Cold launch and resume behavior with your real image size and dependencies.
Idle cost profile for normal user behavior, not a synthetic benchmark.
Tenant boundary tests for filesystem, process, network, and IAM access.
Failure cleanup when a session crashes, times out, or is abandoned.

Lambda MicroVMs are not just another Lambda feature. They are AWS acknowledging that the next wave of serverless workloads includes interactive, stateful, sometimes untrusted execution. That is a useful primitive, as long as teams treat it as an isolation architecture rather than a shortcut around security design.