
Building a Cloud-first Infrastructure Platform: Lessons from an AWS Transformation

Author: Zachary Loeber | Posted In: AWS, Cloud

When our team embedded with the infrastructure and security group at a global manufacturing company, we knew we were signing up for something ambitious. The engagement brief was straightforward on paper: help modernize a fragmented cloud environment, improve security posture, and introduce Infrastructure as Code. What we discovered on day one was that “fragmented” was doing a lot of heavy lifting in that sentence.

What follows is a walkthrough of the technical decisions, architectural patterns, and hard-won lessons from that yearlong transformation. We are sharing this because the challenges we encountered are common across organizations of this scale, and the approaches that worked for us are worth passing on to other practitioners facing similar problems.

Starting Point: Mapping the Terrain

Before any transformation work could begin, we needed to understand what we were dealing with. The environment had grown organically over many years, and the reality was messy: multiple legacy AWS accounts with no centralized governance, a near-total absence of IaC, networking built on a web of transit gateway peerings, and manual, ticket-based provisioning for almost everything.

One of the first things we did was map every AWS account, every major workload, and every network dependency. This took longer than expected and revealed surprises: accounts nobody had full visibility into, credentials stored in pipeline environment variables, and workloads teams believed were decommissioned but that were still running and incurring costs.

Tip

Before you start migrating anything, invest time in a complete inventory. The cost of a thorough discovery phase is far lower than the cost of encountering surprises mid-migration. Budget for this explicitly, because it rarely fits inside the time the business wants to give it.

The AWS Organization Strategy

The centerpiece of the modernization was consolidating all existing AWS accounts into a single AWS Organization with centralized governance. We were starting from a state where accounts existed in entirely separate organizations, or in some cases operated as autonomous roots with their own billing and policies, with no unified visibility, no consistent security controls, and no single pane of glass for anything.

Account Structure and Bootstrapping

We designed the target account structure around a clear hierarchy: a management account for organization-level governance, dedicated accounts for shared services (networking, DNS, security tooling), and team-owned workload accounts structured by business function and environment.

For each workload account, we built an automated bootstrapping process using Terraform. Every account gets the same baseline configuration on day one: GuardDuty and Security Hub enabled, CloudTrail logging configured, default VPCs removed, SSM roles deployed, and IAM permission boundaries set. Security controls were never optional. They were the default state of any new account from the moment it was created.
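
To make this concrete, here is a minimal sketch of what the core of a baseline module can look like. The resource names and the variable are illustrative, not the exact code from the engagement:

```hcl
# Core of an account-baseline module (names illustrative). These are
# unconditional defaults: every new account gets them at creation.

variable "central_log_bucket" {
  type        = string
  description = "Assumed: pre-existing central audit log bucket"
}

resource "aws_guardduty_detector" "baseline" {
  enable = true
}

resource "aws_securityhub_account" "baseline" {}

resource "aws_cloudtrail" "baseline" {
  name                          = "account-baseline-trail"
  s3_bucket_name                = var.central_log_bucket
  is_multi_region_trail         = true
  include_global_service_events = true
}

# Default VPC removal, SSM roles, and permission boundaries follow the
# same pattern: declared once here, applied identically everywhere.
```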

Key Decision

Treat account creation as a deployment, not an ops task. Bootstrapping via Terraform meant every account was identical, auditable, and could be recreated from scratch if needed. This also made compliance conversations much easier, because you can point to code rather than tribal knowledge.

Handling Complex Migrations

Some accounts were straightforward to migrate. Others required careful sequencing. A few were genuinely difficult. Three of those migrations illustrate patterns worth sharing:

The Network-Intensive Hub Migration: One account served as the central networking hub for the legacy environment: a multi-region setup with VPNs, transit gateways, and connectivity to every remote site. Migrating this account meant simultaneously retiring the legacy network architecture and cutting over to the new centralized network fabric without disrupting site connectivity for anyone.

Our approach was to run both network architectures in parallel for a period, using route table manipulation to gradually shift traffic. We migrated workloads in waves and validated connectivity at each step before proceeding. The final cutover was done in a maintenance window, but the bulk of the risk had already been taken out of it by the phased approach. For network-intensive migrations, time spent on parallel-run planning pays dividends in a smoother cutover.
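
As a sketch of what "route table manipulation" means in practice, the pattern below flips each site prefix from the legacy transit gateway to the new one as its wave is validated. The variable names and CIDRs are illustrative, not the literal migration code:

```hcl
# Hypothetical phased cutover: adding a prefix to migrated_prefixes
# retargets its route from the legacy TGW to the new one on next apply.

variable "workload_route_table_id" { type = string }
variable "legacy_tgw_id" { type = string }
variable "new_tgw_id" { type = string }

variable "migrated_prefixes" {
  type    = set(string)
  default = [] # grows one wave at a time as connectivity is validated
}

locals {
  site_prefixes = ["10.10.0.0/16", "10.20.0.0/16"] # illustrative site CIDRs
}

resource "aws_route" "site" {
  for_each               = toset(local.site_prefixes)
  route_table_id         = var.workload_route_table_id
  destination_cidr_block = each.value

  transit_gateway_id = contains(var.migrated_prefixes, each.value) ? var.new_tgw_id : var.legacy_tgw_id
}
```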

The Autonomous Unit Migration: One environment had been operating with a high degree of self-governance, with its own billing, its own policies, and a team accustomed to managing their own account administration. Migrating this into the central Organization required careful coordination with that team, a detailed handoff of administrative responsibilities, and a thoughtful plan for transitioning without breaking production workloads.

The key lesson: when you are asking a team to give up autonomous control and operate within a shared governance model, the technical work is only half of it. The conversation about what they gain, including shared tooling, security coverage, cost visibility, and reduced operational burden, matters as much as the migration plan itself. Stakeholder alignment here is not a soft concern; it is a hard dependency.

The Speed-Over-Elegance Migration: One migration was constrained less by technical complexity than by business timeline. A critical set of environments needed to be in the new Organization on an aggressive schedule. We made a deliberate choice to prioritize speed: rebuild the environments using IaC from day one rather than lift-and-shift the existing infrastructure, accepting some short-term cleanup work in exchange for a clean, code-managed result on the other side. For environments that are already well-understood and not overly stateful, a rebuild is often faster and cleaner than a migration.

Building the IaC Platform: From Zero to Production

At the start of the engagement, the organization had essentially no Infrastructure as Code coverage of its AWS environment. Everything was console click-ops or one-off scripts. By the end of the year, we had dozens of IaC projects running across the environment, each registered to a team and traceable to a cost center.

The Shared Terraform Module Library

The foundation of the IaC platform is a centralized library of versioned, auto-documented Terraform modules. Rather than each team writing their own Terraform from scratch, or copy-pasting from each other with all the drift that creates, teams reference pinned module versions from this shared library.

We built modules across the full stack: VPCs and subnets, EC2 instances with standard security group patterns, RDS clusters, email sending configurations, WAF-protected CDN distributions, S3 buckets with lifecycle policies pre-configured, parameter store integrations, and more. Every module is linted, validated in CI, versioned with semantic versioning, and auto-documented using terraform-docs so teams can consume them without reading the source.
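
Consumption looks roughly like this; the repository URL and version are illustrative, but the pinning discipline is the point:

```hcl
# A team consumes a shared module pinned to an exact release tag,
# never a branch. Upgrades are deliberate version bumps in code review.
module "app_bucket" {
  source = "git::https://git.example.com/infra/terraform-modules.git//s3-bucket?ref=v2.3.1"

  name = "team-app-artifacts"
  # Encryption, lifecycle rules, and public-access blocking are the
  # module's defaults; the consumer overrides only what they must.
}
```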

Approach

Modules should encode organizational best practices, not just abstract resource creation. If your VPC module does not enable flow logs by default, or your S3 module does not block public access by default, you are leaving security decisions to the consumer. They will not always make the right call.
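
A minimal sketch of what that looks like inside a shared S3 module, assuming nothing beyond the standard AWS provider:

```hcl
# Secure-by-default: public access blocking is wired into the module
# itself, so the consumer cannot forget it or opt out of it.
variable "name" { type = string }

resource "aws_s3_bucket" "this" {
  bucket = var.name
}

resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```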

The Self-Service Pipeline Model

We built a shared CI/CD pipeline framework that project teams can register into. The model works like this: a team submits a configuration file defining their project, the target account, and which modules they want to use. From there, they get a pipeline that plans and applies their Terraform on merge, with credentials scoped to only what their project needs in their target account.

This was a deliberate shift from the previous model where the central team was the gatekeeper for every infrastructure change. The key security control that made this work: pipeline credentials are short-lived role assumptions via OIDC, not long-lived access keys stored in environment variables. Every pipeline run assumes a role with a policy scoped to the specific account and actions that project needs. Even if a pipeline is compromised, the blast radius is contained.
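
The shape of that role in Terraform is roughly the following. The issuer URL, claim format, and names are illustrative, since the details vary by CI platform:

```hcl
# Per-project pipeline role: trust is federated to the CI platform's
# OIDC provider and conditioned on the project's identity claim, so
# only that project's pipeline jobs can assume it.
data "aws_iam_openid_connect_provider" "ci" {
  url = "https://ci.example.com" # assumed: your CI platform's issuer
}

data "aws_iam_policy_document" "trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.ci.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "ci.example.com:sub"          # claim name varies by platform
      values   = ["project:team-app:ref:main"] # illustrative subject claim
    }
  }
}

resource "aws_iam_role" "pipeline" {
  name                 = "team-app-pipeline"
  assume_role_policy   = data.aws_iam_policy_document.trust.json
  max_session_duration = 3600 # credentials exist only for the job
}
```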

Key Decision

Eliminate long-lived pipeline credentials entirely. OIDC-based role assumption gives you scoped, short-lived credentials per job with no secrets to rotate or leak. If your CI/CD platform supports it, and most do now, there is no good reason to still be using access keys in pipeline variables.

Self-Service Capabilities

Beyond the core IaC pipeline, we built a suite of self-service capabilities that removed the most common infrastructure bottlenecks teams were hitting:

  • Email sending can be configured by teams via a simple configuration file, including send quotas and DMARC guardrails, without submitting a ticket to the infrastructure team.
  • Isolated sandbox AWS accounts can be created on demand via a code change. They are automatically bootstrapped with the standard security baseline and decommissioned after a configurable TTL (see the sketch after this list).
  • Managed observability workspaces are provisioned and connected to SSO via registration, giving teams visibility without the operational overhead of running their own monitoring infrastructure.
  • Account permission sets are managed by account owners via identity group configuration. The infrastructure team defines the policy boundaries and teams manage their own memberships within them.
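
For example, a sandbox request in this model is nothing more than a module call merged through the normal pipeline. The module source and inputs here are illustrative of the pattern, not our exact interface:

```hcl
# Hypothetical sandbox request: merging this creates a bootstrapped
# account, and the platform tears it down when the TTL expires.
module "sandbox_data_team" {
  source = "git::https://git.example.com/infra/terraform-modules.git//sandbox-account?ref=v1.0.0"

  name        = "data-team-experiments"
  owner_group = "data-engineering"
  ttl_days    = 30
}
```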

The common thread: the central infrastructure team sets the guardrails and builds the tooling, but teams have real autonomy within those guardrails. This is what made adoption stick. Teams will work around a platform that makes them feel like they are filing tickets with a slow bureaucracy. They will embrace a platform that makes them faster.

Networking: Simplifying a Complex Legacy

The legacy network was the kind of thing that makes your stomach drop when you first map it out. Multiple transit gateways, dozens of peering relationships, routes managed in multiple places, and no single authoritative view of how traffic actually flowed between sites and AWS accounts.

The target architecture was significantly simpler: all sites and the AWS Organization connect to a centralized SD-WAN fabric managed by the network team. Network policies are defined and managed in one place. Routes to remote sites are automatically distributed into AWS, with no manual route table entries, no peering relationships to maintain, and no per-site configuration drift.

The guiding principle for network simplification: every point of complexity is a future incident waiting to happen. Transit gateway peerings that nobody fully understands, routes that exist for historical reasons nobody remembers, DNS zones split across three accounts. These are all debt with interest. The more you can centralize and automate network management, the more operable the environment becomes over time.

DNS Centralization

DNS had been a recurring source of pain, with availability incidents, split-horizon issues, and zones owned by multiple accounts with no central authority. The fix was conceptually simple but operationally involved: migrate all domains to a central networking account, consolidate internal zone management, and set up automated zone syncing from site-local DNS servers to the central account in preparation for decommissioning those servers entirely.

The key technical pieces were resolver inbound and outbound endpoints for hybrid DNS resolution, centralized private hosted zones per internal domain, and an automated zone transfer process to handle the ongoing sync during the transition period. Getting DNS right is foundational because almost everything else depends on it. DNS instability causes a disproportionate amount of operational pain relative to how unsexy the problem sounds.
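
In Terraform terms, the core hybrid-DNS pieces look roughly like this; the names, subnets, and internal domain are illustrative:

```hcl
variable "resolver_sg_id" { type = string }
variable "dns_subnet_a_id" { type = string }
variable "dns_subnet_b_id" { type = string }
variable "network_vpc_id" { type = string }

# Inbound endpoint: lets on-prem DNS servers resolve private zones in AWS.
resource "aws_route53_resolver_endpoint" "inbound" {
  name               = "central-inbound"
  direction          = "INBOUND"
  security_group_ids = [var.resolver_sg_id]

  ip_address {
    subnet_id = var.dns_subnet_a_id
  }

  ip_address {
    subnet_id = var.dns_subnet_b_id
  }
}

# One centrally owned private hosted zone per internal domain.
resource "aws_route53_zone" "internal" {
  name = "corp.example.com" # illustrative internal domain

  vpc {
    vpc_id = var.network_vpc_id
  }
}
```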

Security: Building Defense in Depth Into the Platform

We treated security hardening not as a separate workstream but as a property of the platform itself. The goal was a state where doing the secure thing was easier than doing the insecure thing, where security was a feature of the infrastructure rather than an audit finding that came later.

Perimeter: CDN and WAF

Almost all externally exposed services were re-published behind a CDN layer with Web Application Firewall protection enabled. This gave us a consistent layer for DDoS mitigation, bot protection, and request filtering, managed centrally rather than service by service. We built a Terraform module for WAF-protected CDN distributions that encodes our standard rule groups, making it easy for teams to adopt the pattern without needing to understand WAF configuration themselves.
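
The heart of that module is a web ACL built from AWS managed rule groups. A trimmed sketch, with a single illustrative rule rather than our full rule set:

```hcl
# Shared edge WAF pattern. CLOUDFRONT-scoped ACLs must be created in
# us-east-1 regardless of where the origin lives.
resource "aws_wafv2_web_acl" "edge" {
  name  = "edge-baseline"
  scope = "CLOUDFRONT"

  default_action {
    allow {}
  }

  rule {
    name     = "common-rules"
    priority = 1

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "common-rules"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "edge-baseline"
    sampled_requests_enabled   = true
  }
}
```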

The organizational benefit of this approach is as important as the technical one. When security controls are in shared modules, they get updated in one place and every consumer benefits. Compare that to a world where each team rolls their own WAF configuration and you are forever chasing inconsistency.

Eliminating Long-Lived Credentials

One of the highest-impact security improvements was also one of the least glamorous: systematically finding and eliminating long-lived AWS credentials. We audited every IAM user, disabled users with non-rotating access keys, and replaced stored pipeline secrets with OIDC-based role assumption across all CI/CD pipelines.

The credential hygiene work was time-consuming and required coordination with a lot of teams, but the risk reduction was significant. Non-rotating credentials that live in pipeline environment variables are a durable, high-value target. They do not expire, they are often overprivileged, and they are frequently forgotten about after the original engineer moves on. Eliminating them removes an entire class of risk.

Organization-Wide Monitoring

GuardDuty and Security Hub were deployed across all accounts using AWS Organizations delegated admin, which allowed us to enforce enrollment at the org level rather than account by account. Findings are aggregated to a central security account, giving the security team a single place to monitor threat detection and compliance posture across the entire environment.
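
The org-level enrollment is only a few resources. A sketch with an illustrative account ID, noting that the two halves run in different accounts:

```hcl
# From the management account: delegate GuardDuty administration.
resource "aws_guardduty_organization_admin_account" "security" {
  admin_account_id = "111111111111" # assumed: central security account
}

# From the delegated admin (security) account: auto-enable everywhere,
# including accounts created in the future.
resource "aws_guardduty_detector" "admin" {
  enable = true
}

resource "aws_guardduty_organization_configuration" "all" {
  detector_id                      = aws_guardduty_detector.admin.id
  auto_enable_organization_members = "ALL"
}
```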

Security Hub aggregates findings across GuardDuty, AWS Config, Inspector, and IAM Access Analyzer. We configured custom Security Hub actions to route high-severity findings into the alerting pipeline. The important thing here is not just having the tools enabled. It is having a defined process for what happens when a finding fires. A GuardDuty alert that goes to a dashboard nobody watches is not a security control.

Fleet Management and Patching

Getting the EC2 fleet under Systems Manager was a prerequisite for consistent patch management. We deployed the SSM agent via State Manager associations so that any new instance in a bootstrapped account gets the agent automatically, with no manual step and no dependency on the AMI being pre-baked correctly. Patch baselines and maintenance windows were configured at the organization level, with account-level overrides available for workloads with specific compliance requirements.
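
The association itself is small; the schedule here is illustrative:

```hcl
# State Manager association using the AWS-managed document to keep the
# SSM agent current on every instance, including ones launched later.
resource "aws_ssm_association" "update_agent" {
  name = "AWS-UpdateSSMAgent"

  targets {
    key    = "InstanceIds"
    values = ["*"] # all managed instances in the account
  }

  schedule_expression = "rate(14 days)"
}
```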

SSM also unlocked the ability to eliminate bastion hosts and inbound SSH/RDP from security groups. Session Manager handles interactive access with full session logging and no firewall exceptions required. This is one of those cases where the more secure approach is also the more operationally convenient one.

Cost Optimization: Making Savings Systematic

Cost optimization is most effective when it is systematic and ongoing rather than a one-time cleanup event. The goal was to build processes that would keep the environment efficient over time, not just make it cheaper on day one.

Automated Instance Scheduling

The single highest-impact cost initiative was deploying automated scheduling to stop and start non-production compute and database instances outside of business hours. The key design decision: rather than managing schedules centrally, we made it self-service via resource tags. Teams opt instances in or out by tagging them. The central infrastructure team manages the scheduler logic, and teams control their own schedules.
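
Opting in is just a tag on the team's own resources. The tag key and value format below are illustrative of the pattern, not our exact schema:

```hcl
variable "ami_id" { type = string }

# A non-production instance opts into scheduling via its own tags;
# the central scheduler reads them and stops/starts accordingly.
resource "aws_instance" "dev_app" {
  ami           = var.ami_id
  instance_type = "t3.large"

  tags = {
    Name     = "dev-app"
    Schedule = "weekdays-0700-1900" # hypothetical schedule tag format
  }
}
```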

This drove strong adoption because it gave teams direct control over their own cost without requiring central team involvement for every change. Teams that own their cost behave differently than teams that just submit invoices.

Storage Modernization

A large portion of the fleet was still using older-generation EBS volume types. Migrating to current-generation volumes is non-disruptive, since volume type modifications happen live without detaching, and the upgrade delivers both cost savings and better baseline performance. This is the kind of improvement that is easy to defer because it is not urgent, which is exactly why it tends to accumulate into significant waste over time.
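
In Terraform this is a one-line change per volume; applying it triggers a live volume modification with no detach:

```hcl
resource "aws_ebs_volume" "data" {
  availability_zone = "us-east-1a" # illustrative
  size              = 500
  type              = "gp3" # was "gp2"; the change applies in place
}
```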

Rightsizing and Waste Elimination

We built a process using Cost Explorer recommendations and CloudWatch utilization metrics to identify underutilized instances and work with workload owners to downsize them. We also audited for common sources of orphaned spend: unattached volumes, unused reserved networking resources, forgotten snapshots, and incomplete multipart uploads in object storage. S3 lifecycle policies to clean up incomplete multipart uploads alone eliminated a surprising amount of ongoing cost. It is one of those line items that grows silently and rarely gets noticed until someone looks.
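
The multipart-upload cleanup is a single lifecycle rule per bucket; the seven-day window here is illustrative:

```hcl
variable "bucket_id" { type = string }

# Abandoned multipart uploads accrue storage charges invisibly; this
# rule deletes their parts a week after the upload stalls.
resource "aws_s3_bucket_lifecycle_configuration" "cleanup" {
  bucket = var.bucket_id

  rule {
    id     = "abort-incomplete-multipart"
    status = "Enabled"

    filter {} # applies to the whole bucket

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}
```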

Key Takeaways

Looking back on a year of this work, a few things stand out as especially important, not just technically but organizationally:

  • Make IaC the path of least resistance. If it is easier to click through the console than to write Terraform, your platform is losing. Invest in making the IaC experience genuinely fast and frictionless with good modules, clear documentation, and low-friction pipelines. Adoption follows convenience.
  • Security controls belong in the platform, not in documentation. GuardDuty enabled by default in every account, OIDC for pipeline credentials, WAF modules that make the secure pattern the easy pattern. These are more durable than any policy document or security review checklist.
  • The hardest migrations are not always the most technically complex ones. Migrating a team out of an autonomous environment they have managed for years is harder socially than technically. Getting teams to trust a shared platform requires demonstrating value early, communicating what they gain, and being willing to adapt the platform based on their feedback.
  • Cost savings are a byproduct of good architecture. Automated scheduling, rightsizing, storage modernization. These all become easier once the environment is well-organized, well-tagged, and observable. Chaos is expensive. Organized environments optimize themselves more easily.
  • Inventory everything before you touch anything. We found surprises at every level during the initial discovery phase, including forgotten accounts, phantom workloads, and credentials nobody remembered creating. Better to find them during discovery than mid-migration.
  • Monitoring is only a control if someone acts on it. Having GuardDuty and Security Hub deployed across every account matters. Having a defined process for what happens when a finding fires matters more. Close the loop.

The platform built over the course of this engagement is now the foundation for the next phase of work: expanding the self-service IaC platform to additional teams, piloting credential-less access patterns for hybrid workloads, and completing DNS centralization by removing remaining site-local servers. The work is never really done, but it is a lot more manageable when the foundation is solid and the environment is working with you rather than against you.

If you are working through a similar transformation or trying to figure out where to start, we would be glad to compare notes.