Terraform @ Scale

Scaling Security, Operations, and the Cloud with Terraform

Mason Huemmer
Jan 16

The Wild West

Service-oriented architecture, or SOA, is an approach to software design popularized in the late 1990s. Without going into too much detail, SOA is a method of building applications from modular components in a distributed system. Microservice architecture is a type of SOA that takes an opinionated approach to defining service boundaries and promoting independent deployability.

As more product teams develop microservices, Terraform will need to scale and grow with the demands of these architectures.

The growing trend has been to develop microservice architectures in the cloud. Because the cloud makes it so easy to stand up and provision resources, inexperienced teams, through no fault of their own, have run the cloud like the wild west: provisioning resources intended only for testing and forgetting to tear them down, or tearing down the wrong resources, like a production server. The cloud is simple to use, but it has an unspoken complexity that threatens many businesses and organizations. The challenge with the cloud is governance and compliance.

Governance and Compliance

The same can be said about Terraform. Because Terraform is so simple to understand and pick up, many have jumped on the bandwagon only to find that as they scale its usage, it becomes more difficult to manage and support. What typically happens is that as your cloud footprint grows, both your IaC solution and your cloud environments become unmanageable. The result is an increase in overall cost, operational instability, and security risk, all of which threaten your business.

Before teams experience these issues firsthand and attempt to design their own governance strategies with security teams, Terraform, in unison with a good CI/CD engine, can solve many of these issues up front and be part of the solution that helps teams safely and quickly deliver their products from the cloud to their customers.

Shifting Left

We need to rethink security and its impact on the organization before we try to address the issues around our cloud adoption.

First, security should impact culture, not the delivery of your products.

Security should be on the mind of every engineer as the product moves through its delivery lifecycle. Engineers are the owners of the services they build, and that includes owning the quality and security of their products. You may have heard that security needs to “shift left.” Many agile teams have rightly included security advocates in their sprint planning and daily tasks. This allows security engineers to no longer stand in opposition to delivery, but to advocate for it, ensuring the safety of the products delivered into the hands of customers. It also ensures products do not have to be refactored to meet audit requirements after they have become generally available. For the operations engineer, this extends to the cloud.

But where does Terraform play a role in security?

Terraform is the vehicle security can use to properly govern and simplify the cloud. It is the vehicle for shifting left, and it is what security engineers should recommend to impact culture, but not delivery.

Below is a detailed outline of how to integrate Terraform with the cloud to meet the demands of security, operations, and the organization, while ensuring development teams can quickly and safely provision infrastructure to ship their products to customers.

Step 1: Lock It Down To Stand It Back Up

Before you can automate, you need to lock down your cloud environment.

Terraform should be the only service allowed to provision infrastructure within the cloud. The days of having engineers provision infrastructure from a portal or GUI should be over.

Replacing a mouse click with a Terraform manifest or configuration takes more time, but only in the beginning. As your team uses Terraform and develops a larger codebase of modules for standing up infrastructure, it will deliver faster and respond more quickly to incidents. This transition will take time. You must crawl before you can walk so you can eventually run.

Where I find this transition most difficult is with the operational teams, not the developers. Central operations or platform teams have the permissions and access to make changes through the portal or GUI. They believe it is easier and faster to make a change in the portal rather than through a Terraform configuration. Most times, it is quicker.

Securing Our Operations Engineers

This is why I recommend you remove their provisioning access first. Then, find the engineers who prefer to work only through Terraform and set up their accounts as break-glass accounts.

A break-glass account is used for emergency purposes only, to gain access to a system or service that is not accessible under normal controls.

If Terraform becomes inoperable, or a change is needed that Terraform cannot make, you will want to ensure static accounts still have access to the cloud. The obvious benefit is that this ensures you do not get locked out of your own cloud environment.

It is recommended to create two to three break-glass accounts with owner access per Azure management group or AWS organization, and to grant these accounts explicit access rather than access through AD groups. For these accounts, create an isolated pipeline that only a handful of engineers can use to update the Terraform configuration that manages them.

When an engineer outside this group needs direct access to the cloud during a major incident, you will want a public “break-glass” pipeline that is accessible to the entire operations team.

Through this pipeline, engineers can add their own account as a resource, and Terraform automatically assigns it “Contributor” or “Owner” access to a specific resource. Once they have completed their task or resolved the incident, the pipeline automatically removes their access after 24 hours.
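
As a minimal sketch of this idea (the variable name and pipeline wiring are illustrative assumptions, not a prescribed design), the break-glass pipeline’s configuration could look like this:

# break-glass.tf (hypothetical example)

variable "break_glass_requests" {
  # map of engineer AAD object IDs to the scope they need, e.g.
  # { "<engineer-object-id>" = "/subscriptions/<sub-id>/resourceGroups/prod-rg" }
  type    = map(string)
  default = {}
}

resource "azurerm_role_assignment" "break_glass" {
  for_each             = var.break_glass_requests
  scope                = each.value
  role_definition_name = "Contributor"
  principal_id         = each.key
}

The 24-hour expiry lives in the pipeline rather than in Terraform itself: a scheduled job removes stale entries from the map and re-applies, which destroys the corresponding role assignments.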

For long-term RBAC access, it is recommended to have Terraform manage the state of your RBAC strategy in the cloud through a pipeline job that requires approval or sign-off from security. This ensures you can use the state file as documentation in an audit of who has access to what in the cloud.

Step 2: Encapsulating Security into Terraform and the Cloud

When designing Terraform to scale, Terraform modules become the first obvious choice to include in your strategy.

Module Encapsulation

Modules are a form of encapsulation. They are repeatable Terraform configurations that can be reused across your organization, and they are what you need to build up your codebase so product teams can quickly stand up infrastructure rather than writing yet another Terraform configuration or manifest.

However, teams often take the same approach with modules as they did with the cloud and Terraform itself. They see a good idea and run with it, rather than slowing down and working out how to incorporate it into a strategy that can support them long term.

This is how we slow down.

We should encapsulate not only the operational requirements for a product (VM size, disk size, or location) but also the security controls (managed identities, private endpoints, TLS 1.2, etc.), so developers can safely deploy infrastructure. This makes sprint planning much easier for the developer, the operations engineer, and the security advocate.
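
As a minimal sketch of this kind of encapsulation (the module layout and variable names here are my own illustration, not a prescribed design), a storage account module might hardcode the security controls while exposing only the operational inputs:

# modules/storage-account/main.tf (illustrative)

variable "name" {
  type = string
}

variable "resource_group" {
  type = string
}

variable "location" {
  type = string
}

resource "azurerm_storage_account" "default" {
  name                     = var.name
  resource_group_name      = var.resource_group
  location                 = var.location
  account_tier             = "Standard"
  account_replication_type = "LRS"

  # security controls baked into the module; consumers cannot weaken them
  min_tls_version           = "TLS1_2"
  enable_https_traffic_only = true
}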

Module Registry

This leads us to the next idea: a module registry. After you have started to build up your codebase, you need a registry that lists the modules available for use within a team’s Terraform workspace. Plenty of platforms offer private registries for hosting Terraform modules, including Terraform Cloud, GitLab, and AWS. You can use an existing solution or design your own. The point is that these modules need to be known to your developers so they can use them.
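
Consuming a module from a private registry then looks something like this (the organization and module names below are hypothetical):

module "storage_account" {
  source  = "app.terraform.io/my-org/storage-account/azurerm" # hypothetical private registry path
  version = "~> 1.0"

  name           = "examplestorage"
  resource_group = "example-rg"
  location       = "centralus"
}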

Governance Policies

If developers do not use the modules, then security controls will not be in place. This is why governance policies are recommended: if a team deploys infrastructure without a module, the governance policy is enforced to ensure that non-compliant infrastructure cannot be stood up.
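
In Azure, for example, such a policy can itself be managed through Terraform. Here is a hedged sketch (the policy name and rule are illustrative) that denies storage accounts that do not enforce HTTPS-only traffic:

# governance.tf (illustrative)

resource "azurerm_policy_definition" "https_only_storage" {
  name         = "require-https-storage"
  policy_type  = "Custom"
  mode         = "All"
  display_name = "Storage accounts must enforce HTTPS-only traffic"

  # deny any storage account that does not enforce HTTPS-only traffic
  policy_rule = <<RULE
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
      { "field": "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly", "notEquals": "true" }
    ]
  },
  "then": { "effect": "deny" }
}
RULE
}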

Using modules will be much faster for developers than designing their own configurations, because a module not only includes the security controls in its design but also meets the governance policy required to stand up the resource. Otherwise, they will spend a lot of time figuring out how to stand up the resource on their own. This creates an incentive for developers to rely on the modules you share.

Just a side note here, but you should educate your developers that security is not the same as compliance. Just because they overlap does not mean they are the same thing. So even if a configuration gets past the governance policies, we want to ensure teams are still using the modules, as the modules include the security controls.

Just because your infrastructure or Terraform configuration is compliant with SOC 2 or Azure/AWS policies does not make your infrastructure secure. This is why you need to rely on both modules and governance policies to properly secure and protect your developers.

Step 3: Let Design Principles Be Your Guide

This step focuses on how you should design your Terraform configurations and modules using principles from Sam Newman’s book Building Microservices.

Coupling and Cohesion

What I would like to present first is the idea of cohesion. As said by Sam Newman, cohesion is “code that changes together, stays together.”

Grouping similar kinds of resources inside a single module creates a strong cohesion that allows you to make changes in as few places as possible.

It is important to also keep in mind the concept of coupling. A change to one service’s infrastructure should not require a change to another unless by design.

Strong cohesion with low-coupling builds strong service boundaries. Where there are weak service boundaries, there is no stability at scale.

Figure 1, as an example, shows each workspace calling a set of shared modules that cross product lines. The products are tightly coupled to the shared modules that provision their infrastructure. If a change is made to Module A, then changes must also be made to each workspace that calls Module A.

Figure 1, Tightly Coupled Infrastructure

It is recommended, then, to design modules around the infrastructure of the supporting services that compose a product. The modules then inherit the same service boundaries as the product’s architecture. This sets clear boundaries around what each module does and who owns the infrastructure for that module. These boundaries allow for independent deployability and confine where most change will occur.

Figure 2 is an example of strong cohesion and low coupling: modules are designed around each supporting service. When a change is made to AlphaGo’s Service Module A, it does not necessitate a change to other products’ workspaces, only to the parent calling the child module. If AlphaGo’s Service Module A supports a specific API, then the API team is responsible for that module. This lowers the divide between product and operations, handing responsibility over to product teams to maintain their own infrastructure.

Figure 2, Loosely Coupled Infrastructure
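
In configuration terms, the layout of Figure 2 looks something like the following, where the paths and names are illustrative:

# products/alphago/workspace.tf (illustrative layout)

module "service_a" {
  # module lives alongside the product and is owned by the Service A (API) team
  source = "./modules/service-a"

  global  = var.global
  alphago = var.alphago
}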

It is important to note that there are still good reasons to design shared modules that cross product lines, or that are tightly coupled, as you will read later on.

Keep in mind that when you design modules to be tightly coupled, it will be difficult to fully implement changes to them when required. Designing them in a way that reduces this kind of change limits the overhead and overall maintenance in the future.

Information Hiding

According to Sam Newman, information hiding is described as the process of “hiding as much information as possible inside a component and exposing as little as possible via external interfaces.”

Information hiding sets clear boundaries around what can be easily changed and what is more difficult to change.

To apply this to Terraform modules: what can be easily changed is the set of encapsulated resources inside the module. What is more difficult to change is the external interface, the declared variable definitions of your module, which outside consumers (i.e., developers, workspaces, parent modules) use to stand up your infrastructure.

Vertical Change

A vertical change is a change made to an external interface (i.e., the variable definitions of your module) to support new functionality that directly impacts consumers. To use the new functionality, consumers must update their own configuration files and the module resource they call. With a vertical change, the more widely a module is shared across an organization, the more difficult it is to support. This can slow down development time and the adoption rate of the new changes. Therefore, in a microservice world, engineers should try to limit these kinds of vertical changes.

Thankfully, we are able to version modules to reduce the impact a vertical change has on consumers. However, there are times when you will have to make a breaking change, and teams will be unable to keep using their existing version.
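
Version pinning is what keeps a breaking release from propagating immediately. A short sketch (the registry path is hypothetical):

module "alphago_aks_cluster" {
  source  = "app.terraform.io/my-org/alphago-aks-cluster/azurerm" # hypothetical registry path
  version = "~> 1.4" # accepts any 1.x release at or above 1.4, never a breaking 2.0

  global  = var.global
  alphago = var.alphago
}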

There are two solutions to this problem: Terraform objects and microservice modules.

Step 4: Object-oriented Terraform

Passing Objects to and from Modules

Objects within Terraform, also called complex data types, allow for the flexibility necessary to support information hiding and keep our configuration files DRY.

The object can be used as your external interface, limiting the amount of vertical change that impacts your module’s consumers.

Instead of defining each variable one at a time, we should define complex data types that encapsulate more than one value (i.e., strings, numbers, booleans, etc.) in a module’s external interface.
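
The examples later in this post declare the interface with type = any for maximum flexibility, but as an alternative sketch (the attribute layout mirrors the AlphaGo example below), the same interface can be declared as an explicit object type so Terraform validates the shape of the input:

variable "alphago" {
  # explicit object type: Terraform rejects inputs that do not match this shape
  type = object({
    resource_group = object({
      name     = string
      location = string
    })
    aks = object({
      name = string
      default_node_pool = object({
        name       = string
        node_count = number
        vm_size    = string
      })
    })
  })
}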

Keep in mind that with this approach, it may be difficult to understand the required inputs when using broad variable constraints like complex data types. It is recommended that you provide a README that clearly explains what is required to run a module, or even a workspace, so you do not lose comprehensibility.

Figure 3 is an example of how to document complex data structures for workspaces and/or modules.

Figure 3, Complex Data Types

If using Terraform Cloud, a README can be imported for each workspace using the version control workflow, and for each module added to the private registry.

In the AlphaGo workspace outlined below, the global object is passed into the workspace through Terraform .tfvars files.

The global and alphago objects are required to provision the resources in the alphago-aks-cluster module.

# workspace.tf

#---------------------------------------------------
# VARIABLES
#---------------------------------------------------

variable "global" {
  type = any
}

#---------------------------------------------------
# MODULES
#---------------------------------------------------

module "alphago_aks_cluster" {
  source = "./module/alphago-aks-cluster"

  # GLOBAL OBJECT
  global = var.global

  # PRODUCT/SERVICE OBJECT
  alphago = {
    resource_group = {
      name     = "my-resource-group"
      location = "centralus"
    }
    aks = {
      name = "my-aks-cluster" # also used as the DNS prefix, so hyphens rather than underscores
      default_node_pool = {
        name       = "default" # node pool names must be short, lowercase alphanumerics
        node_count = 1
        vm_size    = "Standard_DS2_v2"
      }
      addon_profile = {
        azure_policy = {
          enabled = true
        }
        kube_dashboard = {
          enabled = true
        }
      }
    }
  }
}

#---------------------------------------------------
# OUTPUT
#---------------------------------------------------

output "resources" { # surfaces the properties object from each module
  value = {
    "alphago_aks_cluster" = module.alphago_aks_cluster.properties
  }
}

As you can see, objects are passed into the module, allowing developers to set the configurable options the object exposes.

# module.tf

#---------------------------------------------------
# VARIABLES
#---------------------------------------------------

variable "global" {
  type = any
}

variable "alphago" {
  type = any
}

#---------------------------------------------------
# RESOURCE GROUP
#---------------------------------------------------

resource "azurerm_resource_group" "default" {
  name     = var.alphago.resource_group.name
  location = var.alphago.resource_group.location
}

#---------------------------------------------------
# AKS CLUSTER
#---------------------------------------------------

resource "azurerm_kubernetes_cluster" "example" {
  name                = try(var.alphago.aks.name, "default-aks-cluster")
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
  dns_prefix          = try(var.alphago.aks.name, "default-aks-cluster")

  default_node_pool {
    name       = var.alphago.aks.default_node_pool.name
    node_count = try(var.alphago.aks.default_node_pool.node_count, 1)
    vm_size    = try(var.alphago.aks.default_node_pool.vm_size, "Standard_DS2_v2")
  }

  # hardcoded: every cluster from this module uses a SystemAssigned identity
  identity {
    type = "SystemAssigned"
  }

  addon_profile {
    aci_connector_linux {
      enabled = false
    }

    azure_policy {
      enabled = try(var.alphago.aks.addon_profile.azure_policy.enabled, false)
    }

    http_application_routing {
      enabled = false
    }

    kube_dashboard {
      enabled = try(var.alphago.aks.addon_profile.kube_dashboard.enabled, false)
    }

    oms_agent {
      enabled = false
    }
  }
}

#---------------------------------------------------
# OUTPUT
#---------------------------------------------------

output "properties" { # consumed by the workspace's "resources" output
  value = {
    resource_group_id = azurerm_resource_group.default.id
    aks_cluster_id    = azurerm_kubernetes_cluster.example.id
  }
}

Notice that I am not able to set the identity, as it is hardcoded in the module, ensuring every AKS cluster stood up by this module will always use a SystemAssigned identity.

However, consumers are able to set the default node pool, which can impact the performance of the services they own.

You are able to set default values for resource arguments using functions such as try() and coalesce(), or conditional expressions, all referenced in the Terraform documentation. This way a missing input does not keep the module from provisioning resources, yet stays in line with operational standards and guardrails.

One great benefit is that engineers can add additional values, or overwrite existing ones the object already stores, before passing the object on to the module using the merge function from the Terraform documentation.
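
A minimal sketch of that pattern (the local name and tag values are illustrative):

locals {
  # overlay an extra attribute onto the object before handing it to the module
  alphago_with_tags = merge(var.alphago, {
    tags = {
      cost_center = "alphago-platform"
    }
  })
}

The module input then references local.alphago_with_tags instead of var.alphago.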

Future changes that require additional inputs will not force a rework of the module’s external interface, because the complex variable supports encapsulating multiple data types and setting default values.

Step 5: Microservice Modules

You will want to design tightly-coupled modules to help give the business the confidence to deploy infrastructure quickly and safely.

After reading Steps 2 and 3, you may assume tightly-coupled modules are evil or some sort of antipattern. From my perspective, it is only an antipattern when you are not aware of how your code is coupled, or when you are unable to refactor your code if a major change is necessary.

However, this strategy depends on tightly-coupled modules, called micro-modules. The idea is to develop a module for every resource the organization uses. This means creating an abstraction layer above every resource so you can inject shared security controls, operational requirements, and organizational standards.

This means you will have a module for resource groups, virtual machines, virtual machine network interfaces, and virtual machine data disks: any and every resource used by the organization. This is what is meant by building up your codebase.
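
A micro-module can be tiny. This sketch of a resource group micro-module (the layout and tag names are my own illustration) shows the abstraction layer injecting an organizational standard above the raw resource:

# modules/resource-group/main.tf (illustrative micro-module)

variable "name" {
  type = string
}

variable "location" {
  type = string
}

variable "tags" {
  type    = map(string)
  default = {}
}

resource "azurerm_resource_group" "default" {
  name     = var.name
  location = var.location

  # organizational standard injected by the micro-module
  tags = merge(var.tags, { managed_by = "terraform" })
}

output "name" {
  value = azurerm_resource_group.default.name
}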

When teams want to spin up a resource that is not yet available through a module, this is where we plan, build, and deploy the resource module together with security. This is how we shift left with Terraform, the cloud, and the organization.

Final Thoughts: Guardrails Not Fort Knox

This last section is a reminder to trust your engineers.

The solution you are designing should not be to control and lock down everything like Fort Knox. As much as the business loves red tape, try to put it aside and realize there are other solutions to deliver the same confidence the business asks for.

This is where we create guardrails instead of red tape, so your developers can navigate complex systems and platforms they might not have the time to fully understand or know how to operate.

It is important that your engineers know that you trust them to stand up infrastructure and deliver quality products to customers. It will give them the freedom they need to really flourish.

If you try to control it all, you will lose everything.
