AWS: Control Planes and Data Planes

Created by Abdul M Gill in Articles 20/03/2025

Control Planes and Data Planes in AWS

The control plane includes the systems that configure resources running in the data plane. Control planes provide the administrative APIs used to create, read/describe, update, delete, and list (CRUDL) resources. In other words, the control plane allocates and configures the resources, and the data plane runs them.

The data plane consists of the systems used to consume those resources, which is the primary function of the service.

Control planes and data planes are decoupled to improve resiliency, so that a failure in the control plane does not impact the data plane. AWS incorporates this design principle in most of its services to enhance their performance and availability.

Data Plane’s Operation with Impaired Control Plane

The control plane replicates its configuration data to multiple data plane replicas distributed across Regions. This enables the data plane to keep working if the control plane is impaired. The data plane can access and operate on resources that have already been provisioned, whether or not the control plane is available.

For instance, if the ability to deploy and configure a load balancer is lost, we can continue to use pre-deployed load balancers to serve requests based on their current configuration. Similarly, the EC2 control plane is responsible for allocating and reconfiguring instances. The data plane is responsible for running and interacting with existing EC2 instances, and it can do so even when the control plane becomes unavailable.
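The behaviour described above can be pictured with a small simulation. This is a minimal sketch and not how AWS implements it: `ControlPlane`, `DataPlane`, `sync`, and `route` are hypothetical names, and the "configuration" is just a dictionary of routing rules.

```python
class ControlPlaneUnavailable(Exception):
    """Raised when the (simulated) control plane cannot be reached."""


class ControlPlane:
    """Hypothetical control plane: owns the desired configuration."""

    def __init__(self, config):
        self.config = dict(config)
        self.available = True

    def get_config(self):
        if not self.available:
            raise ControlPlaneUnavailable("control plane impaired")
        return dict(self.config)


class DataPlane:
    """Hypothetical data plane: serves requests from its local copy of the config."""

    def __init__(self):
        self.local_config = {}

    def sync(self, control_plane):
        """Try to refresh the local copy; keep the last known copy on failure."""
        try:
            self.local_config = control_plane.get_config()
            return True
        except ControlPlaneUnavailable:
            return False

    def route(self, path):
        """Serve a request using only the local configuration."""
        return self.local_config.get(path, "default-target")


cp = ControlPlane({"/api": "target-group-a"})
dp = DataPlane()
dp.sync(cp)                 # configuration is replicated while everything is healthy

cp.available = False        # control plane impairment begins
synced = dp.sync(cp)        # refresh fails, the last known configuration is kept
target = dp.route("/api")   # the data plane still answers from its local copy
```

The point of the sketch is that `route` never calls the control plane, so an impairment only blocks configuration changes, not request handling.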

Difference Between Control and Data Planes

Control planes are usually more complex than data planes. Hence, failures are more common in control planes. Based on the principles described in the section above, you can generally determine whether you are using the control plane or the data plane from the service and the API action. Following are some examples:

Control Plane Actions:

Launch EC2

Create S3 bucket

Create a Lambda function

Read resource attributes (describe a resource)

Update resource attributes

Update network configuration for an ALB

Delete resource

Data Plane Actions:

Interacting with resource

Running EC2 instance

SSH to EC2

Reading/writing to EBS volume

Putting objects in S3

Answering DNS queries

HTTP/HTTPS/TCP to load balancer

Lambda invocations

Getting item from DynamoDB table

Putting item into DynamoDB table

Performing health checks
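The two lists above can be turned into a small lookup table. The following is a minimal sketch: the operation names are real AWS API operations corresponding to some of the examples, but the `service:Operation` notation, the set names, and the `classify` helper are assumptions made for illustration, and the mapping is deliberately incomplete.

```python
# Operations corresponding to some of the examples listed above.
CONTROL_PLANE_ACTIONS = {
    "ec2:RunInstances",         # launch an EC2 instance
    "ec2:DescribeInstances",    # read resource attributes
    "ec2:TerminateInstances",   # delete a resource
    "s3:CreateBucket",
    "lambda:CreateFunction",
}

DATA_PLANE_ACTIONS = {
    "s3:PutObject",             # putting objects in S3
    "s3:GetObject",
    "lambda:Invoke",            # Lambda invocations
    "dynamodb:GetItem",
    "dynamodb:PutItem",
}


def classify(action):
    """Return 'control', 'data', or 'unknown' for a 'service:Operation' string."""
    if action in CONTROL_PLANE_ACTIONS:
        return "control"
    if action in DATA_PLANE_ACTIONS:
        return "data"
    return "unknown"


for name in ("s3:CreateBucket", "dynamodb:GetItem", "sqs:SendMessage"):
    print(name, "->", classify(name))
```

An explicit table is used instead of guessing from the verb, because a prefix such as `Put` appears on both sides (for example, putting an object versus putting a bucket policy).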

Application Dependencies

An application’s degree of dependency on the control plane varies by use case. Each workload must be assessed individually by thoroughly examining the AWS API calls it issues and analyzing the answers to the following questions:

What is the ratio of control plane and data plane APIs?

Creating resources? (CP)

EC2 autoscaling

EC2 reconfiguring

RDS, Fargate, SageMaker etc. depend on EC2 (CP)

Working with existing resources, reading/writing data only (DP)

How does this ratio contribute to business continuity risk factors?

How to mitigate these risks?

How are failover runbooks impacted to react accordingly?
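The first question can be answered roughly by counting the calls a workload makes. This is a minimal sketch under stated assumptions: the input is a plain list of `service:Operation` strings collected by whatever means you have available, and `CONTROL_PLANE`, `DATA_PLANE`, and `plane_ratio` are hypothetical names with an illustrative, incomplete classification.

```python
from collections import Counter

# Hypothetical classification; extend it with the operations your workload issues.
CONTROL_PLANE = {
    "ec2:RunInstances",
    "autoscaling:SetDesiredCapacity",
    "rds:CreateDBInstance",
}
DATA_PLANE = {
    "s3:GetObject",
    "s3:PutObject",
    "dynamodb:GetItem",
    "dynamodb:PutItem",
}


def plane_ratio(calls):
    """Count control plane vs. data plane calls and compute the control plane share."""
    counts = Counter()
    for call in calls:
        if call in CONTROL_PLANE:
            counts["control"] += 1
        elif call in DATA_PLANE:
            counts["data"] += 1
        else:
            counts["unclassified"] += 1
    classified = counts["control"] + counts["data"]
    share = counts["control"] / classified if classified else 0.0
    return {
        "control": counts["control"],
        "data": counts["data"],
        "unclassified": counts["unclassified"],
        "control_share": share,
    }


observed = [
    "s3:GetObject",
    "s3:PutObject",
    "dynamodb:GetItem",
    "autoscaling:SetDesiredCapacity",
    "sqs:SendMessage",   # not in either set, reported as unclassified
]
summary = plane_ratio(observed)
```

A high `control_share` on the critical path suggests that a control plane impairment would affect the workload directly, which feeds into the remaining questions.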

Control Plane Risk Mitigations

Depending on the application’s exposure to control plane hiccups, the risk mitigation strategy may differ. Here are some examples:

Adjust the compute (VPC, subnets, EC2, load balancers, Lambda, RDS, Fargate, etc.) capacity preemptively. Pre-allocate enough to accommodate predicted load spikes. Scale down to align with your load cycle.
Plan to execute a manual failover from a healthy Region, e.g., promoting an RDS read replica or enabling Lambda triggers.
Make your applications region agnostic and idempotent. That itself may help eliminate regional control plane dependencies.
Eliminate or reduce dependency on control plane.
Avoid depending on Route 53 routing policy updates for failover. Instead, leverage endpoint health checks and/or other CloudWatch metrics to accomplish the same result.
Perform AZ evacuation when a single Availability Zone is impaired and is degrading availability or latency. This is possible with services that offer Availability Zone Independence (AZI), such as Amazon EC2 and EBS, because parts of those services have control planes that are also zonally independent.

Data Plane: Mitigate impact by preventing work from being routed to the impacted Availability Zone, or by stopping work from being done in it.
Control Plane: Update the configuration of resources with control plane actions, both to prevent capacity from being provisioned in the impacted Availability Zone and to stop inter-Availability Zone communication with that Availability Zone.
Use Route 53 Application Recovery Controller (ARC) routing control API calls against a regional endpoint of the ARC cluster.

Periodically test and validate your risk mitigation plans.
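Pre-allocating capacity and AZ evacuation fit together: if each Availability Zone already runs enough capacity to absorb the loss of another, evacuating a zone needs no control plane call to scale up. The arithmetic can be sketched as follows. The sizing rule (peak load divided by the number of surviving zones, rounded up) is a common approach assumed here, not something prescribed by this article, and `instances_per_az` is a hypothetical helper.

```python
import math


def instances_per_az(peak_instances, az_count, az_failures_tolerated=1):
    """Instances to keep running in each AZ so the surviving AZs can carry peak load."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one Availability Zone must survive")
    return math.ceil(peak_instances / surviving)


# Peak load needs 12 instances, spread over 3 AZs, tolerating the loss of 1 AZ.
per_az = instances_per_az(12, 3)   # 12 / 2 surviving AZs = 6 per AZ
total = per_az * 3                 # 18 instances pre-provisioned in total
```

The trade-off is cost: in this example 18 instances run steadily to cover a peak of 12, in exchange for not depending on the control plane during an evacuation.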

Canary Probes to Monitor Control Plane Hiccups

Once the application’s dependency on the control plane is established, with a detailed list of the API calls scoped by the application’s business as usual (BAU) activities, those API calls should be incorporated into the application health checks. For example, a static HTML status page can be replaced with a dynamic page that performs all the required API calls on test/dummy resources (to avoid data corruption) and only then returns an OK (200) status.
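The dynamic health check described above can be sketched as a function that runs a set of probes and reports 200 only when all of them succeed. Everything here is an assumption made for illustration: `run_health_check`, the probe names, and the choice of 503 for the failure case. Real probes would issue the application's control plane calls against dummy resources.

```python
def run_health_check(probes):
    """Run every probe and return (status_code, details).

    `probes` maps a probe name to a zero-argument callable that raises on failure.
    """
    details = {}
    for name, probe in probes.items():
        try:
            probe()
            details[name] = "ok"
        except Exception as exc:  # any failure marks the probe as unhealthy
            details[name] = f"failed: {exc}"
    healthy = all(result == "ok" for result in details.values())
    return (200 if healthy else 503), details


def describe_dummy_resource():
    """Placeholder for a real read-only control plane call."""
    return None


def create_dummy_resource():
    """Placeholder that simulates a control plane hiccup."""
    raise RuntimeError("throttled")


status_ok, _ = run_health_check({"describe": describe_dummy_resource})
status_bad, report = run_health_check(
    {"describe": describe_dummy_resource, "create": create_dummy_resource}
)
```

Running all probes instead of stopping at the first failure keeps the report useful for deciding which control plane dependency is affected.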

Canary Probe Example-1: S3

A canary probe (e.g. Lambda) for S3 control plane actions may consist of the following S3 API operations performed on a dummy resource within a specific region.

CreateBucket:
PUT /v20180820/bucket/ HTTP/1.1
Host: Bucket.s3-control.amazonaws.com
...LocationConstraint...

PutBucketPolicy:
PUT /v20180820/bucket//policy HTTP/1.1
Host: Bucket.s3-control.amazonaws.com ...

PutBucketTagging:
PUT /?tagging HTTP/1.1
Host: Bucket.s3.amazonaws.com ...

. . .
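A probe along these lines could look like the sketch below. It is written against a boto3-style S3 client that is passed in, so it uses the regular S3 operations (`create_bucket`, `put_bucket_tagging`, `delete_bucket`) and not the s3-control endpoints shown above. The function name, the bucket name prefix, and the tag values are hypothetical, and PutBucketPolicy is left out to keep the sketch short.

```python
import uuid


def s3_control_plane_canary(s3, region):
    """Exercise S3 control plane operations on a throwaway bucket.

    `s3` is assumed to be a boto3-style S3 client for `region`.
    Returns (healthy, completed_steps).
    """
    bucket = f"cp-canary-{uuid.uuid4().hex}"   # hypothetical naming scheme
    create_args = {"Bucket": bucket}
    if region != "us-east-1":
        # us-east-1 is the default location and takes no LocationConstraint.
        create_args["CreateBucketConfiguration"] = {"LocationConstraint": region}

    completed = []
    created = False
    healthy = True
    try:
        s3.create_bucket(**create_args)
        created = True
        completed.append("CreateBucket")
        s3.put_bucket_tagging(
            Bucket=bucket,
            Tagging={"TagSet": [{"Key": "purpose", "Value": "canary"}]},
        )
        completed.append("PutBucketTagging")
    except Exception:
        healthy = False

    if created:
        # Always try to clean up the dummy bucket; deletion is itself a probe step.
        try:
            s3.delete_bucket(Bucket=bucket)
            completed.append("DeleteBucket")
        except Exception:
            healthy = False
    return healthy, completed


# Usage (requires boto3 and credentials, so it is shown as a comment only):
#   import boto3
#   ok, steps = s3_control_plane_canary(
#       boto3.client("s3", region_name="eu-west-1"), "eu-west-1")
```

Passing the client in keeps the probe testable with a stub and makes the target Region explicit.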

Canary Probe Example-2: API Gateway

A canary probe for API Gateway actions may consist of the following control plane API operations performed on a dummy resource within a specific region.

CreateRestApi:
Creates a new RestApi resource.
POST /restapis HTTP/1.1
Content-type: application/json

CreateDeployment:
POST /restapis/restapi_id/deployments HTTP/1.1
Content-type: application/json

DeleteDeployment:
DELETE /restapis/restapi_id/deployments/deployment_id HTTP/1.1

DeleteRestApi:
DELETE /restapis/restapi_id HTTP/1.1

. . .
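A reduced version of this probe can be sketched with a boto3-style API Gateway client passed in by the caller. The function name and the API name prefix are hypothetical. CreateDeployment and DeleteDeployment are left out because a deployment needs an API that already has at least one method defined, which would make the sketch considerably longer; a GetRestApi read is used as the middle step instead.

```python
import uuid


def apigw_control_plane_canary(apigw):
    """Create, read, and delete a throwaway REST API.

    `apigw` is assumed to be a boto3-style API Gateway client.
    Returns (healthy, completed_steps).
    """
    completed = []
    api_id = None
    healthy = True
    try:
        response = apigw.create_rest_api(name=f"cp-canary-{uuid.uuid4().hex[:8]}")
        api_id = response["id"]
        completed.append("CreateRestApi")
        apigw.get_rest_api(restApiId=api_id)
        completed.append("GetRestApi")
    except Exception:
        healthy = False

    if api_id is not None:
        # Always try to remove the dummy API; deletion is itself a probe step.
        try:
            apigw.delete_rest_api(restApiId=api_id)
            completed.append("DeleteRestApi")
        except Exception:
            healthy = False
    return healthy, completed


# Usage (requires boto3 and credentials, so it is shown as a comment only):
#   import boto3
#   ok, steps = apigw_control_plane_canary(
#       boto3.client("apigateway", region_name="eu-west-1"))
```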

Conclusion

Combining AWS Health and CloudWatch makes for a sound DR monitoring strategy. A dedicated EventBridge rule tracking the AWS Health event-type codes covered herein, in combination with CloudWatch alarms for the aforementioned metrics, provides useful DR monitoring signals. All these signals should be aggregated in Nobl9 with well-defined error budgets to support data-driven DR failover decisions.