
AWS Cost Optimization

Amazon Web Services (AWS) offers great solutions for infrastructure needs. AWS requires no hardware of your own and lets you focus on product development rather than on infrastructure maintenance and upgrades. However, using AWS at Brainly's scale, where the number of EC2 running hours can exceed 900k each month, can become very expensive, making continuous cost optimization a critical necessity. Let's review some of the lessons we've learned in dealing with this at Brainly.

Choose EC2/RDS Instances Carefully

The first and most obvious rule is to use EC2 and RDS instances that are a good match for their intended purpose. AWS offers a number of EC2 instance types optimized for different scenarios (a programmatic way to compare them follows the list):

  • General Purpose — Good for all-around workloads. The t2 type is a perfect fit for use with Auto Scaling Groups.
  • Compute Optimized — Perfect for compute-heavy operations such as ML inference, media transcoding, and busy web servers.
  • Memory Optimized — Good for memory-hungry solutions like Redis, Cassandra, MongoDB, Spark, etc.
  • Accelerated Computing — Provides access to hardware-based computing accelerators such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs).
  • Storage Optimized — Good for any solution that needs tens of thousands of IOPS, such as log storage or data processing.
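
If you want to sanity-check a type choice programmatically rather than from the pricing pages, newer versions of boto3 expose the hardware specifications directly. A minimal sketch (the region and instance types are just examples, not a recommendation):

```python
# Minimal sketch: compare candidate EC2 instance types side by side.
# Assumes boto3 is installed and AWS credentials are configured.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # example region

resp = ec2.describe_instance_types(
    InstanceTypes=["t2.micro", "c4.large", "r4.large"]  # example types
)
for it in resp["InstanceTypes"]:
    print(
        it["InstanceType"],
        it["VCpuInfo"]["DefaultVCpus"], "vCPU,",
        it["MemoryInfo"]["SizeInMiB"], "MiB RAM,",
        it["NetworkInfo"]["NetworkPerformance"],
    )
```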

Similar to EC2 instances, Amazon provides three dedicated RDS instance classes:

  • Standard DB
  • Memory Optimized DB
  • Burstable Performance DB

Pros: Cost savings and performance tuned to actual needs.
Cons: You need to know your stack before building the infrastructure; proper stack planning is a must.

Use EC2 Auto Scaling

AWS EC2 Auto Scaling monitors application resource usage to provide the best performance at the lowest possible cost by scaling pools of servers as needed. There are several metrics that can be used to scale EC2 instances up or down (Fig. 1).

Fig. 1. Metric types used by AWS ASG

Currently the most-used metric at Brainly is CPU utilization. Our scaling policy measures the average CPU usage across the entire Auto Scaling Group and scales resources accordingly.
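
A minimal sketch of such a policy, expressed with boto3; the group name and the 50% target are illustrative placeholders, not our production values:

```python
# Minimal sketch: target-tracking policy that keeps the ASG's average
# CPU utilization near a target, scaling out and in automatically.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-web-asg",   # hypothetical group name
    PolicyName="avg-cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,             # keep average CPU around 50%
    },
)
```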

Fig. 2 shows five different Auto Scaling Groups over the last five months. The spikes in the graph show our live traffic peaks from November through April. Because there are fewer requests during the winter holidays, fewer instances run during late December and early January.

Fig. 2. Size of five different Auto Scaling Groups over the last five months

In the graph above we are using two types of instances:

  • t2.micro — Perfect for apps that don't require a lot of resources and can handle traffic when scaled horizontally.
  • c4.large — Used to process Gearman Job Server queues. For additional savings we run them as part of an AWS Spot Fleet.

Pros: Saves money and resources.
Cons: Not all applications can use ASGs, so proper planning is a must.

Use EC2 Reservations

EC2 Reserved Instances (RIs) are one of the most popular methods for lowering infrastructure costs. The whole idea behind RIs is to commit to an estimated usage of EC2 instances.

There are three types of EC2 reservations available in AWS (a quick savings estimate follows the list):

  • No Upfront Reservation — No Upfront RIs offer a significant discount (typically about 30%) compared to On-Demand prices. You pay nothing upfront but commit to paying for the Reserved Instance over the course of the term. This option is offered with a one-year term.
  • Partial Upfront Reservation — Partial Upfront RIs offer a higher discount than No Upfront RIs (typically about 60% for a three-year term). You pay for a portion of the Reserved Instance up front, and then pay for the remainder over the course of the one- or three-year term.
  • All Upfront Reservation — All Upfront RIs offer the highest discount of all of the RI payment options (typically about 63% for a three-year term). You pay for the entire Reserved Instance term (one or three years) with one upfront payment and receive the best effective hourly price.
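
To make these tiers concrete, here is the back-of-the-envelope arithmetic, using the ~900k monthly EC2 hours mentioned in the introduction and a purely hypothetical $0.10/h blended On-Demand rate:

```python
# Rough savings estimate. The $0.10/h blended rate is a made-up example;
# the discount percentages are the typical RI tiers listed above.
hours_per_month = 900_000
on_demand_rate = 0.10  # USD/h, hypothetical

monthly_on_demand = hours_per_month * on_demand_rate
for name, discount in [("No Upfront (1y)", 0.30),
                       ("Partial Upfront (3y)", 0.60),
                       ("All Upfront (3y)", 0.63)]:
    monthly = monthly_on_demand * (1 - discount)
    print(f"{name}: ${monthly:,.0f}/month vs ${monthly_on_demand:,.0f} On-Demand")
```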

Amazon allows users to check RI Coverage to help identify instance hours that are not covered by RIs and to highlight opportunities for savings.
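
The same coverage numbers can also be pulled programmatically through the Cost Explorer API; a minimal sketch (the dates are examples, and Cost Explorer must be enabled on the account):

```python
# Minimal sketch: fetch monthly RI coverage via the Cost Explorer API.
import boto3

ce = boto3.client("ce", region_name="us-east-1")  # CE endpoint region

resp = ce.get_reservation_coverage(
    TimePeriod={"Start": "2018-01-01", "End": "2018-04-01"},  # example dates
    Granularity="MONTHLY",
)
for period in resp["CoveragesByTime"]:
    pct = period["Total"]["CoverageHours"]["CoverageHoursPercentage"]
    print(period["TimePeriod"]["Start"], pct, "% of hours covered by RIs")
```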

As described in a previous article, the Brainly platform is heavily based on AWS Auto Scaling Groups. Thus, the number of running EC2 instances is strongly tied to traffic and the number of requests. Fig. 3 shows the RI coverage graph over a ten-month timeframe. The spikes reflect how the number of requests changes between night and day and, as can be seen, after making the proper reservations RI coverage hits almost 100%. Assuming we use No Upfront Reservations, costs can be lowered by about 30%.

Fig. 3. EC2 RI Coverage for 10 months

Pros: Lowers costs by up to 63% compared to On-Demand instances.
Cons: Can be problematic if you choose the Standard Offering Class.

Use RDS Reservations

AWS also provides reservations for RDS instances. This works similarly to EC2 Reservations, with similar savings ratios.
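
As a sketch, the available offerings can be listed with boto3 before committing to a purchase; the engine and instance class below are illustrative:

```python
# Minimal sketch: list matching RDS RI offerings before purchasing one.
import boto3

rds = boto3.client("rds")

offerings = rds.describe_reserved_db_instances_offerings(
    DBInstanceClass="db.r4.large",   # example instance class
    ProductDescription="mysql",      # example engine
    OfferingType="No Upfront",
)
for o in offerings["ReservedDBInstancesOfferings"]:
    years = o["Duration"] // 31536000  # Duration is in seconds
    print(o["ReservedDBInstancesOfferingId"],
          f"{years}y term,", o["FixedPrice"], "USD upfront")
```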

Fig. 4 shows RI coverage across a 12-month timeframe. In this case RI coverage is lower because of dynamic stack changes; more reservations will be made once the infrastructure stabilises.

Fig. 4. RDS RI Coverage for 12 months

Pros: Lowers costs by up to 63% compared to On-Demand instances.
Cons: Can be problematic if the infrastructure is dynamic.

Use Spot Instances

Amazon EC2 Spot Instances optimize costs on AWS and can scale an application's throughput by up to 10x for the same budget. By launching EC2 instances as Spot Instances, you can save up to 90% compared to On-Demand prices.

The whole idea of Spot Instances is to use AWS's unused EC2 capacity. “The hourly price for a Spot Instance (of each instance type in each Availability Zone) is set by Amazon EC2, and adjusted gradually based on the long-term supply of and demand for Spot Instances.”

One of the main differences between Spot and On-Demand instances is that Spot Instances can be stopped at any moment. Amazon EC2 provides a Spot Instance interruption notice, which gives the instance a two-minute warning before it is interrupted. This warning is issued once the Spot price exceeds the declared maximum price for the instance.
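
A minimal sketch of watching for that notice from inside the instance (IMDSv1-style metadata access for brevity; drain_and_exit() is a hypothetical cleanup handler, not an AWS API):

```python
# Minimal sketch: poll the instance metadata service for the Spot
# interruption notice; the endpoint returns 404 until a notice exists.
import time
import requests

URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def drain_and_exit():
    # Hypothetical: deregister from the load balancer, finish in-flight
    # jobs, flush logs, then let the instance terminate.
    pass

while True:
    resp = requests.get(URL, timeout=1)
    if resp.status_code == 200:   # notice present: ~2 minutes remain
        drain_and_exit()
        break
    time.sleep(5)
```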

Fig. 5 shows some of our recent Spot Requests with a max price of $0.05/h, while the On-Demand price stands at $0.113/h. This means that as long as AWS prices the c4.large Spot Instance below $0.05/h, we pay at most about 44% of the On-Demand price. Once the max price is exceeded, the Spot Instance is interrupted after the two-minute warning.

Fig. 5. List of active Spot Requests

While there is a clear price advantage, Spot instances should be used carefully and only in certain specific scenarios.

Pros: Lowers costs by up to 90% compared to On-Demand instances.
Cons: Needs to be planned carefully.

Keep an Eye on New EC2 Generations

Another advantage of using the AWS platform is that Amazon provides regular updates. Besides introducing impressive new ML services, AWS also keeps improving its core service: EC2.

We were running Mesos Agents on c4.2xlarge instances, and our main struggle was the speed of their network connection. Because of the network speed limit, microservices running on the Mesos Agents started throttling. To prevent that, more Mesos instances had to be launched, which automatically increased costs. Fortunately, last year AWS introduced c5, a new generation of EC2 Compute Optimized instances.

The direct comparison in Fig. 6 shows that c5.2xlarge offers 14.51% better performance, faster networking, and 14.29% lower pricing than c4.2xlarge.

Fig. 6. c4.2xlarge vs c5.2xlarge
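
Combining the two deltas from Fig. 6 shows why the upgrade pays off twice over: the cost per unit of work drops by roughly a quarter (assuming performance scales linearly, which is a simplification):

```python
# Cost per unit of work, using the two deltas quoted from Fig. 6.
perf_gain = 0.1451   # c5.2xlarge: 14.51% better performance
price_drop = 0.1429  # c5.2xlarge: 14.29% lower price

relative_cost = (1 - price_drop) / (1 + perf_gain)
print(f"c5 cost per unit of work: {relative_cost:.1%} of c4")
# -> about 74.9%, i.e. roughly a 25% saving per unit of work
```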

Upgrading the Mesos Agents to the c5 family gave us a performance boost, which automatically translates into a lower number of running agent instances while keeping Mesos job throughput at the same level (Fig. 7).

Fig. 7. Number of running Mesos Agents. Black line shows average number of Mesos Agents.

Pros: Improved performance, usually at better pricing.
Cons: Keep an eye on existing reservations; switching generations can be problematic for No Upfront Reservations.

Optimize CloudWatch Costs

Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications running on AWS. The clear advantage of having a large number of metrics is the ability to monitor infrastructure very precisely.

Fig. 8 shows an example of CloudWatch metrics for one Mesos cluster.

Fig. 8. Example Amazon CloudWatch Metrics

Each request to CloudWatch costs money. The more requests, the greater the expenditure. In Brainly’s case, the most important metrics are:

  • RDS-related Metrics
  • CPU Credit Balance for EC2 instances
  • Size of EC2 Auto Scaling Groups
  • Summary costs per AWS Account

Once we started working on cost optimization, we decided to change the frequency of gathering data from CloudWatch. As one example, the request for EC2 Auto Scaling group size was changed from one request every 10 seconds to one every 60 seconds. This frequency is still sufficient to spot any unexpected ASG behaviour and lowers costs significantly.
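
As a sketch, the cheaper polling pattern looks like this with boto3 (the group name is a placeholder):

```python
# Minimal sketch: fetch ASG size once per minute instead of every 10 s.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/AutoScaling",
    MetricName="GroupInServiceInstances",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-web-asg"}],
    StartTime=datetime.utcnow() - timedelta(minutes=10),
    EndTime=datetime.utcnow(),
    Period=60,                 # one datapoint per minute is sufficient
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```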

In January we made about 153M requests to CloudWatch. After optimization we were able to limit that to 46M requests in each of February and March. The number of PutLogEvents calls was also significantly lower. As seen in Fig. 9, these changes cut costs nearly in half.

Fig. 9. Amazon CloudWatch costs before and after optimization.

Pros: It’s good to gather a lot of metrics…
Cons: …but requesting too much information from CloudWatch can be costly.

Plan Logs Carefully

Proper logging is a must at any scale. It allows us to monitor the health of the infrastructure, raise alarms when needed, compare old data with new, and see any improvement or degradation within a specific timeframe.

At Brainly we make extensive use of the ELK Stack. It stores application logs, microservice logs, logs from third-party apps, and so on. ELK is a great and flexible solution for storing any kind of data, but at scale the logging stack needs to be planned carefully.

Fig. 10. Example Metrics from ElasticSearch shown in Grafana

The main goal with the ELK stack is to keep all available logs for two or three months. Some critical metrics must be retained for a minimum of one year, with fast and easy access.

Fig. 11 shows the comparison before and after ELK optimization.

Fig. 11. ELK Stack Before and After Optimization

The main optimizations include:

  • Introducing aggregation for the heaviest metrics (i.e. access logs).
  • Disabling indexing on fields that don't need to be searchable (see the mapping sketch after this list).
  • Additional Elasticsearch configuration tweaks.
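
As a sketch of the second point, an index template can mark a field as stored but not indexed (ES 5.x-style template syntax; the field, template name, and cluster URL are placeholders):

```python
# Minimal sketch: disable indexing on a field that is displayed but never
# searched on. Uses the ES 5.x "template" key; ES 6+ uses "index_patterns".
import requests

template = {
    "template": "access-logs-*",
    "mappings": {
        "log": {
            "properties": {
                # kept in _source for display, but not searchable:
                "user_agent": {"type": "text", "index": False},
                # searched and aggregated on, so it stays indexed:
                "status_code": {"type": "integer"},
            }
        }
    },
}

resp = requests.put(
    "http://elasticsearch.internal:9200/_template/access-logs",  # placeholder
    json=template,
    timeout=10,
)
resp.raise_for_status()
```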

Fig. 12 shows Elasticsearch cluster CPU usage before and after optimization.

Fig. 12. Elasticsearch CPU Usage Before and After Optimization

Pros: It’s good to gather a lot of metrics…
Cons: …but monitoring infrastructure can be costly at a large scale.

Conclusions

In this article I have tried to describe some best practices that can lower AWS costs. There are a number of further ways to optimize AWS cloud costs: preferred pricing, which large customers can negotiate; decisions about adopting AWS-dedicated solutions; ELB vs. ALB at scale; and so on. The only rule is to keep our customers happy without exceeding the company budget. That's why the journey of improvements and savings will never end.

Thanks to Sergey Gerasimenko. 
