Best Practices in AWS Management - Part 1

Introduction to this series

This post is the first in a series on best practices for managing your AWS infrastructure and applications. It is an entry-level guide: it will not make you an expert, but it will give people who are just getting started a foundation of best practices.

Introduction to this post - Best Practices in AWS

The public cloud in general, and AWS in particular, are changing the way systems administrators think about the infrastructure they manage and the applications that run on it. Things are becoming far less permanent and far more ephemeral. We now deal in instances that last until the next deployment instead of servers that are bought every 3-5 years. We purpose-build resources for an application instead of making a new application fit on existing hardware. This requires a new approach and new best practices.

Part 1: Performance

What are we talking about when we are talking about performance? Are we talking about how much CPU or memory is available? Are we talking disk I/O? Are we looking at network bottlenecks? Well, in this context we are talking about all of those things, and none of them. We need to be thinking about the experience of our users and how our application is responding to them. All of the things mentioned above can play into that, but they are no longer the big metrics that old-school sysadmins like myself used to stress over. Our applications are now more spread out, which means we need to spend more time looking at the system as a whole and less time looking at the individual pieces.

We also need to change the way we think about growth and scaling for performance. We no longer need to "buy big" and keep a pile of extra capacity sitting in our data center so that we can respond quickly to scaling needs. Now we buy only what we need at the moment and add more later.
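The "buy what you need now, add more later" mindset can be sketched as a toy scale-out calculation. The function, names, and thresholds below are illustrative assumptions, not an AWS API; real auto scaling would be driven by something like EC2 Auto Scaling policies.

```python
import math

def desired_instance_count(current_rps: float,
                           rps_per_instance: float,
                           headroom: float = 0.2,
                           min_instances: int = 2) -> int:
    """Toy sizing rule: provision just enough instances for the current
    load plus a small headroom, instead of buying big up front."""
    needed = current_rps * (1 + headroom) / rps_per_instance
    return max(min_instances, math.ceil(needed))

# 900 requests/sec, each instance comfortably handling ~250 rps:
print(desired_instance_count(900, 250))  # -> 5, not a rack of idle spares
```

The point is the shape of the decision: size for today's load with modest headroom, and let the platform add capacity when the numbers change.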

General Guidance

  • Use Trusted Advisor. This AWS service can find obvious performance problems such as over-utilized instances, security groups with excessive numbers of rules, and poor cache hit ratios.
  • Plan for performance to scale, not grow. As discussed above, design your application and infrastructure so that they are easy to scale out, for example by adding more EC2 instances or containers.
  • Monitor, monitor, monitor. I am not saying to watch your CPU and memory all the time. Those things should be looked at, but they are just details. You can and should be using tools that monitor your systems and application as a whole. Tools like New Relic can watch your application's usage in real time and show you how it is performing for your end users. That is far more useful than knowing that the CPU on one instance is at 33% utilization.
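To make the whole-system point concrete, here is a minimal sketch of the kind of number a monitoring tool surfaces: latency percentiles over end-user response samples. The sample data and function are illustrative; tools like New Relic or CloudWatch compute this for you at scale.

```python
def percentile(samples, p):
    """Nearest-rank percentile of response-time samples (milliseconds)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # nearest-rank method: the k-th smallest value, k = ceil(p/100 * n)
    k = max(1, -(-len(ordered) * p // 100))  # ceiling division
    return ordered[int(k) - 1]

# Hypothetical end-user response times (ms); note the outlier.
latencies_ms = [80, 95, 100, 110, 120, 450, 90, 85, 105, 4000]
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
```

A median of ~100ms can hide a p95 in the seconds. That tail is what your users actually feel, and it is invisible if you only watch CPU graphs on individual instances.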

Databases

Databases require more thought and planning. First you need to decide whether to use RDS, DynamoDB, or to install and manage your own database server on an EC2 instance. In general I break the decision down like this:

  • Is my data just key/value pairs? If yes, use DynamoDB. If no, keep going.
  • Do I need very high performance or custom settings that require a high level of engineering, management, and/or tuning? If yes, install a database on an EC2 instance and manage it yourself. If not, keep going.
  • If you have reached this point, you probably want to run your database on RDS. This is the easiest, and often the most cost-effective, solution unless you have needs that cannot be met well by RDS.

Once you have made that decision, there are a few other things to consider. First, if you are going to run your own database server on EC2, use Provisioned IOPS volumes and stripe them in RAID-0 for speed. Also, do NOT install a database on an EFS file system; it is simply too slow for that type of I/O load. Finally, think about replication. Do you need it? If so, do you just need Multi-AZ replication? Do you need read replicas? Do you need to replicate to other regions? Each of these options carries a cost, both in performance and in dollars.
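The decision tree above can be sketched as a small function. This is a deliberate simplification of the bullets, not an official rubric; real decisions also weigh cost, compliance, and team expertise.

```python
def choose_database(key_value_only: bool, needs_heavy_tuning: bool) -> str:
    """Simplified version of the decision tree above."""
    if key_value_only:
        # Data is just key/value pairs.
        return "DynamoDB"
    if needs_heavy_tuning:
        # Custom settings or extreme performance needs: run it yourself.
        return "self-managed on EC2"
    # Default case: managed relational database.
    return "RDS"

print(choose_database(key_value_only=False, needs_heavy_tuning=False))  # RDS
```

Most workloads fall through to the last branch, which is why RDS is the common answer.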

Brief Case Study: Educational Software Company

This company was running in two different data centers and also used a public cloud provider. Their infrastructure consisted of application, database, and utility servers running CentOS 6 on ESXi. The databases were MySQL, and the application and utility servers ran a Java-based app with Apache, Tomcat, and a Grails stack.

For just one client, the configuration consisted of 8 servers dedicated to running MySQL, 14 app servers, 2 utility servers, and an NFS server. The performance of this setup was terrible. The average application response time was ~600ms, the average end-user response time was ~4 seconds, and servers were constantly running out of memory, crashing, and needing to be restarted. Additionally, they had no spare capacity or room to grow.

All of this infrastructure was consolidated in AWS. The 2 colocation data centers and the other public cloud provider were all eliminated. The databases were moved to RDS and the application and utility servers were moved to EC2.

For the same customer environment described above, the configuration now consists of 6 RDS instances for the databases, 4 application servers, 1 utility server, and an EFS file system to replace not only the NFS server, but the underlying SAN as well.

The performance improvement was amazing. The application response time went to 80-100ms. The end user response time went to 1-2 seconds. The application servers are no longer running out of memory and crashing. And, not only was all of this performance gained, but costs were cut by almost 50%.

Next time: Part 2 - Security