Thursday, December 2, 2010

Automating EBS Snapshot validation with @fog - Part 1

Background

Two things are very exciting about the new company: I'm getting to use quite a bit of Ruby, and we're entirely hosted on Amazon Web Services. We currently leverage EBS, ELB, EC2, S3 and CloudFront for our environment. The last time I used AWS in a professional setting, they didn't even have Elastic IPs, much less EBS with snapshots and all the nice stuff that makes it viable for a production environment. I did, however, manage to keep abreast of changes using my own personal AWS account.

Fog

Of course the combination of Ruby and AWS really means one thing - Fog. And lots of it.

When EngineYard announced its sponsorship of the project, I dove headlong into the code base and spent what time I could trying to contribute code back. The half-assed GoGrid code in there right now? Sadly, some of it is mine. Time is hard to come by these days. Regardless, I'm no stranger to Fog, so when I had to dive into our environment and start getting it documented and automated, Fog was the first tool I pulled out. And when the challenge of verifying our EBS snapshots came up (we're currently at a little over 700 of them), I had no choice but to automate it.
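If you haven't played with it, getting a Fog connection to AWS is about as painless as it gets. A minimal sketch (the credentials and region here are obviously placeholders):

    require 'rubygems'
    require 'fog'

    # Placeholder credentials - pull these from wherever you keep your secrets
    compute = Fog::Compute.new(
      :provider              => 'AWS',
      :aws_access_key_id     => 'YOUR_ACCESS_KEY',
      :aws_secret_access_key => 'YOUR_SECRET_KEY',
      :region                => 'us-east-1'
    )

    # The model collections are where the fun starts
    puts compute.snapshots.size   # all the snapshots visible to the account
    puts compute.volumes.size
    puts compute.servers.size

Everything in the sketches below assumes a `compute` connection like this one.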

Environment

A little bit about the environment:

  • A total of 9 EBS volumes are snapshotted each day
  • 8 of the EBS volumes are actually RAID0 MySQL data stores across two DB servers (so 4 disks on one, 4 on the other)
  • The remaining EBS volume is a single MySQL data volume
  • Filesystem is XFS and backups are done using the Alestic ec2-consistent-snapshot script (which currently doesn't support tags)

The end goal is to establish a rolling set of validated snapshots: 7 daily, 3 weekly, 2 monthly. Fun!
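Since ec2-consistent-snapshot can't tag anything yet, the grouping has to be reconstructed from the snapshots themselves. Here's a rough sketch of how I'm thinking about it, keyed off a hard-coded map of which volumes belong to which array (the group names and volume IDs are made up for illustration, and the real script will probably lean on the snapshot descriptions as well):

    # With no tags to lean on, I have to know up front which volumes
    # belong to which array. These IDs are purely illustrative.
    VOLUME_GROUPS = {
      'db1-raid0'  => %w[vol-11111111 vol-22222222 vol-33333333 vol-44444444],
      'db2-raid0'  => %w[vol-55555555 vol-66666666 vol-77777777 vol-88888888],
      'db3-single' => %w[vol-99999999]
    }

    # Completed snapshots started within roughly the last day
    recent = compute.snapshots.select do |snap|
      snap.state == 'completed' && snap.created_at > (Time.now - 86400)
    end

    # Map each group name to the snapshots taken of its member volumes;
    # the RAID0 members have to be restored together as a set
    snapshot_groups = VOLUME_GROUPS.inject({}) do |groups, (name, volume_ids)|
      groups[name] = recent.select { |snap| volume_ids.include?(snap.volume_id) }
      groups
    end

    snapshot_groups.each do |name, snaps|
      puts "#{name}: #{snaps.map { |s| s.id }.join(', ')}"
    end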

Mapping It Out

Here was the attack plan I came up with (rough Fog sketches of a few of these steps follow the list):

  • Identify snapshots and groupings where appropriate (RAID0, remember?)
  • Create volumes from the snapshots
  • Create an m1.xlarge EC2 instance to test the snapshots
  • Attach the volume groups to the test instance
  • Assemble the array on the test instance
  • Start MySQL using the snapshotted data directory
  • Run some validation queries using some timestamp columns in our schema
  • Stop MySQL, unmount the volume, stop the array
  • Detach and destroy the volumes from the test instance
  • Tag the snapshots as "verified"
  • Roll off any old snapshots based on retention policy
  • Automate all of the above!
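To give a flavor of where this is headed, here's a rough sketch of the restore path for a single group - volumes from snapshots, a throwaway m1.xlarge, and attachments - continuing from the `compute` connection and `snapshot_groups` above. The AMI, key name and device names are placeholders, and the real scripts wrap all of this in sanity checks:

    # Boot a throwaway instance to do the verification on
    server = compute.servers.create(
      :image_id  => 'ami-xxxxxxxx',      # placeholder AMI
      :flavor_id => 'm1.xlarge',
      :key_name  => 'verification-key'   # placeholder key pair
    )
    server.wait_for { ready? }

    # Turn each snapshot in the group into a volume and attach it
    devices = %w[/dev/sdf /dev/sdg /dev/sdh /dev/sdi]
    volumes = []
    snapshot_groups['db1-raid0'].each_with_index do |snap, i|
      volume = compute.volumes.create(
        :snapshot_id       => snap.id,
        :size              => snap.volume_size,
        :availability_zone => server.availability_zone
      )
      volume.wait_for { state == 'available' }
      compute.attach_volume(server.id, volume.id, devices[i])
      volumes << volume
    end

From there it's assembling the array, mounting the XFS filesystem and pointing MySQL at the restored data directory, which is where the interesting validation work happens - that's part 2 material.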
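Once a group checks out, tagging and cleanup are the easy part. Another sketch, with the roll-off reduced to a flat placeholder cutoff (the real logic has to honor the 7 daily / 3 weekly / 2 monthly policy):

    # Mark every snapshot in the verified group
    snapshot_groups['db1-raid0'].each do |snap|
      compute.tags.create(
        :resource_id => snap.id,
        :key         => 'verified',
        :value       => Time.now.utc.to_s
      )
    end

    # Tear down the scratch resources
    volumes.each do |volume|
      compute.detach_volume(volume.id)
      volume.wait_for { state == 'available' }
      volume.destroy
    end
    server.destroy

    # Roll off old snapshots for our volumes only (flat 90-day cutoff as a
    # stand-in for the real retention policy)
    our_volumes = VOLUME_GROUPS.values.flatten
    cutoff      = Time.now - (90 * 86400)
    compute.snapshots.each do |snap|
      next unless our_volumes.include?(snap.volume_id)
      snap.destroy if snap.created_at < cutoff
    end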

I've got lots of code samples and screenshots, so I'm breaking this up into multiple posts. Hopefully part 2 will be up some time tomorrow.
