- Posted by Josh Quint, Lead Systems Architect
- AWS, DevOps, Perspective, Strategy
Growing up, I often heard the modified adage: “To err is to be human, to really screw things up requires a computer.”
It was repeated quite often, especially by the previous generation. Today, in the world of DevOps and Continuous Integration/Deployment, another line can be added: “To err is to be human, to really screw things up requires a computer, to screw up all things all at once in a coordinated effort is DevOps.”
As a seasoned SysAdmin, I find the DevOps method of managing infrastructure amazing: it allows large systems and environments to be managed very efficiently. However, the same efficiency that keeps things running smoothly also makes it easy for them to go horribly wrong.
This was very publicly brought to light recently by the Amazon S3 outage in the US-EAST-1 region on February 28, 2017. According to the AWS summary:
“…an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
Imagine the turmoil inside that individual’s head minutes after they hit “Enter” on that command. We’ve all been there at some scale. Thoughts likely ranged from “Seriously? I know I typed that right, I’ve done it XX number of times!”
to “CTRL-C, CTRL-X, CTRL-Z, ESC, ESC, ESC!”
to “So, why are there no limits on that script to prevent this?!”
and I’m sure: “File-> Open-> Resume.rtf.”
This event inspired Turing Group to self-check our own DevOps infrastructure management practices. Could something like a simple missing input flag cause a massive outage for one or more of our clients? Do we need to re-architect our configuration management to prevent something like this?
Turing Group manages the infrastructure of our AWS and Linux-based environments with Ansible. We store all of the infrastructure “code” in git repositories and collaborate on it there. This gives us insight into what changes have been made, how we can roll them back if necessary, and, of course, the ability to find out who made them (not to place blame, but to verify the use case, etc.)!
Do our DevOps tools offer anything in the way of disaster prevention and sanity checks? Consider the following command we use in Ansible for regular maintenance on an environment:
/path/to/client/playbook/$ ansible-playbook site.yml --tags=maintenance --limit=stage
This command runs only the tasks tagged as maintenance, and only in the “Stage” environment. Once the changes are vetted, we can change the --limit value to prod and apply them to Production. However, what if we mistype the tag?
/path/to/client/playbook/$ ansible-playbook site.yml --tags=mintenance --limit=stage
Since there are no tasks tagged as mintenance, nothing happens. Things get a bit more interesting if we forget the --tags option altogether:
/path/to/client/playbook/$ ansible-playbook site.yml --limit=stage
This will run the entire playbook against the Staging environment, running all tasks and resetting all configuration to baseline. But, is this actually a problem? Not if you are practicing DevOps properly.
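A mistyped tag fails silently, so one option is to compare the requested tag against a list of known tags before running anything. The following is a minimal sketch, not something from our playbooks: tag_is_known is a hypothetical helper name, and the list of valid tags is assumed to be maintained by hand (ansible-playbook's --list-tags option can be used to inspect the tags a playbook defines).

```shell
# Hypothetical helper: succeed only if the requested tag is in a known list.
# Usage: tag_is_known <requested-tag> <known-tag>...
tag_is_known() {
  local wanted="$1"
  shift
  local known
  for known in "$@"; do
    if [ "$known" = "$wanted" ]; then
      return 0
    fi
  done
  # No match: report the problem instead of silently doing nothing.
  echo "unknown tag: $wanted" >&2
  return 1
}
```

With a check like this in front of the real command, the mintenance typo would produce an error message rather than a run that quietly does nothing.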
If you are practicing proper “Safe DevOps”, you have not made any changes to the environment outside of the Ansible scripts. Resetting the environment to “baseline” shouldn’t be an issue, as it should never have left baseline in the first place. True, some package updates might be triggered, but you were doing that anyway with the maintenance tag.
The same is true if you really screw up the command and forget the limit as well:
/path/to/client/playbook/$ ansible-playbook site.yml
This runs all tasks on all environments, including Production. Again, the baseline reset shouldn’t be an issue. You do run the risk of having package updates applied to Production without vetting them in Stage. However, we structure our playbook so that tasks are run sequentially through the environments. So, as long as your DevOps Admin is attentive, they should notice that they ran the entire playbook while the tasks are still applying to Stage, and CTRL-C the command before it gets to the Production part of the playbook.
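Rather than relying on an attentive admin, a thin wrapper can refuse to run when the flags are missing. This is only a sketch under stated assumptions: require_tags_and_limit is a hypothetical function name, it only inspects its arguments, and it recognizes the long options used above plus ansible-playbook's short forms -t and -l.

```shell
# Hypothetical guard: refuse to proceed unless both a tag and a limit are given.
# It only inspects the arguments; it never invokes ansible-playbook itself.
require_tags_and_limit() {
  local has_tags=0 has_limit=0 arg
  for arg in "$@"; do
    case "$arg" in
      --tags|--tags=*|-t) has_tags=1 ;;
      --limit|--limit=*|-l) has_limit=1 ;;
    esac
  done
  if [ "$has_tags" -eq 0 ]; then
    echo "refusing to run: no --tags given (the whole playbook would run)" >&2
    return 1
  fi
  if [ "$has_limit" -eq 0 ]; then
    echo "refusing to run: no --limit given (all environments would be targeted)" >&2
    return 1
  fi
  return 0
}
```

A wrapper script could then call something like `require_tags_and_limit "$@" && ansible-playbook "$@"`, with a deliberate override for the occasions when a full baseline run is actually intended.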
What happens if we’re not even in the correct playbook?
/path/to/wrong/playbook/$ ansible-playbook site.yml --tags=maintenance --limit=stage
Then a different environment than we intended gets the maintenance. However, it is maintenance designed for that environment, since each playbook is custom written for its environment. This might not be ideal, but it is not likely to be catastrophic.
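A cheap sanity check for the wrong-directory case is to confirm where you are before running anything. The sketch below assumes each client playbook lives in its own, distinctly named directory; in_expected_playbook is a hypothetical helper name. Separately, ansible-playbook's --list-hosts option prints the hosts a run would match without executing any tasks, which gives a second chance to spot the wrong environment.

```shell
# Hypothetical check: succeed only if the current directory has the expected name.
# Usage: in_expected_playbook <expected-directory-name>
in_expected_playbook() {
  local expected="$1"
  local current
  current="$(basename "$PWD")"
  if [ "$current" != "$expected" ]; then
    echo "in '$current', expected '$expected'" >&2
    return 1
  fi
  return 0
}
```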
Ansible is not without its pitfalls. Consider ad-hoc commands:
/path/to/client/playbook/$ ansible stage -a 'rm -rf /'
Any Linux admin knows the old rm -rf / command; it frees up disk space like no other! Coupled with DevOps, it frees up disk space on ALL the Stage servers at the same time. However, if you’re practicing “Safe DevOps,” then this command, no matter how tempting, should never enter into the picture.
In the case of the AWS S3 outage, it sounds like there were some failsafes built into the script. The compounding factor in their case was, according to the summary, that “we have not completely restarted the index subsystem … for many years.”
This is a testament to how well things do run at AWS, as this crippling mistake has not happened in several years of operation. Even if you are practicing “Safe DevOps” there may be a need to make ad-hoc changes, especially during on-the-fly troubleshooting of an issue. In light of this, it may actually be beneficial to make a “mistake” when running the playbook, and return the environment to baseline. This must be done in a regular and controlled manner (think Stage first, then Production), and if it is done on a more regular basis than “every few years,” the potential impact will be significantly less than catastrophic.
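The “Stage first, then Production” idea can be sketched as a loop that stops at the first failure. This is an illustration under assumptions, not our actual tooling: staged_rollout is a hypothetical function, and its first argument is any command that accepts an environment name (in practice it might wrap a call such as `ansible-playbook site.yml --limit=<environment>`).

```shell
# Hypothetical sketch: apply a change to each environment in order,
# stopping at the first failure so later environments are left untouched.
# Usage: staged_rollout <runner-command> <environment>...
staged_rollout() {
  local runner="$1"
  shift
  local target
  for target in "$@"; do
    echo "applying to $target"
    if ! "$runner" "$target"; then
      echo "stopped: $target failed, later environments untouched" >&2
      return 1
    fi
  done
  return 0
}
```

Calling `staged_rollout my_runner stage prod` (where my_runner is whatever wrapper you use) would only reach prod after stage succeeds, which removes the dependence on someone hitting CTRL-C in time.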
Overall, “Safe DevOps” means that ALL changes to the environment should be made through the configuration and change management system. That system needs to be architected to withstand the occasional human error. Changes should be rolled up through Testing and Staging environments, even when there’s a live Production problem. It always seems quicker to make fixes ad-hoc, but eventually, the environment will be “re-based” and any time saved via the ad-hoc changes will be lost by orders of magnitude if the re-base blows away years of patch-and-fix configs. Also, don’t fear the re-base, but use it in a controlled test to ensure that your environment is consistent.