[cs615asa] Meetup Summary
Dakota Crouchelli
dcrouche at stevens.edu
Tue May 1 01:18:41 EDT 2018
https://www.meetup.com/ContinuousDeliveryNYC/events/248071774/
This talk was led by Tyler Lund from Audible, who described in detail
what he referred to as "Chaos Engineering": essentially, a formalized
practice of breaking a system in bizarre ways in order to test its
resilience. I picked this talk because I was looking for something light
on technical detail where I could soak in some best practices for
systems administration and systems engineering.
The thesis of the talk boiled down to this: complex systems are
inherently chaotic, especially when they are public facing and
constantly evolving. There is no way to account for every gap and edge
case in a major production system (like Audible's). Believing that you
have accounted for everything only means your users will find the chaos
in your system before you do. Once a project scales large enough, it is
always worth dedicating time and resources to breaking it in unexpected
ways and observing the results.
Tyler went on to give a set of steps to guide a future Chaos Engineering
team, warning that it is difficult to employ this practice without
annoying other employees.
- Get support from others; an opt-in model is recommended for the
initial stages. No one likes having their system brought down on purpose
without their permission. A new chaos engineering team should start with
non-critical services and only gradually move toward causing chaos
without warning.
- Make sure a capable monitoring system is already in place. Breaking a
system is not useful if you cannot then harvest details on the effects.
Focus on key business metrics, and use visual dashboards if you can (a
minimal steady-state check is sketched after this list).
- Take inspiration from recent incidents to find new ways of causing
chaos. Make sure that subsystems restart gracefully after being brought
down, with no lingering effects (see the restart check sketched below).
- When you have enough confidence in the smaller aspects of your system,
move on to causing large cascading failures that affect multiple areas.
Start forming hypotheses about what you believe will go wrong, and see
how accurate your predictions are (a small experiment harness is
sketched below).
- Think about the grey areas of network issues: add latency,
intentionally bring down external services or dependencies, throw
uncommon exceptions that may have been forgotten about, and simulate
packet loss (see the tc/netem sketch below).
- If the Chaos Engineering endeavour grows enough, consider adopting
more precise tools to help you. There are products like Gremlin
dedicated to simulating bizarre failures. Use services like AWS to spin
up experimental clusters and compare how the system performs under
various loads (a throwaway-instance sketch closes out the examples
below).
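To make these steps concrete, here are a few Python sketches of my own;
none of this code is from the talk, and every name, endpoint, and number
in them is made up. First, a minimal steady-state check for the
monitoring bullet: poll a (hypothetical) metrics endpoint and verify
that a key business metric stays above an agreed floor.

    import json
    import urllib.request

    METRICS_URL = "http://metrics.internal/api/orders-per-minute"  # hypothetical endpoint

    def steady_state_ok(floor=50.0):
        # The system counts as "healthy" if the key business metric
        # (here, orders per minute) stays above the agreed floor.
        with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
            value = json.load(resp)["value"]
        return value >= floor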
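Second, a sketch for the graceful-restart bullet, assuming a systemd
unit and an HTTP health endpoint (both names are placeholders): kill the
service hard, then confirm it comes back on its own.

    import subprocess
    import time
    import urllib.request

    SERVICE = "catalog.service"               # hypothetical systemd unit
    HEALTH = "http://localhost:8080/healthz"  # hypothetical health endpoint

    def healthy():
        try:
            with urllib.request.urlopen(HEALTH, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    # Bring the service down hard, then wait for it to recover cleanly.
    subprocess.run(["systemctl", "kill", "-s", "SIGKILL", SERVICE], check=True)
    deadline = time.time() + 60
    while time.time() < deadline:
        if healthy():
            print("service recovered")
            break
        time.sleep(2)
    else:
        # The while-loop ran out without hitting break.
        print("service did NOT recover within 60 seconds")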
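Third, the hypothesis-driven loop from the cascading-failures bullet,
reduced to its skeleton: record a falsifiable prediction, inject the
fault, check the steady state, and always roll back.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Experiment:
        hypothesis: str                   # what we predict will happen
        inject: Callable[[], None]        # cause the failure
        rollback: Callable[[], None]      # undo it
        steady_state: Callable[[], bool]  # is the key metric still healthy?

    def run(exp):
        # Refuse to experiment on a system that is already unhealthy.
        assert exp.steady_state(), "system unhealthy before injection"
        print("hypothesis:", exp.hypothesis)
        exp.inject()
        try:
            print("hypothesis held" if exp.steady_state() else "hypothesis violated")
        finally:
            exp.rollback()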
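Fourth, for the network grey areas: on Linux, tc's netem discipline can
add latency and packet loss on an interface. A sketch (requires root;
the interface name and the numbers are placeholders):

    import subprocess

    IFACE = "eth0"  # placeholder interface

    def degrade(delay_ms=200, loss_pct=5):
        # Add artificial latency and packet loss to all egress traffic.
        subprocess.run(
            ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
             "delay", "%dms" % delay_ms, "loss", "%d%%" % loss_pct],
            check=True)

    def restore():
        # Remove the netem qdisc, returning the interface to normal.
        subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)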
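Finally, spinning up a throwaway instance to experiment against, using
boto3 (the AMI id, instance type, and tag are placeholders, and the
instance should of course be terminated afterwards):

    import boto3

    ec2 = boto3.client("ec2")
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "purpose", "Value": "chaos-experiment"}],
        }],
    )
    print("launched", resp["Instances"][0]["InstanceId"])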
Eventually, a chaos engineering team could be employed to run black-box
testing like this constantly, in production, and without any warning. A
practice like this gives confidence that the strange ways a system can
fail are being discovered in-house before they are found by users.