[cs615asa] Meetup Summary
Dakota Crouchelli
dcrouche at stevens.edu
Tue May 1 01:18:41 EDT 2018
https://www.meetup.com/ContinuousDeliveryNYC/events/248071774/
This talk was led by Tyler Lund from Audible, who described in detail
what he referred to as "Chaos Engineering": essentially, a formalized
practice of breaking a system in bizarre ways in order to test its
resilience. I picked this talk because I was looking for something light
on technical detail where I could soak in some best practices for
systems administration and systems engineering.
The thesis of the talk boiled down to this: complex systems are
inherently chaotic, especially when they are public facing and
constantly evolving. There is no way to account for every gap and edge
case in a major production system (like Audible's). Believing that you
have accounted for everything only means your users will find the chaos
in your system before you do. Once a project scales large enough, it is
always worth dedicating time and resources to breaking it in unexpected
ways and observing the results.
Tyler went on to give a set of steps to guide a future Chaos Engineering
team, warning that it is difficult to employ this practice without
annoying other employees.
- Get support from others; an opt-in model is recommended for the
initial stages. No one likes having their system brought down on purpose
without their permission. A new chaos engineering team should start with
non-critical services and only gradually move toward causing chaos
without warning.
- Make sure a capable monitoring system is already in place. Breaking a
system is not useful if you cannot then harvest details on the effects.
Focus on key business metrics, and use visual dashboards if you can (a
minimal steady-state check is sketched after this list).
- Take inspiration from recent incidents to find new ways of causing
chaos. Make sure that subsystems restart gracefully after being brought
down, with no lingering effects (see the restart check sketched below).
- When you have enough confidence in the smaller aspects of your system,
move on to causing large cascading failures that affect multiple areas.
Start forming hypotheses about what you believe will go wrong, and see
how accurate your predictions are (a small experiment harness is
sketched below).
- Think about the grey areas of network issues: add latency,
intentionally bring down external services or dependencies, throw
uncommon exceptions that may have been forgotten about, and simulate
packet loss (see the tc/netem sketch below).
- If the Chaos Engineering endeavour grows enough, consider adopting
more precise tools to help you. There are products like Gremlin
dedicated to simulating bizarre failures. Use services like AWS to spin
up experimental clusters and compare how the system performs under
various loads (a throwaway-instance sketch closes out the examples
below).
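To make these steps concrete, here are a few Python sketches of my own;
none of this code is from the talk, and every name, endpoint, and number
in them is made up. First, a minimal steady-state check for the
monitoring bullet: poll a (hypothetical) metrics endpoint and verify
that a key business metric stays above an agreed floor.

    import json
    import urllib.request

    METRICS_URL = "http://metrics.internal/api/orders-per-minute"  # hypothetical endpoint

    def steady_state_ok(floor=50.0):
        # The system counts as "healthy" if the key business metric
        # (here, orders per minute) stays above the agreed floor.
        with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
            value = json.load(resp)["value"]
        return value >= floor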
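Second, a sketch for the graceful-restart bullet, assuming a systemd
unit and an HTTP health endpoint (both names are placeholders): kill the
service hard, then confirm it comes back on its own.

    import subprocess
    import time
    import urllib.request

    SERVICE = "catalog.service"               # hypothetical systemd unit
    HEALTH = "http://localhost:8080/healthz"  # hypothetical health endpoint

    def healthy():
        try:
            with urllib.request.urlopen(HEALTH, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    # Bring the service down hard, then wait for it to recover cleanly.
    subprocess.run(["systemctl", "kill", "-s", "SIGKILL", SERVICE], check=True)
    deadline = time.time() + 60
    while time.time() < deadline:
        if healthy():
            print("service recovered")
            break
        time.sleep(2)
    else:
        # The while-loop ran out without hitting break.
        print("service did NOT recover within 60 seconds")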
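Third, the hypothesis-driven loop from the cascading-failures bullet,
reduced to its skeleton: record a falsifiable prediction, inject the
fault, check the steady state, and always roll back.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Experiment:
        hypothesis: str                   # what we predict will happen
        inject: Callable[[], None]        # cause the failure
        rollback: Callable[[], None]      # undo it
        steady_state: Callable[[], bool]  # is the key metric still healthy?

    def run(exp):
        # Refuse to experiment on a system that is already unhealthy.
        assert exp.steady_state(), "system unhealthy before injection"
        print("hypothesis:", exp.hypothesis)
        exp.inject()
        try:
            print("hypothesis held" if exp.steady_state() else "hypothesis violated")
        finally:
            exp.rollback()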
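Fourth, for the network grey areas: on Linux, tc's netem discipline can
add latency and packet loss on an interface. A sketch (requires root;
the interface name and the numbers are placeholders):

    import subprocess

    IFACE = "eth0"  # placeholder interface

    def degrade(delay_ms=200, loss_pct=5):
        # Add artificial latency and packet loss to all egress traffic.
        subprocess.run(
            ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
             "delay", "%dms" % delay_ms, "loss", "%d%%" % loss_pct],
            check=True)

    def restore():
        # Remove the netem qdisc, returning the interface to normal.
        subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)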
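Finally, spinning up a throwaway instance to experiment against, using
boto3 (the AMI id, instance type, and tag are placeholders, and the
instance should of course be terminated afterwards):

    import boto3

    ec2 = boto3.client("ec2")
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "purpose", "Value": "chaos-experiment"}],
        }],
    )
    print("launched", resp["Instances"][0]["InstanceId"])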
Eventually, a chaos engineering team could be employed to run black-box
testing like this constantly, in production, and without any warning. A
practice like this gives confidence that the strange ways a system can
fail are being discovered in-house before they are found by users.