Timothy Williams
March 17, 2016


There was another SRE Tech Talk this month. They had two speakers: a data
scientist from Datadog, and an SRE Manager from Google.

The Datadog speaker covered some methods for identifying outliers and
anomalies in whatever metrics you use, and how those can be used for
monitoring and alerting. He defined outliers generally as values outside
expected ranges. Anomalies were time dependent: a value that did not fit
the usual pattern of the data at that point in time.
He specifically mentioned two algorithms, whose names now escape me, but
neither involved complex statistics. One algorithm was used for alerting
when the values themselves were what mattered, and another for when time
was an important factor.
He then briefly covered some features they had under development, such as
auto-correlation of data.
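
To make the outlier/anomaly distinction concrete, here is a minimal Python sketch. These are not the algorithms from the talk (I don't remember those); the median-absolute-deviation check and the same-phase comparison are stand-ins I picked, and the function names, the 3.5 threshold, and the sample data are all assumptions for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Indices of values far from the median, ignoring time order.

    Uses a modified z-score based on the median absolute deviation (MAD).
    The 3.5 threshold is a common rule of thumb, assumed here.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread at all: anything different from the median stands out.
        return [i for i, v in enumerate(values) if v != med]
    # 0.6745 rescales MAD so scores are comparable to z-scores.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


def seasonal_anomalies(values, period, tolerance):
    """Indices whose value differs from the same phase in other cycles.

    A value can sit well inside the overall range and still be flagged
    because it is unusual for that point in the repeating pattern.
    """
    flagged = []
    for i, v in enumerate(values):
        peers = [values[j] for j in range(i % period, len(values), period)
                 if j != i]
        if peers and abs(v - median(peers)) > tolerance:
            flagged.append(i)
    return flagged


# Hypothetical metric with a repeating 4-sample pattern; the sample at
# index 6 should be near 30 but reads 10, a perfectly ordinary value
# at other points in the cycle.
series = [10, 20, 30, 20] * 4
series[6] = 10

range_based = mad_outliers(series)
time_based = seasonal_anomalies(series, period=4, tolerance=5)
```

On this made-up series the range-based check finds nothing, because 10 is a normal value overall, while the time-aware check flags index 6. That is the distinction the speaker was drawing.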

The Google speaker covered SLIs, SLOs, and SLAs (Service Level Indicators,
Objectives, and Agreements). It was a solid introduction to how an SRE team
thinks about SLAs, specifically the measurement and the error budget. He
discussed what the engineer's role was in defining the SL*s, a process that
involved a lot of sales people, customers, and lawyers.
It was interesting to hear his approach to impossible SLAs; he said that
the right approach was to measure what they were currently capable of and
present that to the sales people. He would then see what it would take to
move those capabilities to match the desired SLO (a certain latency,
availability, or some other metric). It wouldn't always be practical, but
he suggested starting to move the system toward meeting that SLO anyway.
The work might not land on a timeframe helpful for that customer, but he
treated the request as an indicator of what future demands might be.
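
The error budget idea is easy to put into numbers. The sketch below is my own arithmetic, not something shown in the talk: the function names, the 30-day window, and the 99.5% and 99.9% figures are hypothetical, chosen only to show how a measured availability compares against a requested SLO.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability in the window for an availability SLO.

    `slo` is a fraction, e.g. 0.999 for "three nines".
    """
    return (1 - slo) * window_days * 24 * 60


def budget_gap_minutes(measured, requested, window_days=30):
    """How much downtime would have to be eliminated to meet the request.

    Positive means the system currently spends more downtime than the
    requested SLO would allow.
    """
    return (error_budget_minutes(measured, window_days)
            - error_budget_minutes(requested, window_days))


# Hypothetical numbers: the system measures at 99.5% availability today,
# and a customer is asking for 99.9%.
current_budget = error_budget_minutes(0.995)    # about 216 minutes per 30 days
requested_budget = error_budget_minutes(0.999)  # about 43 minutes per 30 days
gap = budget_gap_minutes(0.995, 0.999)          # downtime to engineer away
```

Presenting the measured number first, then the gap, is the conversation the speaker described having with sales: here is what we can do today, and here is how far that is from what the customer wants.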

It was a solid set of talks, and there were a lot of good people to talk to.
