[cs615asa] Google's SRE Talk for March

Wed Mar 23 20:30:42 EDT 2016

Like some of the other students in the class, I attended Google's SRE talk for the month of March.

The first presenter was Homin Lee from Datadog. He is not a sys admin per se. He writes software tools to help sys admins monitor the health of their servers and set thresholds/alarms accordingly.
First, he showed us how dashboards and charts can get cluttered quickly. Sometimes humans make mistakes and cannot easily catch anomalies in charts.
For simple data points like swap usage or disk usage, threshold alerting can be a good way to automate this kind of monitoring.
For other data points, like machines that run certain jobs, network usage, throughput, etc., threshold alerting may not work as well: the thresholds constantly need to be changed and may generate a lot of false positive alerts.
He outlined some algorithms that Datadog uses to detect outliers and anomalies in data. 
He categorised these algorithms as 'robust outlier detection algorithms'; they do not simply look for thresholds but intelligently look at historical data.
He wanted his algorithms to behave much like how humans look at charts, i.e., the algorithm looks for data that does not behave like past data and alerts on it.
He presented two algorithms used by his monitoring platform for catching outliers in a data set.
Median absolute deviation (MAD): for a data set X = {x1, x2, ..., xn}, MAD is calculated as median({|xi - median(X)|}), i.e., the median of each point's absolute distance from the median of the data set. MAD is resilient to outliers.
Hence, it will not alert on small differences that a human would ignore by eyeballing a chart.
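A minimal Python sketch of the MAD calculation described above. This is not Datadog's implementation: the rule of flagging points that sit more than `tolerance` MADs away from the median, the default `tolerance` of 3.0, and the sample CPU values are all assumptions made for illustration.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(values)|."""
    values = list(values)
    m = median(values)
    return median(abs(x - m) for x in values)


def mad_outliers(values, tolerance=3.0):
    """Return indices of points more than `tolerance` MADs from the median.

    `tolerance` is a hypothetical tuning knob, not a value from the talk.
    """
    values = list(values)
    m = median(values)
    deviations = [abs(x - m) for x in values]
    spread = median(deviations)
    if spread == 0:
        # More than half the points are identical; flag anything that differs.
        return [i for i, d in enumerate(deviations) if d > 0]
    return [i for i, d in enumerate(deviations) if d / spread > tolerance]


if __name__ == "__main__":
    # Hypothetical CPU usage (%) for seven machines; the last one is misbehaving.
    cpu = [10, 11, 9, 10, 12, 10, 95]
    print(mad(cpu))           # 1
    print(mad_outliers(cpu))  # [6]
```

Because the median ignores how far away the extreme value is, the single reading of 95 does not inflate the spread the way it would inflate a standard deviation, so the small differences between 9 and 12 stay below the tolerance.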
The other algorithm he discussed was density-based spatial clustering of applications with noise (DBSCAN).
This algorithm clusters together points in a data set that are related and marks outliers as points in low density regions. 
This is a parameterized algorithm whose parameters (epsilon aka the tolerance level, and min samples) can be changed over time to better tune the algorithm and reduce false alerts.
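A small, self-contained Python sketch of the DBSCAN idea with its two parameters. It is written from the textbook description of the algorithm, not from anything shown in the talk; the function name `dbscan`, the `NOISE` label, and the sample (CPU %, network MB/s) points are hypothetical, and a production system would more likely use a library implementation.

```python
import math

NOISE = -1  # label for points that end up in no cluster (the outliers)


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE.

    eps         -- maximum distance for two points to count as neighbours
    min_samples -- neighbours (including the point itself) needed for a
                   point to be a dense "core" point
    """
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned while expanding an earlier cluster
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # low-density region (may become a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_samples:
                queue.extend(nearby)  # core point: keep growing the cluster
    return labels


if __name__ == "__main__":
    # Hypothetical (CPU %, network MB/s) readings for eight machines.
    machines = [(20, 5), (21, 5), (19, 6), (22, 4),
                (80, 40), (81, 41), (79, 39),
                (50, 90)]
    print(dbscan(machines, eps=3, min_samples=3))
    # [0, 0, 0, 0, 1, 1, 1, -1]
```

Raising `eps` or lowering `min_samples` makes the algorithm more tolerant, so fewer points are reported as noise; that is the tuning trade-off against false alerts mentioned above.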
Basically, all these algorithms try to find anomalies in a data set. 
For example, we were shown a chart showing machines that wake up at night and run some sort of job. 
These algorithms may be able to alert if a machine's CPU usage remains high after the job was done, or if one set of machines never wakes up or runs the job until completion.
The presenter also discussed issues with these algorithms. If the data set contains a lot of deviation, some of the algorithms may become worse at detecting anomalies over time.
Again, these algorithms are not perfect; they may still generate false positive alerts. However, they are much better than having a human glance at the charts and try to spot the anomalies.
His system is still in alpha and has just been deployed to production, so he could not provide us with information on the system's accuracy and its false positive rate.

The second presenter was Carlos O’Ryan from Google. He discussed SLAs (service level agreements), SLOs (service level objectives) and SLIs (service level indicators).
He described his thought process on how to deal with SLAs, and discussed how to deal with clients that require certain unreasonable SLAs.
A service level agreement is the formal requirement that (usually) the client comes up with. This contract also states what the engineers will do if they can't meet the agreement.
Service level indicators are the data points that will be measured to track the SLAs, e.g., uptime, system response, throughput, etc.
Service level objectives are the boolean statements that describe the target for an SLI, e.g., uptime > 99.999%.
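To make the SLI/SLO distinction concrete, here is a short Python sketch. The function names, the idea of measuring uptime as the fraction of successful health checks, and the 30-day window are assumptions for illustration; only the "uptime > 99.999%" objective comes from the talk notes above.

```python
def availability(successful_checks, total_checks):
    """SLI: the measured value, here the fraction of health checks that passed."""
    if total_checks <= 0:
        raise ValueError("need at least one check to compute availability")
    return successful_checks / total_checks


def meets_slo(sli_value, target=0.99999):
    """SLO: a boolean statement about the SLI, e.g. uptime > 99.999%."""
    return sli_value > target


def allowed_downtime_seconds(target, days=30):
    """Downtime that a given uptime target leaves room for over `days` days."""
    return (1 - target) * days * 24 * 60 * 60


if __name__ == "__main__":
    uptime = availability(successful_checks=999_900, total_checks=1_000_000)
    print(uptime)             # 0.9999
    print(meets_slo(uptime))  # False
    print(round(allowed_downtime_seconds(0.99999), 2))  # 25.92
```

The last number shows why measuring comes before negotiating: a 99.999% uptime objective leaves roughly 26 seconds of downtime in a 30-day month, which is the kind of data that helps when deciding whether a requirement is realistic.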
The first step when a client comes to Carlos' team with certain requirements is for his team to figure out how to measure the requirement that the client specifies.
This comes first because once the client and the sys admins are on the same page, they can better serve each other, or their company if they work for the same one.
One thing that the presenter focused on is that clients often come up with unrealistic requirements.
He believed that systems administrators should meet their SLAs if possible. 
If not, they should work with the client to collect the data (SLO/SLI) and show the client that they are trying to meet the requirements, but are unable to, because the requirements are unfeasible.
There is an important difference between SLAs being unreasonable and the engineering team not being able to meet (reasonable) SLAs due to downtime, hardware or software issues.
If the SLAs are reasonable and the team still misses them, the company the engineering team works for must pay the client or, more usually when the client works for the same company, the engineers must promise to work overtime and better meet the SLAs.
The presenter also discussed what to do if an engineering team is meeting and exceeding their SLAs. One view is to save the company some money by shutting down some servers while still meeting the SLAs. Another view is to reduce the SLAs so they better match reality.

These presentations were interesting. Although I am not a system administrator, I work in the software industry.
It was interesting for me to get a perspective on how system administrators deal with SLAs, and how these things are discussed with clients. 
In my company, we mostly use cloud service providers for deploying software, and things like SLAs/SLOs are usually negotiated with the cloud service provider. As software engineers, we are not always aware of these things.
It helped me get a perspective that when we deploy applications in data centers that we do not own, there are always engineers getting alerted at night, working on meeting the SLAs.
I also learned new terms (SLOs, SLAs, SLIs), which helped make some of the discussions easier to understand.
Furthermore, it was useful to learn about the various algorithms being developed to better monitor software; these can be even more valuable if they can predict future system behaviour.
I chose this tech talk because it was presented by Google, one of the largest tech companies in the US. 
I also wanted to learn more about systems administration as it is not something I deal with every day. The topics were well presented and the presenters were knowledgeable and able to answer questions.