[cs615asa] Google NYC SRE Tech Talk - 03/16
mpeleshe at stevens.edu
Thu Mar 17 13:44:52 EDT 2016
Last night I attended the 3rd SRE Tech Talk at Google in NYC, part of their new monthly series that started this January. It was a great event and a good opportunity to mingle with other engineers and hear what some industry professionals had to say about the field of system administration.
First up was a presentation about detecting outliers and anomalies by Homin Lee from Datadog. He started off by covering the basic idea of monitoring different system/application metrics and alerting operations staff when these metrics breach a certain threshold, like HDD utilization reaching 90%. This kind of simple Outlier Detection can be improved on to cut down on the noise of false alerts by applying one of two proposed algorithms: Median Absolute Deviation (MAD) or Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Both of these algorithms basically compare metric values at each time point to the median values across some time period and flag anything outside those bounds as an outlier. However, this methodology is more prone to noise when applied to other metrics, such as latency for an application, which can be affected by nightly batch jobs or other "seasonal" activities on its hosts. For this, Anomaly Detection makes more sense, as it involves decomposing a time series of metric values into several components such as a general trend line, seasonality, and noise. Analyzing all of this data together can help classify incoming metric values as anomalies or not. Unfortunately, this also means that your anomaly model needs to be adjusted after anomalies occur, to make sure they don't become part of the model itself and cause false alerts when things are normal. Regardless, Datadog is currently working on this Anomaly Detection feature for their monitoring product and implementing an auto-correlation feature, which will build correlations for metric values by comparing each metric value to several of those before it. Clearly, a good monitoring system that automates the process of having to look through different time ranges for anomalies yourself is beneficial to any system administrator, as they can then focus on more important tasks.
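To make the MAD idea concrete, here is a minimal sketch of median-based outlier flagging as described above. The threshold of 3.0 and the sample data are my own illustrative assumptions, not parameters from the talk or from Datadog's product:

```python
import statistics

def mad_outliers(values, threshold=3.0):
    """Flag points whose deviation from the series median exceeds
    `threshold` times the Median Absolute Deviation (MAD)."""
    median = statistics.median(values)
    abs_devs = [abs(v - median) for v in values]
    mad = statistics.median(abs_devs)
    if mad == 0:
        return []  # series is essentially flat; nothing to flag
    return [v for v, d in zip(values, abs_devs) if d / mad > threshold]

# Hypothetical example: steady disk utilization with one spike.
metrics = [88, 89, 90, 89, 88, 99, 89]
print(mad_outliers(metrics))  # -> [99]
```

Note how the 90 is not flagged even though it crosses a naive "90% threshold" alert, while the 99 clearly is; that is exactly the false-alert noise reduction the talk was getting at.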
The second presentation was about SLIs, SLOs, and SLAs by Carlos O'Ryan from Google. A Service Level Indicator (SLI) is a metric that your service's users would like to know about, like system uptime or latency. A Service Level Objective (SLO) is a target you aim to achieve for an SLI, like a system uptime >= 99.99% or a latency <= 100ms. A Service Level Agreement (SLA) is a formal agreement with your users about what SLOs you will meet and what you will do when you can't meet them. For example, if the users are paying for an SLA specifying latency <= 100ms and the latency goes up for an extended time, you would likely offer the users some kind of refund, which can actually come out of a so-called "error budget" set aside for times like these. Carlos argued that engineers should be telling salespeople what realistic SLOs and SLIs are, so that they can provide their customers with reasonable SLAs that can actually be achieved. The SLOs can also be used as a benchmark for your service: they tell you whether more work needs to be done to meet them, like optimizing some code, or whether cutbacks can be made if the service is constantly exceeding them, like reducing the number of servers needed. SLAs can also be adjusted if SLOs are set too low, or the error budget can be reduced. Carlos also pointed out a difference between external SLAs with customers and internal SLAs with your fellow peers in the company for internal services. Internal SLAs should be more lenient than external ones to avoid situations where a system administrator constantly has to wake up in the middle of the night to work on an internal service.
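The error-budget idea boils down to simple arithmetic: the SLO implicitly defines how much downtime you are allowed before you break it. A quick sketch, with numbers of my own choosing rather than anything quoted in the talk:

```python
def error_budget_minutes(slo_pct, days=30):
    """Downtime budget in minutes over `days` for a given uptime SLO.

    E.g. a 99.99% ("four nines") uptime SLO over a 30-day month
    leaves only about 4.3 minutes of allowable downtime.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_pct / 100)

for slo in (99.9, 99.99):
    print(f"{slo}% uptime -> {error_budget_minutes(slo):.1f} min/month budget")
```

Seeing that a single extra nine shrinks the budget tenfold (from ~43 minutes to ~4.3 minutes per month) makes it obvious why engineers, not just salespeople, need a say in which SLOs go into a customer-facing SLA.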
The main point here was that the engineers behind the service or application being sold should take part in the conversation between salespeople and customers, either to set SLAs that can realistically be achieved given the current situation or to work out a way to handle more demanding SLAs, like hiring more developers.
These two presentations gave good insight into the advanced data mining techniques being applied to monitoring systems to make system administrators' lives easier, and into how engineers should play a larger part in the SLA process to set reasonable expectations. The main reason I chose this event was that it was led by Google, which has a large global footprint in datacenters and offers a variety of services to customers, so they have a wide range of experience with system administration themselves. I was also curious to see what topics would be discussed, since they don't mention them beforehand as far as I can tell, and it's great to hear that these SRE Tech Talks are a monthly series. I definitely recommend that others attend at least one of these tech talks.