[cs615asa] [CS514]Assignment#N Tech Talk

Wed Mar 23 11:40:45 EDT 2016

Hello everyone,
I attended Google's SRE tech talk on March 16. Here's the report.

A few facts about SRE
1. What is SRE?
   Site/Service Reliability Engineering has not been officially defined
   yet, but is described by its mastermind, Ben Treynor, as "an
   operational job that is done by a software engineer". An SRE is
   basically a developer who has solid operational skills and is able
   to spot problems as well as fix them.
2. SRE vs. DevOps
   According to proponents of SRE (on the web), SRE addresses conflicts
   between the development team and the operations team more
   efficiently and methodically than DevOps. SRE and the development
   team share a single staffing pool. Thus, a well-performing product
   => less ops workload => fewer people on the operations team => more
   on the dev team => ...
3. SRE vs. system administration (relevance to this assignment)
   Essentially, the task of SRE is similar to that of an operations
   team but with a different approach, so SRE also deals with product
   deployment, monitoring and alerting, logging and QA, tech support,
   etc., which are within the realm of sysadmin.
   Google's hiring policy for SRE is "...for those who are close to
   passing the Google SWE bar, and who in addition also have a
   complementary set of skills that are useful to us. Network
   engineering and Unix system administration are two common areas
   that we look at; there are others."

Summary of the tech talk:
The session consisted of two topics.
1. Monitoring and alerting on outliers and anomalies (a data science
   approach).
   The majority of the work in data science lies in preparing raw data
   for further analysis. As I recall, a question was asked during the
   session about how the data is prepared and what it means, but it
   was answered only vaguely, with a few tool names. This talk
   introduced how data science can be used in SRE to automate
   monitoring tasks. Data mining can be applied to attributes such as
   system resource usage, web site statistics, etc. to detect site
   vulnerabilities and system failures. The process is basically to
   analyze the data with algorithms (MAD and DBSCAN, as mentioned in
   the talk) and to alert sysadmins for further action if any outlier
   or anomaly is found.
   A detailed description of these two algorithms can be found at:

https://www.datadoghq.com/blog/outlier-detection-algorithms-at-datadog/
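
As a rough illustration of the MAD (median absolute deviation) idea, here is a minimal sketch. It is not code from the talk or from Datadog; the function name `mad_outliers`, the threshold of 3.0 and the sample CPU readings are all assumptions made up for this example. It flags any reading whose distance from the median is large compared with the typical distance from the median.

```python
from statistics import median


def mad_outliers(values, threshold=3.0):
    """Return the indices of values that look like outliers.

    Hypothetical helper: a point is flagged when its absolute deviation
    from the median, divided by the scaled MAD, exceeds `threshold`.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # At least half of the points equal the median, so this simple
        # score is undefined; report nothing rather than divide by zero.
        return []
    # 1.4826 makes the MAD comparable to a standard deviation
    # when the data is roughly normally distributed.
    scale = 1.4826 * mad
    return [i for i, d in enumerate(abs_dev) if d / scale > threshold]


# Made-up CPU usage samples (percent); one reading is clearly abnormal.
cpu = [41.0, 39.5, 40.2, 42.1, 40.8, 95.3, 41.4]
print(mad_outliers(cpu))  # prints [5]
```

In a monitoring setup, the returned indices would be what triggers an alert to the sysadmin. The median is used instead of the mean so that a single extreme reading does not distort the baseline it is compared against.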
2. SLI, SLO and SLA
   I didn't quite capture the big picture of this talk. Here are a few
   things I managed to grasp.
   1) An SLI (Service Level Indicator) is used to quantify an SLO
      (Service Level Objective), so that an SLA (Service Level
      Agreement) can be enforced on a solid foundation.
   2) The sales team usually makes SLO promises that the speaker, as
      the service engineer, finds impossible to achieve.
   3) The difference between external and internal SLAs is that:
      a) For an external SLA, which is between the customer and the
         service provider, failing to achieve the SLO results in a
         financial penalty.
      b) The internal SLA is between the sales team and the engineers.
         The engineers promise "... to get up in the middle of the
         night to address the problem so as to keep the SLO still
         met..."
   4) The underlying idea of the "error budget" is: "Devs can launch
      as many new features as they want as long as it can be afforded
      by the error budget mentioned in the SLO". New features are
      often buggy and lead to service failures. The "budget" rule is
      the SRE way of balancing service scalability and reliability.
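
The relationship between SLI, SLO and error budget can be sketched with a little arithmetic. This is only an illustration under assumptions, not something shown in the talk: the SLI is taken to be the fraction of successful requests, the SLO value of 99.9% and the request counts are invented, and the function names are hypothetical.

```python
def sli(good_requests, total_requests):
    """Service Level Indicator: fraction of requests served successfully."""
    return good_requests / total_requests


def error_budget_left(slo, good_requests, total_requests):
    """Failed requests still allowed before the SLO is missed.

    A negative result means the budget is already overspent.
    """
    allowed_failures = (1.0 - slo) * total_requests
    actual_failures = total_requests - good_requests
    return allowed_failures - actual_failures


# Invented numbers for one measurement window.
slo = 0.999                       # assumed objective: 99.9% success
total, good = 1_000_000, 999_400  # 600 failed requests so far

print(f"SLI = {sli(good, total):.4f}")  # prints SLI = 0.9994
left = error_budget_left(slo, good, total)
print("launches can continue" if left > 0 else "budget spent, hold launches")
```

With these numbers the SLO allows about 1000 failed requests and 600 have been used, so roughly 400 remain for the devs to "spend" on risky launches.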

Jingxi Yao