[cs615asa] Homework#N - Attend a Meetup/Talk/community event.

Ramit Farwaha rfarwaha at stevens.edu
Mon May 2 15:57:46 EDT 2016


Hi All,

I attended the Google SRE Tech Talks on Wednesday 4/20/2016 at their office
in NYC. I chose to attend this event because I am interested in the field
of SRE. Coincidentally, I also ran into 2 of my co-workers who are on the
SRE team at the company I work for. The event consisted of 2 talks, both
given by SREs: one by an engineer from DataDog and one by 2 SREs employed
by Google. The event ran about 2 hours in total; they provided an hour for
food and mingling before the talks started, and there was another mingle
session afterwards as well.

The talks concentrated on postmortems.

DataDog is a company which brings together data from servers, databases,
applications, tools and services to present a unified view of the
applications that run at scale in the cloud. The DataDog SRE described an
incident at DataDog and the steps taken to handle it. The first step was
deciding what to check; if the problem is the CPU, then the best course of
action is to scale up. He then went into detail on what was included in
the postmortem:

   1. Tracing network issues in the cloud
   2. Correlating TCP retransmitted segments with latency
   3. What is dropping the packets?
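
As a rough illustration of item 2, one way to check whether retransmitted
segments and latency move together is to compute a correlation coefficient
over samples taken at the same times. The sketch below is only an
assumption about how such a check could look, not the method shown in the
talk; the function name and the sample values are invented for
illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-minute samples (made up, not data from the talk).
retransmits_per_min = [2, 3, 2, 40, 55, 61, 4, 3]
p99_latency_ms = [110, 120, 115, 480, 610, 650, 130, 118]

r = pearson(retransmits_per_min, p99_latency_ms)
print(f"correlation between retransmits and latency: {r:.2f}")
```

A value close to 1 would suggest the two series rise and fall together,
which is a hint worth following up and not proof of a cause.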

A quick hypothesis-test feedback loop:

   1. The wrong analysis still led to the right action
   2. What went wrong? An appealing but wrong hypothesis is really hard to
      abandon
      - Saved by a second pair of eyes and ears to poke holes

Google's 2 SREs gave a talk on building Coroner (an internal project on
crash reporting) and how it works. SREs from different teams come together
to work on Coroner and collaborate on analyzing the crash reports. Having
all the crash reports in one place provides a more secure, centralized
collection service that can be used by all teams. It supports all main
languages and automatically analyzes failures to help reduce mean time to
repair. They described how the crash reporter communicates with an
external system from a signal handler. The collection service handles
writing crash metadata to a database and deciding whether to throttle. The
final piece is analysis: one of the main benefits of a centralized
collection service is that analysis pipelines can be built to process
crash data as it arrives.
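
To make the "write crash metadata to a database and decide whether to
throttle" step more concrete, here is a minimal sketch of that idea. It is
not how Coroner is implemented; the class name, the table layout, the
per-program sliding-window limit and the in-memory SQLite database are all
assumptions made up for this example.

```python
import sqlite3
import time


class CrashCollector:
    """Hypothetical collector: stores crash metadata, throttles noisy programs."""

    def __init__(self, max_per_window=3, window_seconds=60.0, clock=time.monotonic):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock
        self.recent = {}  # program name -> timestamps of accepted reports
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE crashes (program TEXT, signal_name TEXT, received_at REAL)"
        )

    def submit(self, program, signal_name):
        """Store the report and return True, or return False if throttled."""
        now = self.clock()
        # Keep only the accepted reports that are still inside the window.
        window = [t for t in self.recent.get(program, [])
                  if now - t < self.window_seconds]
        if len(window) >= self.max_per_window:
            self.recent[program] = window
            return False
        window.append(now)
        self.recent[program] = window
        self.db.execute(
            "INSERT INTO crashes VALUES (?, ?, ?)", (program, signal_name, now)
        )
        self.db.commit()
        return True

    def stored_count(self):
        return self.db.execute("SELECT COUNT(*) FROM crashes").fetchone()[0]


collector = CrashCollector(max_per_window=3, window_seconds=60.0)
results = [collector.submit("frontend", "SIGSEGV") for _ in range(5)]
print(results)
print(collector.stored_count())
```

With a limit of 3 reports per 60 seconds, the first 3 reports from the same
program are stored and the remaining 2 are throttled, so a crash loop in
one program cannot flood the database.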

What I took away from this meetup was a high-level overview of how other
companies conduct postmortems, and the similarities and differences
compared to how the company I work at (Sailthru) does them. Also, it's
good to see how other enterprise companies overcome their incidents and
the difficulties they see with their products. I got a little more out of
the first half from DataDog, as Google's SREs were very detailed about
their product and lost me at some parts.


Thanks,
Ramit Farwaha