[cs615asa] HW#N: Attend a relevant Meetup/Talk/community event

Mon May 2 11:29:12 EDT 2016

I chose to go to a Google SRE Tech Talk on April 20th. It caught my
attention because there would be a talk by a member of Datadog, a company
I was interested in. Before the event, there was an hour for networking and
food. After that, there were two talks, one from a Sys Admin at Datadog and
another from two SREs at Google.

The first talk, by the Sys Admin at Datadog, was about working through a
‘postmortem’. Datadog is a tool that allows you to view metrics for your
infrastructure. The company had an issue where their web page was down for
about 2 hours. As you can imagine, this was a huge problem because this web
page was a user’s primary way to view system metrics. The speaker told
us the story of which directions they went in to try to resolve the issue:

   - Checking if it was a geographical outage
   - Checking CPU loads on machines
   - Checking TCP transmissions

Their initial hypothesis was that they just needed to scale up. They
believed their instances were not able to handle the load, but they were
not sure how to confirm this.

The company used AWS services, so after checking with Amazon, they found it
was not a geographical outage. They also checked with clients to see where
the outage was, but there was no single place where it had started.

They also checked the TCP transmissions from the systems and looked for
things like excessive throttling. From what the team saw, this also looked
fine. There were no dropped packets, so they needed to move on to something
else.

Another thing that they checked was CPU load. The initial hypothesis by
the team was to scale up their systems, but they did not know what was
causing the problem. The answer came from checking with AWS: the
recommended limit for their instances was 70 mb/s, while they were a little
over that limit (~72 mb/s). The solution had been predicted from the
beginning; they just could not figure out the root cause.
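
To make the numbers concrete, here is a minimal sketch of the kind of
check that would have flagged the problem. This is my own illustration and
not something shown in the talk: the function names are hypothetical, and
the only values taken from the talk are the 70 mb/s recommended limit and
the roughly 72 mb/s that the instances were actually pushing.

```python
def utilization(measured, limit):
    """Return measured throughput as a fraction of the recommended limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return measured / limit


def over_limit(measured, limit):
    """True when the measured throughput exceeds the recommended limit."""
    return measured > limit


# Values as quoted in the talk (units as the speaker gave them: mb/s).
RECOMMENDED_LIMIT = 70.0
MEASURED = 72.0  # approximate

ratio = utilization(MEASURED, RECOMMENDED_LIMIT)
print(f"at {ratio:.1%} of the recommended limit, "
      f"over limit: {over_limit(MEASURED, RECOMMENDED_LIMIT)}")
```

Being only a few percent over the limit is easy to miss on a dashboard,
which fits the story that the team suspected load from the start but could
not pin down why.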

The moral of this talk was to write down your initial hypothesis to keep
track of what you have checked. Sometimes, it will lead you in the right
direction even if your reasoning behind the hypothesis was wrong.

Next was a talk by two SREs at Google. Their talk was about an internal
tool they were developing at Google called Coroner. The goal of Coroner is
to handle all crash reporting and analysis on any development machine at
Google. It sends all crash information to a central location where
developers can analyze these crashes.

This works by having all machines at Google listen on a unix domain socket
and run a local helper to track crashes. It deals with deduplication of
crash errors by checking the stack hash within a certain period of time. A
key portion of this project is to deal with Java crashes. It needs to track
things such as OutOfMemory exceptions and heap dumps. In addition, it
streams the HPROF output to look at CPU usage.
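
Since we could not see the tool itself, here is a rough sketch of how I
understood the deduplication idea: hash the stack trace of each crash, and
skip a report if the same hash was already reported within some time
window. Everything here is an assumption on my part (the class name, the
use of SHA-256, the 5 minute default window, and measuring the window from
the last reported crash); it is not Coroner's actual code or API.

```python
import hashlib
import time


class CrashDeduplicator:
    """Suppress repeat reports of the same stack within a time window."""

    def __init__(self, window_seconds=300.0):
        self.window_seconds = window_seconds
        self._last_reported = {}  # stack hash -> time of last report

    @staticmethod
    def stack_hash(frames):
        """Hash the stack frames (a list of strings) into one key."""
        joined = "\n".join(frames).encode("utf-8")
        return hashlib.sha256(joined).hexdigest()

    def should_report(self, frames, now=None):
        """Return True if this crash is new within the window."""
        if now is None:
            now = time.time()
        key = self.stack_hash(frames)
        last = self._last_reported.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # same stack seen recently: treat as duplicate
        self._last_reported[key] = now
        return True


dedup = CrashDeduplicator(window_seconds=60.0)
frames = ["java.lang.OutOfMemoryError", "Foo.allocate", "Foo.main"]
print(dedup.should_report(frames, now=0.0))   # first time this stack is seen
print(dedup.should_report(frames, now=30.0))  # same stack, inside the window
print(dedup.should_report(frames, now=90.0))  # same stack, window has passed
```

The `now` argument is only there so the example can be run with fixed
timestamps instead of waiting for real time to pass.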

The first talk was pretty understandable and it gave a good moral at the
end. The second talk was a bit more confusing since we could not actually
see what the team had created and they could only talk about it. Still,
both talks were very informative.