[cs615asa] [CS615] HW#N - Attend a Meetup/Talk/community event.

Anwar Alruwaili aalruwai at stevens.edu
Tue May 3 20:02:06 EDT 2016


Hello everyone, my name is Anwar, and I attended Engineering NYC Tech Talks
in New York on April 20th, 2016.
I took some pictures at the event; they are available on my personal
course website: https://www.cs.stevens.edu/~aalruwai/CS615/HWN/
The First Presentation: The speaker was Alexis Le Quoc, co-founder of
Datadog. His topic was a postmortem, a technical deep dive into what went
wrong during an outage and what he learned from it. The reason for the
presentation was an incident where a service had a problem; they fixed the
problem, but at first they did not understand how or why the fix worked.
What happened was that there was an issue on the website. The first symptom
was that navigation got slow, and then there were frequent and intermittent
errors. Internally, their web application was returning HTTP 5xx status
codes. The problem affected all pages, web services/servers, and SQL. They
had a hard time making sense of the results: the failing requests were
slower to respond but produced no stack traces. (A 5xx status code means
the server is aware that it has erred or is incapable of performing the
request.)

The session stage: Le Quoc also talked about how their applications use
cookies as part of existing sessions, so the application does not require
users to sign in again on every request. Tracking the network issue led
them from the 5xx errors to TCP retransmits: TCP uses sequence numbers and
acknowledgements (ACK or SACK) to provide reliable delivery and congestion
control. In a TCP retransmit, the sender transmits a packet and waits for
an acknowledgement; if the retransmission timeout expires (typically some
milliseconds, or less than a second), the packet is assumed lost and is
sent again. More info: https://tools.ietf.org/html/rfc5681
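
To make the retransmission idea concrete, here is a small sketch I put
together (not from the talk): a toy stop-and-wait sender over UDP that
numbers each packet, waits for an acknowledgement, and retransmits when the
timeout expires. The peer address, the timeout, and the assumption that the
receiver echoes back the sequence number are all made up for the example;
real TCP does this inside the kernel with adaptive timeouts.

    import socket

    PEER = ("127.0.0.1", 9999)   # hypothetical receiver address
    RTO = 0.2                    # retransmission timeout in seconds (made up)

    def send_reliably(data: bytes, max_tries: int = 5) -> None:
        """Toy stop-and-wait sender: number packets, await ACKs, retransmit."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(RTO)
        seq = 0
        for start in range(0, len(data), 1024):
            packet = seq.to_bytes(4, "big") + data[start:start + 1024]
            for _attempt in range(max_tries):
                sock.sendto(packet, PEER)
                try:
                    ack, _ = sock.recvfrom(4)
                    if int.from_bytes(ack, "big") == seq:
                        break            # acknowledged, move to next packet
                except socket.timeout:
                    continue             # no ACK in time: assume drop, resend
            else:
                raise RuntimeError("no acknowledgement after retries, giving up")
            seq += 1

    if __name__ == "__main__":
        send_reliably(b"hello " * 100)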

Le Quoc also talked about whether there was an issue with busy CPUs in the
cloud. When an AWS instance's CPU is busy, they need to track the system
metrics with their Datadog tools; incoming packets also need CPU cycles to
be processed.
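
As an illustration of how a busy or oversubscribed cloud CPU shows up in
system metrics, here is a small sketch of my own (not from the talk) that
reads /proc/stat on a Linux instance and reports the share of CPU time
stolen by the hypervisor for other guests; a monitoring agent collects
similar counters.

    def cpu_steal_percent() -> float:
        """Share of CPU time stolen by the hypervisor, from /proc/stat (Linux)."""
        with open("/proc/stat") as f:
            fields = f.readline().split()   # aggregate "cpu" line
        # Columns after "cpu": user nice system idle iowait irq softirq steal ...
        values = [int(v) for v in fields[1:]]
        steal = values[7] if len(values) > 7 else 0
        total = sum(values)
        return 100.0 * steal / total if total else 0.0

    if __name__ == "__main__":
        print(f"CPU steal since boot: {cpu_steal_percent():.2f}%")
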
He also talked about what causes TCP retransmits: the virtual machine
monitor may drop packets to enforce quality of service (QoS) between
virtual machines, and the internal network may drop packets when internal
traffic is heavy.
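
One way to see whether TCP retransmits are actually happening on a Linux
host is to read the kernel's TCP counters. This is my own small sketch, not
something shown in the talk: it pulls OutSegs and RetransSegs from
/proc/net/snmp and reports what fraction of outgoing segments were
retransmissions since boot.

    def tcp_retransmit_rate() -> float:
        """Percentage of outgoing TCP segments that were retransmissions (Linux)."""
        with open("/proc/net/snmp") as f:
            lines = [line.split() for line in f if line.startswith("Tcp:")]
        header, values = lines[0], lines[1]    # field names, then counter values
        tcp = dict(zip(header[1:], (int(v) for v in values[1:])))
        out, retrans = tcp["OutSegs"], tcp["RetransSegs"]
        return 100.0 * retrans / out if out else 0.0

    if __name__ == "__main__":
        print(f"TCP retransmit rate since boot: {tcp_retransmit_rate():.3f}%")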

At the end he talked about the lessons they learned: they need to check
every instance in the AWS cloud, because network performance is not a
precisely quantified resource in the cloud. A detailed description of parts
of this incident is available as "Elevated Web exception rate"
<https://datadog.statuspage.io/incidents/7t1dnmbhrxk2>.

At the end, they are hiring! :)


He covered many topics, and it is a good reminder to focus on the technical
details when something goes wrong. It is also beneficial to know more about
the health of the system. We need to focus on system health checks, both to
avoid problems and to help us find solutions. For example, one should
automatically check the status and configuration of the system and create a
report on specific problems, and then the fun part is analysing the data!
The questions are how to collect and classify the data, how to automate
alerts, how to investigate problems, and how to get more information about
performance problems. We need to watch for conditions such as anything
affecting traffic, all pages on the web services/servers, or SQL. We need
to describe a problem clearly, because it is not easy to keep track of the
systems and identify the problems.
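
As a tiny example of this kind of automated status check (my own sketch;
the URL is only a placeholder), the script below fetches a page, records
the status code and response time, and flags server-side 5xx responses so
the data can be collected and analysed later.

    import time
    import urllib.error
    import urllib.request

    URL = "https://example.com/healthz"   # placeholder health-check endpoint

    def check(url: str = URL) -> dict:
        """Fetch a URL and report status code, latency, and 5xx errors."""
        start = time.time()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                status = resp.status
        except urllib.error.HTTPError as err:
            status = err.code              # e.g. 500, 502, 503
        elapsed = round(time.time() - start, 3)
        return {"url": url, "status": status, "seconds": elapsed,
                "server_error": status >= 500}

    if __name__ == "__main__":
        report = check()
        print(report)
        if report["server_error"]:
            print("ALERT: server-side error, investigate the backend")
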
The Second Presentation: The speakers were Matt and Paul, both software
engineers from Google, and they talked about crash analysis. They spend 20%
of their time on this project, which is crash reporting and analysis.

The reason for this presentation was that, in 2014, most services dumped
debug data and backtraces to the local disk of the server when a binary
crashed. This is problematic for a lot of reasons, but the biggest issue is
that most teams relied on going onto the servers and digging through
rotating data when they had to debug problems later on, which is a very
manual process. At Google, engineers would monitor for a crashed task and
download its stack before the data was reclaimed. The goal is to build a
crash analysis service that the many engineers who run products at Google
can use. They want to provide a more secure, centralized collection service
that can be used by all Google teams; the teams then collect all the kinds
of data that would be useful for debugging a crashed binary, such as stack
traces, bugs, or anything else...

   - They want to automatically analyze failures, not just collect debug
   data in one centralized location.
   - They want to detect crashes to avoid overcorrection.
   - They want to protect the systems from failure.
   - They want to minimize the compile-time and run-time cost.

The implementation includes analysis and detection such as aggregate
analysis and query-of-death detection. A crashing binary is something that
someone wrote and that is now not running; when a binary running on a
single server crashes, its crash data needs to be streamed off the machine
to an external service. The difficulty with communicating with an external
system is that when a Google binary crashes, its memory may be corrupted,
so they do not want to do network communication or memory allocation from
inside the crashing process. Their solution is to run a heap process on
every Google machine that listens on a socket for crash data; the crashing
task opens the socket and streams bytes to the heap process. The
communication protocol between the crashing binary and the heap process is
designed to avoid any memory allocation. The heap process can then
serialize the crash data into protocol buffers and send it to their remote
procedure call (RPC) service on their server. It sends the owner of the
binary, the name of the binary, the time of the crash, and more information
about the crash.

The RPC service provides a way for one application to call procedures in
another application across the network.
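
Here is a very simplified sketch of that pattern; this is my own
illustration, not Google's code, and the socket path and record format are
made up. A collector process listens on a local Unix socket, and a crashing
task writes a small, pre-formatted crash record to that socket instead of
making its own network calls. (A real crash handler also avoids heap
allocation inside the handler, which a Python example cannot really show,
and the collector would serialize the record into protocol buffers and
forward it over RPC.)

    import os
    import socket

    SOCK_PATH = "/tmp/crash-collector.sock"   # hypothetical socket path

    def run_collector() -> None:
        """Collector: accept crash records on a Unix socket and forward them."""
        if os.path.exists(SOCK_PATH):
            os.unlink(SOCK_PATH)
        srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        srv.bind(SOCK_PATH)
        srv.listen(1)
        while True:
            conn, _ = srv.accept()
            record = b""
            while chunk := conn.recv(4096):
                record += chunk
            conn.close()
            # A real collector would build a protocol buffer and send it via RPC.
            print("received crash record:", record.decode(errors="replace"))

    def report_crash(owner: str, binary_name: str, detail: str) -> None:
        """Crashing task: stream a small crash record to the local collector."""
        record = f"{owner}\n{binary_name}\n{detail}\n".encode()
        cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        cli.connect(SOCK_PATH)
        cli.sendall(record)
        cli.close()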

They also talked about streaming hprof and Java heap-dump analysis tools.
The hprof tool is used to examine applications that exhibit behaviors such
as memory leaks, heavy CPU use, or deadlocks. It is useful to learn how to
use hprof to pinpoint a memory leak; that can help us analyze memory usage
and protect user data. Imagine that we did not use such tools or reports
about memory leaks, and we had 100 binaries that all crashed at the same
time! We should also know when the heap and the stack are used.
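
hprof itself works on Java heap dumps; as a simpler, language-neutral
stand-in (my own sketch, not one of the tools from the talk), the script
below samples a process's resident memory from /proc on Linux and flags
steady growth, which is the crudest automated signal of a possible leak
before reaching for a heap dump.

    import time

    def rss_kib(pid: int) -> int:
        """Resident set size (VmRSS) of a process in KiB, from /proc (Linux)."""
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
        return 0

    def watch_for_leak(pid: int, samples: int = 10, interval: float = 1.0) -> None:
        """Flag monotonically growing memory use, a crude sign of a leak."""
        readings = [rss_kib(pid)]
        for _ in range(samples - 1):
            time.sleep(interval)
            readings.append(rss_kib(pid))
        if readings[-1] > readings[0] and all(
                b >= a for a, b in zip(readings, readings[1:])):
            print(f"pid {pid}: RSS grew from {readings[0]} to {readings[-1]} KiB,"
                  " possible leak; take a heap dump (e.g. with hprof) to dig in")
        else:
            print(f"pid {pid}: no steady growth observed")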

The Google event (Engineering NYC Tech Talks) was one of the important
technology events, and it helped us as students to follow up on really
interesting topics such as operations and customer support. I was
interested to hear about different people's experiences and their recent
projects. Afterwards I met some employees from different companies, and it
was useful to hear their opinions and some issues related to these topics.
The topics were useful for our course: as a system administrator, it is
important to analyze the system and to run reports that automatically
detect memory leaks and reduce memory consumption.

Thank you!

Anwar Alruwaili.