[cs615asa] HW#N: Attended a relevant Meetup/Talk/community event

Smruthi Karinatte skarinat at stevens.edu
Mon Apr 18 16:27:10 EDT 2016


Hi All,

I attended the Google TechTalk on Site Reliability Engineering on Feb 17th,
2016. The following points were discussed in the talk.

Background of SRE:
Site Reliability Engineering (SRE) is often described as the next
generation of DevOps. The discipline originated at Google.

Justification for SRE:
It's an age-old power struggle between the Development team and the
Operations team. The Development team wants to release new and improved
features to the masses and see them take off. The Operations team, on the
other hand, wants to make sure that these new features don’t break the
existing running environment. This is where SRE comes into the picture.
SREs help resolve the conflict over what can be launched, and when, by
introducing software that decides whether a feature is safe or unsafe to
launch. This process is commonly termed “green- or red-lighting launches”.
So basically SREs are software engineers with a DevOps skill set.

Responsibilities of SRE:
The responsibilities of an SRE are the same as those of System
Administrators or an Operations team: availability, latency, performance,
efficiency, change management, monitoring, emergency response, and capacity
planning. In other words, keeping the system up and running. But unlike the
DevOps team, SREs work on automating the system checks and various other
responsibilities rather than repeating the task manually on every system.

In the tech talk the presenter, Tom, who is an SRE at Google, spoke about
his responsibilities. He was part of the Production Monitoring SRE team,
which manages Google’s internal monitoring infrastructure. The team's
responsibilities were the following.

·        Storage of Monitoring data

·        Tools that support queries and analysis

·        Alerts & notifications

·        Telemetry for services running in production datacenters

SRE Team composition:

What distinguishes SRE from DevOps is the people who work on the team. The
SRE team is a 50-50 mix: software engineers who know about data structures,
algorithms and programming languages, and engineers who also know about
systems engineering or network engineering.

Benefits of SRE:
Going back to the case of the Production Monitoring SRE team: the
presenter's mantra was that “monitoring is a solved problem”, meaning that
monitoring should no longer be a job that needs human intervention but
should be completely automated, so that engineers can make better use of
their time.
As an example, a DBA writes a query to retrieve data from his MySQL server.
When a user queries the database, depending on where the data is stored,
the whole table may need to be scanned. The more data you have, the longer
it takes to read or write, because the processing time grows when all of
the data has to be scanned.
What do real queries look like? Do they efficiently provide the information
required?
·        How long has process X been running?
(One target, one metric, latest value)
·        Show a graph of Y requests over the last hour, day, week, year
(Multiple targets, one metric, varying time ranges)
·        Show a graph of all requests over the last hour, day, week, year
(Multiple targets, multiple metrics, varying time ranges)
Bucketing strategies are very useful for these queries.
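
To illustrate what bucketing means here, below is a minimal Python sketch.
It is my own illustration and not code from the talk: the bucket widths and
the name bucket_samples are assumptions. The idea is that a query over a
longer time range is answered from coarser buckets, so far fewer points
have to be read.

```python
from collections import defaultdict

# Hypothetical bucket widths in seconds for each query range.
# These values are assumptions for illustration, not from the talk.
BUCKET_SECONDS = {"hour": 60, "day": 900, "week": 3600, "year": 86400}


def bucket_samples(samples, time_range):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a dict mapping each bucket's start timestamp to the mean of
    the values that fall inside that bucket.
    """
    width = BUCKET_SECONDS[time_range]
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [total, count]
    for ts, value in samples:
        start = ts - (ts % width)  # round down to the bucket boundary
        acc = sums[start]
        acc[0] += value
        acc[1] += 1
    return {start: total / count
            for start, (total, count) in sorted(sums.items())}


samples = [(0, 10.0), (30, 20.0), (60, 30.0), (3599, 50.0)]
print(bucket_samples(samples, "hour"))  # {0: 15.0, 60: 30.0, 3540: 50.0}
print(bucket_samples(samples, "week"))  # {0: 27.5}
```

With one-minute buckets the four samples collapse into three points; with
one-hour buckets they collapse into a single point.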

Integrity before availability:
Availability of 99.99% roughly means a total of less than 53 minutes of
downtime in a year; generally, users recover and life goes on.

Integrity of 99.99% means that for 2GB of data, 200+KB is corrupted or
missing.
These losses can have unrecoverable consequences.
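
As a rough check of the arithmetic: 0.01% of a year is about 52.6 minutes,
and 0.01% of 2GB is about 200KB (so the 200KB figure corresponds to four
nines of integrity). A minimal Python sketch, with helper names of my own
choosing:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignoring leap years


def downtime_minutes_per_year(availability):
    """Minutes of allowed downtime per year for an availability fraction."""
    return (1 - availability) * SECONDS_PER_YEAR / 60


def bad_bytes(integrity, total_bytes):
    """Bytes that may be corrupt or missing for an integrity fraction."""
    return (1 - integrity) * total_bytes


print(round(downtime_minutes_per_year(0.9999), 2))  # 52.56
print(round(bad_bytes(0.9999, 2 * 10**9)))          # 200000
```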

For example:
·        A document has lost several pages
·        An executable is useless
·        A database is corrupt
·        A video is garbled

Hence, to summarize, a small data loss can have big consequences.

Availability matters only if the data that is available is intact, i.e.
not corrupted. To ensure data integrity, the following points need to be
followed.
·        Plan
·        Practice
·        Automate, Test and Adapt
·        Avoid horror, panic and tragedy

After attending this tech talk, I became aware of the emerging role of the
SRE and of its responsibilities. The talk also helped me understand what
distinguishes an SRE from a System Administrator. SRE is a field where
people with good system knowledge and coding skills can find a perfect
role. They develop tools to automate jobs that do not require human
intervention, so that engineers can use their time for more intellectual
work.
I also learned to mix and mingle with people who share the same interests.
Overall it was a good learning experience.

I chose to attend this tech talk because SRE is the next step in System
Administration, and who better to explain it than Google techies?

-- 
Regards,
Smruthi Karinatte


