[cs615asa] [CS615] SignalFx Meetup

Patrick Murray pmurray1 at stevens.edu
Sun Mar 11 20:53:35 EDT 2018


*Title*: SignalFx - Real-Time Operational Intelligence
*When:* 2018-03-01

Prior to Spring break, I attended a meetup organized by SignalFx, a
company that builds a monitoring and alerting product for tracking the
health of infrastructure across a wide range of environments, including
physical, hybrid, and cloud-native. Since my exposure to monitoring tools
is nearly non-existent (beyond tinkering with Grafana on a few occasions),
this event seemed like a promising opportunity to prepare for the
beginning of my career as an SRE this June.

Although I learned many lessons at this meetup, the most surprising
takeaway was the recommendation to restrict the creation of data sources
to SREs or privileged developers. The logic behind this decision is not to
restrict or "nerf" monitoring tools, but rather to prevent situations where
so many data sources are monitored that nothing useful can be learned from
them when an outage occurs. Instead, these data sources should reflect real
business metrics and only those technical details that are necessary to
quickly hint at where an issue may reside. For example, in the worst case
SREs must be able to identify and troubleshoot the cause of an outage in
less than four minutes (in order to maintain a four-9s SLA). That situation
is truly a worst case, though: in an ideal world, properly implemented
alerting and health checks would warn of any issues long before their
impact is felt.

Please feel free to let me know if you have any follow-up questions,
comments, or concerns!

Best,
Pat

*Notes:*
6:25pm - Start

Presenter: Travis Paterson, Sales, SignalFx

- Cloud workloads have caused massive disruptions in the industry.
- The abundance of high quality open source tools has made deploying
  microservices across environments much easier.

  e.g. Kubernetes, HashiCorp Terraform, AWS CloudFormation

> 76% of enterprises deploy monitoring for their microservices
> 50% of cloud-software will utilize a serverless architecture for at least
  part of their components

Decentralized Chaos Environment

- Organizations whose infrastructure is scattered across physical hardware
  and cloud resources, based mainly on the teams who develop & maintain them.

> Yelp - high velocity infrastructure, flushes their entire infrastructure
  on a daily basis
> Monitoring is necessary to maintain a high SLA and address issues
  quickly

  i.e. 99.99% uptime = 4 min down/month
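
The four-minute figure can be checked with a little arithmetic. This is a
minimal sketch, not something shown at the meetup; the function name and the
30-day month are my own assumptions.

```python
def downtime_budget_minutes(availability_pct: float, days: float = 30.0) -> float:
    """Minutes of downtime allowed in a window of `days` days
    at the given availability percentage (assumes a 30-day month by default)."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


budget = downtime_budget_minutes(99.99)
print(f"{budget:.2f} minutes of downtime per 30-day month")
```

With a 30-day month this comes out slightly above four minutes, which matches
the rounded number in the notes.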

SignalFx - Largest customer throws 40 million data points per minute (DPM)
           at their service

Real-Time Predictive Analytics

- Founders created Facebook's Operational Data Store (ODS), which is used to
  this day for their production monitoring.

  "move fast and break things" -FB

> Churn, number of changes to a system

- Capacity planning = forecasting infrastructure needs and expenses in the
  future based on historical data
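
As a rough illustration of forecasting from historical data, here is a
least-squares straight-line fit. The talk did not say which forecasting
method is used; the linear trend, the function name, and the sample numbers
are all assumptions made for this sketch.

```python
def linear_forecast(history, steps_ahead):
    """Fit a least-squares line to equally spaced samples and
    extrapolate `steps_ahead` periods past the last sample."""
    n = len(history)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(history) / n
    # Slope = covariance(x, y) / variance(x)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + steps_ahead)


# Hypothetical host counts for the last four months.
monthly_hosts = [100, 110, 120, 130]
forecast = linear_forecast(monthly_hosts, 3)
print(f"Projected hosts in three months: {forecast:.0f}")
```

Real capacity planning would also account for seasonality and cost, which a
straight line ignores.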

> For performant historical data lookup, partition data across multiple
  databases and hosts.
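
The notes do not record how the partitioning is done. One common approach,
sketched here purely as an assumption, is to hash the metric name together
with a time window so that each lookup only touches one host. The host names,
the one-day window, and the function name are hypothetical.

```python
import hashlib
from typing import Sequence


def shard_for(metric: str, timestamp: int, hosts: Sequence[str],
              window_s: int = 86400) -> str:
    """Pick a host for a data point by hashing its metric name and
    the time window (default: one day) that the timestamp falls in."""
    key = f"{metric}:{timestamp // window_s}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(hosts)
    return hosts[index]


# Hypothetical storage hosts.
hosts = ["tsdb-0", "tsdb-1", "tsdb-2"]
print(shard_for("cpu.utilization", 1520000000, hosts))
```

Points for the same metric and day always land on the same host, so a
historical query for that range knows where to look.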

> Aggregate (roll up) time-series data to several periods/frequencies so
  that lookups can use a suitable granularity.
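
A minimal sketch of the roll-up idea: average raw points into fixed-size time
buckets, once per resolution. Using the mean as the aggregate, and the sample
data below, are my assumptions rather than anything stated in the talk.

```python
from collections import defaultdict
from statistics import mean


def rollup(points, period_s):
    """Average (timestamp, value) points into buckets of `period_s` seconds.
    Each bucket is labelled with its start timestamp."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % period_s].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical raw samples: (seconds, value)
raw = [(0, 1.0), (10, 3.0), (60, 5.0), (70, 7.0), (125, 9.0)]
one_minute = rollup(raw, 60)
five_minute = rollup(raw, 300)
print(one_minute)
print(five_minute)
```

Long-range queries can then read the coarse series while recent, detailed
views read the fine one.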

Open Source Data Collection Agent

> collectd