[cs615asa] Meetup Summary

Matthew Mahoney mmahone1 at stevens.edu
Thu Apr 26 16:34:32 EDT 2018


Matthew Mahoney

CS-615 Meetup Writeup



Meetup Attended: Big and Streaming Data Architectures on AWS

Hosted by the group: DevOps Efficiency on AWS – NYC

Time of Event: April 24th, 2018 at 6:00pm

Location: 252 W 37th St, 14th Floor, New York, NY.

Meetup Link:
https://www.meetup.com/DevOps-Efficiency-on-AWS-NYC/events/249379457/



Why I Chose the Event:



      When looking for events, I was trying to focus on anything related
to Systems Administration, preferably something tied into AWS, since I
think it’s an interesting platform to get a better understanding of. I
eventually stumbled across this meetup, which looked like an ideal
candidate since it revolved around saving costs when using cloud hosting
providers – specifically AWS – and how to deal with big data. Since I
didn’t know a lot about the pricing of AWS cloud services, or what kinds
of setups organizations use for processing large amounts of data, I
thought this would be an ideal way to learn more. It also turned out that
I wasn’t the only one in the class attending this meetup, so it was nice
to have a familiar face around.





Brief Summary of the Event:



      The meetup started a little after 6:00pm in the Keyrus office space.
The hosts provided pizza and beer, and people began socializing and
talking about some of the projects they were working on. Nearly everyone
attending had a specific interest in cloud services and was using
something along the lines of AWS, Microsoft Azure, or Google Cloud. After
about 20-30 minutes of socializing, the main event started and the first
speaker took the stage.



Presentation One – Keeping compute costs down as data pipelines grow can be
complex but it doesn't have to be, Yuval Dovrat, Director of Solutions
Architecture at Spotinst.



      This first talk was mainly a marketing talk about the company
Spotinst, a cloud service provider built on top of AWS that uses a mixture
of Amazon On-Demand and Spot Instances to save the end-user – according to
Dovrat – roughly 70-90% in costs. Spot Instances are AWS instances that
run on excess Amazon capacity and can be taken away at any time. According
to Dovrat, you are typically given a warning about 1-2 minutes in advance,
and then your instance is forcefully terminated. To work around this,
Spotinst uses predictive algorithms, frequent state backups, and
diversification across AWS locations so that it can save the state of a
Spot Instance as it is being reclaimed and immediately start up a new one.
This is done by dynamically switching all storage devices and the virtual
network interface card over to the new instance for uninterrupted service.
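As a rough illustration of that warning mechanism: AWS publishes a Spot
termination notice through the instance metadata service shortly before
reclaiming a machine, and the endpoint returns nothing (HTTP 404) until a
termination is actually scheduled. The sketch below is my own illustrative
code, not Spotinst's implementation; the function names and polling
structure are made up for the example.

```python
import json

# On a real Spot Instance, a pending termination is exposed at this
# instance-metadata URL roughly two minutes before the machine is reclaimed.
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def parse_termination_notice(body):
    """Parse the metadata response body. Return the scheduled termination
    time (an ISO-8601 string) if a stop/terminate action is pending,
    or None if no termination is scheduled (empty/404 response)."""
    if not body:
        return None
    notice = json.loads(body)
    if notice.get("action") in ("stop", "terminate"):
        return notice.get("time")
    return None

def check_for_termination(fetch):
    """fetch() should return the metadata response body as a string,
    or None when the endpoint responds with 404 (no pending action).
    A supervisor could call this every few seconds and trigger a state
    snapshot as soon as a termination time comes back."""
    return parse_termination_notice(fetch())
```

In a real supervisor, `fetch` would be a small HTTP GET against
`METADATA_URL`; injecting it as a callable keeps the sketch testable.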

      Additionally, Spotinst can freely interchange Spot Instances and
On-Demand instances depending on workload and availability. In setups such
as MongoDB, where you can have both primary and secondary servers,
Spotinst lets you run a dedicated On-Demand primary server and Spot
Instances for the secondaries, since swapping those out won’t affect the
application. Spotinst has also been developing something called Spotinst
Functions, which will allow you to run functions – such as map-reduce
processes – in a serverless manner on top of abstracted Spot Instances to
minimize user cost.
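The primary/secondary placement rule described above can be sketched in a
few lines. This is a hypothetical illustration of the idea, not Spotinst's
actual logic; the function name and plan format are invented for the
example.

```python
def plan_replica_set(nodes):
    """Given an ordered list of replica-set member names (primary first),
    pin the primary to a stable On-Demand instance and place each
    secondary on a cheaper Spot Instance, since losing a secondary
    does not interrupt the application."""
    primary, *secondaries = nodes
    plan = {primary: "on-demand"}
    plan.update({node: "spot" for node in secondaries})
    return plan
```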

      At the end of this talk, one person in the crowd asked a question
about how to deal with a large number of instances getting shut down
simultaneously and running into AWS API key limitations when trying to spin
up enough new Spot instances.

      Spotinst handles this in two primary ways: diversifying Spot
Instance locations and staggering relaunch times. When a group of Spot
Instances disappears at a single point in time, it usually happens within
a single region. By splitting your Spot Instances across multiple AWS
regions, you minimize the odds of everything being taken down
simultaneously, which is what triggers AWS API rate limits. The second
approach is to stagger relaunch timing: since each machine has roughly 1-2
minutes before it is terminated, Spotinst can gradually recreate those
instances over that entire window rather than creating a vast number of
instances in a very short period of time. For users that still run into
rate limiting due to an enormous number of running Spot Instances,
Spotinst will occasionally use multiple user accounts – and therefore
multiple API keys – to circumvent the limits.
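The staggering idea is simple to sketch: spread the replacement launches
evenly across the roughly two-minute warning window instead of firing them
all at once. This is my own illustrative code, not Spotinst's; the window
length and schedule format are assumptions for the example.

```python
def stagger_relaunches(instance_ids, window_seconds=120):
    """Return a relaunch schedule as (instance_id, delay_seconds) pairs,
    spacing the launches evenly over the termination-warning window so
    the replacement API calls stay under rate limits."""
    if not instance_ids:
        return []
    step = window_seconds / len(instance_ids)
    return [(inst, round(i * step, 2)) for i, inst in enumerate(instance_ids)]
```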

      Finally, when AWS resources begin to run out, Spotinst will attempt
to switch the AWS instance type you are running on, typically moving up in
processing capacity. This avoids creating a bottleneck while still costing
less than spinning up an On-Demand instance. Otherwise, in the worst-case
scenario, Spotinst will spin up On-Demand instances to ensure that your
services continue running.
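That fallback might look something like the following sketch: walk up a
ladder of Spot instance types before giving up and going On-Demand. The
ladder contents and function are hypothetical examples I chose for
illustration, not Spotinst's actual instance list or algorithm.

```python
# Example type ladder, ordered from least to most processing capacity.
SPOT_LADDER = ["m4.large", "m4.xlarge", "m4.2xlarge"]

def next_capacity_choice(current, available_spot_types):
    """Return the next Spot type above `current` that still has capacity,
    or 'on-demand' as the worst-case fallback that keeps the service
    running."""
    start = SPOT_LADDER.index(current)
    for candidate in SPOT_LADDER[start + 1:]:
        if candidate in available_spot_types:
            return candidate
    return "on-demand"
```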



Presentation Two – Benefits of a Serverless Data Pipeline, Itamar Ben Hemo,
Vice President of Keyrus North America and CEO of Rivery



      This presentation was, again, a talk by a company essentially
advertising its own product – but they bought us pizza, so hearing them
out was the least we could do. Keyrus is partnered with a company called
Rivery, which has developed a self-titled Software-as-a-Service platform
that consolidates data from a large variety of APIs and supplies a
management tool, databases, and storage for all of that information. The
ultimate goal of Rivery is to take out a large portion of the redundant
grunt work that most data-analytics organizations and applications go
through. He claims that every developer has to create connectors to a
bunch of different APIs and keep them up to date, and additionally needs
to build management tools and storage solutions for all this data –
something he said would take a team of 3-4 developers around four months
and roughly $100-140k to get fully functional. He also believes that
maintaining and updating connectors to an ever-growing set of APIs drains
an organization, and that Rivery lets the team focus on the important work
of understanding the data rather than acquiring it.

      Rivery also scales using AWS instances and Docker containers. It can
scale up and down by dynamically adding and removing Docker containers on
a particular instance to support an increased workload or to save costs
under a decreased one, and it can scale in and out by creating and
deleting AWS instances depending on the load.
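A minimal sketch of that container-based scale-up/scale-down decision,
assuming a made-up per-container job capacity and ceiling – this is an
illustration of the general idea, not Rivery's actual algorithm:

```python
def scale_containers(queued_jobs, per_container=10, max_containers=50):
    """Size the Docker container count to the queued workload: enough
    containers to cover the queue (assuming each handles `per_container`
    jobs), at least one, and never more than the instance's ceiling."""
    desired = max(1, -(-queued_jobs // per_container))  # ceiling division
    return min(desired, max_containers)
```

Run periodically against the job queue, this yields the add/remove-container
behavior described above; scaling "in and out" would apply the same logic
one level up, to the instance count.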



Presentation Three – Comparing Big Data Architectures on AWS, Ori Rafael,
CEO of Upsolver.



Unfortunately, due to technical difficulties – the speakers were unable to
get their presentation displayed – this talk had to be canceled, bringing
an end to the event.





What I Learned:

      This was a great learning experience for me, mainly because I got to
learn about Spot Instances. I never knew they existed, and the concept
behind Spotinst’s business model is quite interesting. It was also
interesting to hear a bit about how their predictive algorithm works –
enabled mainly by the scale of Spot Instances they run – and how they deal
with API rate limiting and server outages.

      Rivery was interesting, but it wasn’t really a novel concept – it
was mainly a practical, technical implementation for a common problem.