[cs615asa] HW#N Meet up report
Fei Fan
ffei1 at stevens.edu
Thu May 4 02:25:55 EDT 2017
Hello all,
I attended the meet-up in NYC on March 22 at 6:30 PM, the topic of which
was “Data Analytics: How Equinox built a data platform on AWS using
Redshift and EMR”. The meet-up was organized by the AWS New York user
group in the training room of Equinox at 1 Park Ave., Manhattan.
The meet-up was divided into two parts:
I.
The first part was an introduction to AWS products and some “best
practices” for using them, given by Mr. Ilya Epshteyn, a solutions
architect at AWS NYC. In his talk, he first introduced EMR (Elastic
MapReduce) and compared a traditional on-premises Hadoop cluster to a
cloud (EC2) cluster. He pointed out that for an on-premises cluster,
users (or sysAdmins) may need to do a lot of work to build up their own
cluster, such as choosing and purchasing the hardware, setting up the
cluster (e.g., configuring passwordless SSH login for the task nodes),
running MapReduce tasks on the nodes, and monitoring node status.
With EMR, users no longer need to do these complicated tasks or write
sophisticated system tools to automate them. They only need to write
their MapReduce program and upload the JAR file to EMR, along with the
path of the data source in Amazon S3 where the data is stored. EMR uses
EMRFS (an implementation of HDFS backed by S3) as its file system. This
abstracts away the underlying Hadoop+HDFS implementation and provides a
smooth learning curve for Hadoop users. Since Amazon S3 uses 3-way
replication, it ensures data integrity and robustness against internal
system failures.
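This submit-a-JAR-and-point-at-S3 workflow can be sketched in code. The
snippet below builds the kind of job-flow request one might pass to
boto3's EMR run_job_flow call; the bucket names, JAR path, release
label, and instance counts are my own illustrative assumptions, not
anything shown at the meet-up.

```python
# Hedged sketch: building an EMR job-flow request (a cluster definition
# plus one custom-JAR step). All names and sizes here are hypothetical.

def build_job_flow(name, jar_s3_path, input_s3, output_s3, core_nodes=10):
    """Build a job-flow request dict: cluster spec plus one JAR step.

    The step's Args point directly at S3 paths; EMRFS lets Hadoop read
    and write s3:// URIs, so no separate HDFS load step is needed.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.4.0",  # assumed EMR release
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 1 + core_nodes,  # one master + core nodes
        },
        "Steps": [{
            "Name": "wordcount",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": jar_s3_path,
                "Args": [input_s3, output_s3],  # data stays in S3 (EMRFS)
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_job_flow(
    "wordcount-demo",
    "s3://my-bucket/jars/wordcount.jar",  # hypothetical bucket and JAR
    "s3://my-bucket/input/",
    "s3://my-bucket/output/",
)
# With AWS credentials configured, this dict would be submitted as:
#   boto3.client("emr").run_job_flow(**request)
```

Everything a sysAdmin used to script by hand (node setup, SSH config,
monitoring) is reduced to fields in this one request.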
The biggest advantage of EMR is scalability. Users can run their tasks
on clusters of different sizes, for example, from 10 nodes to 400
nodes. In the traditional scenario, automating the addition, removal,
and configuration of nodes to resize an on-premises cluster is a
predictably difficult task for the sysAdmin. A related advantage is the
“Spot Instances” offered by AWS. To minimize the cost of running a
cloud cluster, users only need to set a bid price for the computing
resources, and AWS will find the lowest-priced computing resources in
the market to support the running tasks. If the bid price falls below
the price of any available computing resources in the market, AWS will
inform the user (or sysAdmin) to raise the bid to keep those instances
running. Mr. Ilya Epshteyn also did a rough cost-effectiveness
calculation for enlarging a cluster by a factor of 4, showing that the
cost increases by only 70~80% while the capacity of the cluster grows
linearly to 4 times its previous size. (I actually have some doubts
about this. In my experience, running the same MapReduce job on EMR
clusters of the same size can take different amounts of time, and
doubling the size of the cluster does not guarantee halving the run
time. So I treated this statement as a bit of product promotion.)
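The "4x capacity for 70~80% more cost" claim is at least arithmetically
plausible if the added capacity runs on discounted spot instances. The
toy calculation below uses prices I made up for illustration (they are
not actual AWS rates) to show how the numbers could work out.

```python
# Toy arithmetic behind the "4x capacity, ~70-80% more cost" claim.
# Both prices below are assumptions for illustration only.

ON_DEMAND = 0.10   # $/hour per node, assumed on-demand price
SPOT = 0.025       # $/hour per node, assumed spot price (75% discount)

def cluster_cost(on_demand_nodes, spot_nodes):
    """Hourly cost of a mixed on-demand + spot cluster."""
    return on_demand_nodes * ON_DEMAND + spot_nodes * SPOT

base = cluster_cost(100, 0)        # baseline: 100 on-demand nodes
scaled = cluster_cost(100, 300)    # 4x capacity: 300 extra nodes on spot

increase = (scaled - base) / base  # fractional cost increase: 0.75
```

With a 75% spot discount, quadrupling capacity raises cost by 75%,
right in the quoted 70~80% range; the claim depends entirely on the
spot discount actually available, and says nothing about whether the
job itself finishes 4x faster.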
At the end of the talk, Mr. Ilya Epshteyn mentioned some other
frameworks such as Apache Ranger, a security framework that system
administrators can use to apply access control across Hadoop components
like HDFS, YARN, Hive, and HBase. It provides a web GUI, built on its
RESTful APIs, that helps the administrator understand and control what
is happening in the cluster. He also mentioned some other best
practices, such as how to manage S3 and YARN logs, how to separate
batch processing from interactive queries (for better interactive
performance), and the current ecosystem of AWS Redshift (ETL and BI
tools).
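To make Ranger's RESTful access control concrete, here is a hedged
sketch of the JSON an admin tool might send to create an HDFS policy.
The service name, path, and user are hypothetical, and the field names
follow my understanding of Ranger's public v2 policy API; the exact
schema should be checked against the docs for your Ranger version.

```python
import json

# Hypothetical Ranger policy payload: allow listed users read access
# to an HDFS path. Service/path/user names are made up for illustration.

def hdfs_read_policy(service, path, users):
    """Build a Ranger policy dict granting read access on an HDFS path."""
    return {
        "service": service,                     # Ranger service instance
        "name": "read-" + path.strip("/").replace("/", "-"),
        "resources": {
            "path": {"values": [path], "isRecursive": True},
        },
        "policyItems": [{
            "users": users,
            "accesses": [{"type": "read", "isAllowed": True}],
        }],
    }

policy = hdfs_read_policy("cluster1_hadoop", "/data/analytics", ["alice"])
payload = json.dumps(policy)
# An admin tool would POST this payload to Ranger's REST endpoint, e.g.:
#   POST http://ranger-host:6080/service/public/v2/api/policy
```

The web GUI the speaker mentioned is essentially a front end over
requests like this one, which is why the same controls can also be
scripted.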
II.
The second part was given by Mr. Elliot Cordo, the VP of Data Analytics
at Equinox. In his talk, he first introduced the history of the
company. By showing Equinox's growth path and scale (it runs several
well-known gyms nationwide, which is impressive), he explained the
company's need for a data warehouse to support the data analytics
behind its business strategies, as well as to provide better customer
service.
In the past, Equinox invested a great amount of money in building a
traditional data warehouse, and later found that the investment itself
was not the hardest part: maintaining and upgrading the system became
the biggest issue they were confronted with. As the VP of Data
Analytics, Mr. Elliot Cordo did not say much about the system
architecture or the maintenance and upgrade solutions, but it was clear
that the traditional infrastructure required the company to devote more
human resources to monitoring the system. Finally, Mr. Elliot Cordo
explained why Equinox chose AWS as the platform for migrating its
traditional data warehouse to the cloud, and how they worked with AWS
teams.
What I’ve learned:
I took part in this meet-up because I am really interested in cloud
infrastructure and regarded it as a chance to look at things in a
different way from a sysAdmin's point of view. And I did learn a lot
from it.
In the old days, a sysAdmin's job might involve writing many small and
large system tools to manage configuration, track system status, and
handle unexpected failures. As cloud technologies develop, it seems
that the era of building, testing, and configuring complex tools for
large-scale systems is coming to an end, thanks to “putting everything
on the cloud”. However, this is not the case. On the contrary, I think
this only shifts part of the difficulty from “how to build the right
tools” to “how to choose between different tools to meet my needs”.
The available tools are usually powerful, useful, and complicated.
Although they can make a sysAdmin's life easier, I am not sure that
learning and operating them is any easier. The hidden implementation
details, and the customization required to fit a particular system's
needs, are becoming the new problems sysAdmins face.
Nowadays, with the growth of IaaS, PaaS, and SaaS technologies, more
and more companies tend to outsource their on-premises systems to the
cloud to reduce the cost of maintaining their infrastructure. This
seems reasonable, because most of the jobs that used to be done by
sysAdmins can be simplified by cloud technologies. However, companies
on the cloud clearly place their trust in AWS's infrastructure and
staff, believing that AWS will not make mistakes, intentionally or
unintentionally, and will not exploit this dependency or set up
obstacles to switching to other platforms. And if we look at the system
as a whole, with so many companies outsourcing to the cloud, the cloud
service provider looks like a potential single point of failure under
various threat models. In my opinion, control over the budget is one
issue, while control over the company's data, infrastructure, or even
reputation is a totally different one. How to strike a balance between
them is a question the company, and especially its system
administration team, should ask itself before making any decisions.
To sum up, the sysAdmin's job today is becoming more and more different
from what it used to be. We should learn to play a more important role,
not only in adapting to the technologies we choose, but also in shaping
the strategy for the company's growth.
Link:
https://www.meetup.com/AWS-NYC/events/238481285/
--
Fei Fan
CWID: 10408849 cell: 201-626-8347
Student of Computer Science Department
Stevens Institute of Technology