[cs615asa] HW#N Meet up report
Fei Fan
ffei1 at stevens.edu
Thu May 4 02:25:55 EDT 2017
Hello all,
I attended the meet-up in NYC on March 22 at 6:30 PM, the topic of which
was “Data Analytics: How Equinox built a data platform on AWS using
Redshift and EMR”. The meet-up was organized by the AWS New York user
group in the training room of Equinox at 1 Park Ave., Manhattan.
The meet-up was divided into two parts:
I.
The first part was an introduction to AWS products and some “best
practices” for using them, given by Mr. Ilya Epshteyn, a solutions
architect at AWS NYC. In his talk, he first introduced EMR (Elastic
MapReduce) and compared a traditional on-premises Hadoop cluster to a
cloud (EC2) cluster. He pointed out that for an on-premises cluster,
users (or sysAdmins) may need to do a lot of work to build up their own
cluster, such as choosing and purchasing the hardware, setting up the
cluster (e.g., configuring passwordless SSH login for the task nodes),
running MapReduce tasks on the nodes, and monitoring node status.
With EMR, users no longer need to do these complicated tasks or write
sophisticated system tools to automate them. They only need to write
their MapReduce program and upload the JAR file to EMR, along with the
path of the data source in Amazon S3 where the data is stored. EMR uses
EMRFS (an implementation of HDFS backed by S3) as its file system. This
abstracts away the underlying Hadoop+HDFS implementation and provides a
smooth learning curve for Hadoop users. Since Amazon S3 uses 3-way
replication, it ensures data integrity and robustness against internal
system failures.
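This submit-a-JAR-and-point-at-S3 workflow can be sketched in code. The
snippet below builds the kind of job-flow request one might pass to
boto3's EMR run_job_flow call; the bucket names, JAR path, release
label, and instance counts are my own illustrative assumptions, not
anything shown at the meet-up.

```python
# Hedged sketch: building an EMR job-flow request (a cluster definition
# plus one custom-JAR step). All names and sizes here are hypothetical.

def build_job_flow(name, jar_s3_path, input_s3, output_s3, core_nodes=10):
    """Build a job-flow request dict: cluster spec plus one JAR step.

    The step's Args point directly at S3 paths; EMRFS lets Hadoop read
    and write s3:// URIs, so no separate HDFS load step is needed.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.4.0",  # assumed EMR release
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 1 + core_nodes,  # one master + core nodes
        },
        "Steps": [{
            "Name": "wordcount",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": jar_s3_path,
                "Args": [input_s3, output_s3],  # data stays in S3 (EMRFS)
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_job_flow(
    "wordcount-demo",
    "s3://my-bucket/jars/wordcount.jar",  # hypothetical bucket and JAR
    "s3://my-bucket/input/",
    "s3://my-bucket/output/",
)
# With AWS credentials configured, this dict would be submitted as:
#   boto3.client("emr").run_job_flow(**request)
```

Everything a sysAdmin used to script by hand (node setup, SSH config,
monitoring) is reduced to fields in this one request.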
The biggest advantage of EMR is scalability. Users can run their tasks
on clusters of different sizes, for example, from 10 nodes to 400
nodes. In the traditional scenario, automating the addition, removal,
and configuration of nodes to resize an on-premises cluster is a
predictably difficult task for the sysAdmin. A related advantage is the
“Spot Instances” offered by AWS. To minimize the cost of running a
cloud cluster, users only need to set a bid price for the computing
resources, and AWS will find the lowest-priced computing resources in
the market to support the running tasks. If the bid price falls below
the price of any available computing resources in the market, AWS will
inform the user (or sysAdmin) to raise the bid to keep those instances
running. Mr. Ilya Epshteyn also did a rough cost-effectiveness
calculation for enlarging a cluster by a factor of 4, showing that the
cost increases by only 70~80% while the capacity of the cluster grows
linearly to 4 times its previous size. (I actually have some doubts
about this. In my experience, running the same MapReduce job on EMR
clusters of the same size can take different amounts of time, and
doubling the size of the cluster does not guarantee halving the run
time. So I treated this statement as a bit of product promotion.)
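The "4x capacity for 70~80% more cost" claim is at least arithmetically
plausible if the added capacity runs on discounted spot instances. The
toy calculation below uses prices I made up for illustration (they are
not actual AWS rates) to show how the numbers could work out.

```python
# Toy arithmetic behind the "4x capacity, ~70-80% more cost" claim.
# Both prices below are assumptions for illustration only.

ON_DEMAND = 0.10   # $/hour per node, assumed on-demand price
SPOT = 0.025       # $/hour per node, assumed spot price (75% discount)

def cluster_cost(on_demand_nodes, spot_nodes):
    """Hourly cost of a mixed on-demand + spot cluster."""
    return on_demand_nodes * ON_DEMAND + spot_nodes * SPOT

base = cluster_cost(100, 0)        # baseline: 100 on-demand nodes
scaled = cluster_cost(100, 300)    # 4x capacity: 300 extra nodes on spot

increase = (scaled - base) / base  # fractional cost increase: 0.75
```

With a 75% spot discount, quadrupling capacity raises cost by 75%,
right in the quoted 70~80% range; the claim depends entirely on the
spot discount actually available, and says nothing about whether the
job itself finishes 4x faster.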
At the end of the talk, Mr. Ilya Epshteyn mentioned some other
frameworks such as Apache Ranger, a security framework that system
administrators can use to apply access control across Hadoop components
like HDFS, YARN, Hive, and HBase. It provides a web GUI, built on its
RESTful APIs, that helps the administrator understand and control what
is happening in the cluster. He also mentioned some other best
practices, such as how to manage S3 and YARN logs, how to separate
batch processing from interactive queries (for better interactive
performance), and the current ecosystem of AWS Redshift (ETL and BI
tools).
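To make Ranger's RESTful access control concrete, here is a hedged
sketch of the JSON an admin tool might send to create an HDFS policy.
The service name, path, and user are hypothetical, and the field names
follow my understanding of Ranger's public v2 policy API; the exact
schema should be checked against the docs for your Ranger version.

```python
import json

# Hypothetical Ranger policy payload: allow listed users read access
# to an HDFS path. Service/path/user names are made up for illustration.

def hdfs_read_policy(service, path, users):
    """Build a Ranger policy dict granting read access on an HDFS path."""
    return {
        "service": service,                     # Ranger service instance
        "name": "read-" + path.strip("/").replace("/", "-"),
        "resources": {
            "path": {"values": [path], "isRecursive": True},
        },
        "policyItems": [{
            "users": users,
            "accesses": [{"type": "read", "isAllowed": True}],
        }],
    }

policy = hdfs_read_policy("cluster1_hadoop", "/data/analytics", ["alice"])
payload = json.dumps(policy)
# An admin tool would POST this payload to Ranger's REST endpoint, e.g.:
#   POST http://ranger-host:6080/service/public/v2/api/policy
```

The web GUI the speaker mentioned is essentially a front end over
requests like this one, which is why the same controls can also be
scripted.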
II.
The second part was given by Mr. Elliot Cordo, the VP of Data Analytics
at Equinox. In his talk, he first introduced the history of the
company. By showing Equinox's growth path and scale (it runs several
well-known gyms nationwide, which is impressive), he explained the
company's need for a data warehouse to support the data analytics
behind its business strategies, as well as to provide better customer
service.
In the past, Equinox invested a great amount of money in building a
traditional data warehouse, and later found that the investment itself
was not the hardest part: maintaining and upgrading the system became
the biggest issue they were confronted with. As the VP of Data
Analytics, Mr. Elliot Cordo did not say much about the system
architecture or the maintenance and upgrade solutions, but it was clear
that the traditional infrastructure required the company to devote more
human resources to monitoring the system. Finally, Mr. Elliot Cordo
explained why Equinox chose AWS as the platform for migrating its
traditional data warehouse to the cloud, and how they worked with AWS
teams.
What I’ve learned:
I took part in this meet-up because I am really interested in cloud
infrastructure and regarded it as a chance to look at things in a
different way from a sysAdmin's point of view. And I did learn a lot
from it.
In the old days, a sysAdmin's job might involve writing many small and
large system tools to manage configuration, track system status, and
handle unexpected failures. As cloud technologies develop, it seems
that the era of building, testing, and configuring complex tools for
large-scale systems is coming to an end, thanks to “putting everything
on the cloud”. However, this is not the case. On the contrary, I think
this only shifts part of the difficulty from “how to build the right
tools” to “how to choose between different tools to meet my needs”.
The available tools are usually powerful, useful, and complicated.
Although they can make a sysAdmin's life easier, I am not sure that
learning and operating them is any easier. The hidden implementation
details, and the customization required to fit a particular system's
needs, are becoming the new problems sysAdmins face.
Nowadays, with the growth of IaaS, PaaS, and SaaS technologies, more
and more companies tend to outsource their on-premises systems to the
cloud to reduce the cost of maintaining their infrastructure. This
seems reasonable, because most of the jobs that used to be done by
sysAdmins can be simplified by cloud technologies. However, companies
on the cloud clearly place their trust in AWS's infrastructure and
staff, believing that AWS will not make mistakes, intentionally or
unintentionally, and will not exploit this dependency or set up
obstacles to switching to other platforms. And if we look at the system
as a whole, with so many companies outsourcing to the cloud, the cloud
service provider looks like a potential single point of failure under
various threat models. In my opinion, control over the budget is one
issue, while control over the company's data, infrastructure, or even
reputation is a totally different one. How to strike a balance between
them is a question the company, and especially its system
administration team, should ask itself before making any decisions.
To sum up, the sysAdmin's job today is becoming more and more different
from what it used to be. We should learn to play a more important role,
not only in adapting to the technologies we choose, but also in shaping
the strategy for the company's growth.
Link:
https://www.meetup.com/AWS-NYC/events/238481285/
--
Fei Fan
CWID: 10408849 cell: 201-626-8347
Student of Computer Science Department
Stevens Institute of Technology