[cs615asa] HW #N

dhupp dhupp at stevens.edu
Sun May 17 22:44:49 EDT 2015


On March 11th I attended the Big Data Engineering & Development "Lessons 
Learned from the Trenches" talk by Dip Kharod. I chose to attend this 
talk for two reasons: first, it was close to my home; second, the topic 
of big data intersects heavily with the database administrator job I 
will be taking after graduating from Stevens Institute.

The meetup stated that it would cover the following topics:
- Securing data in a multi-tenant environment
- Managing resources in a multi-tenant environment
- Data Wrangling and Preparation
- Value-Based Project Planning and Execution

Throughout his presentation, Dip walked through an example system, how 
it was set up, and how it worked. He broke it down into four main steps.

Ingest - Using the Hadoop data lake concept, they were able to import 
data in its raw form without first manipulating it to fit a schema.
Transform - From there, a tool or script transforms the data into the 
format necessary for it to be worked with.
Advanced Analysis - Once the data has been transformed to fit a schema 
and placed into the database proper, it can be analyzed to produce 
meaningful results.
Reporting - Display or publish the conclusions drawn from the analysis 
phase.
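The four steps above could be sketched as a simple pipeline. To be 
clear, every name and record format below is my own illustration, not 
code from the talk:

```python
# Hypothetical sketch of the ingest -> transform -> analyze -> report flow.

def ingest(raw_records):
    """Land data in its raw form, data-lake style -- no schema enforced yet."""
    return list(raw_records)

def transform(raw_records):
    """Coerce raw records into a fixed schema: (name, amount)."""
    rows = []
    for rec in raw_records:
        parts = rec.split(",")
        rows.append({"name": parts[0].strip(), "amount": float(parts[1])})
    return rows

def analyze(rows):
    """Produce a meaningful aggregate from the schema-fitted data."""
    return sum(r["amount"] for r in rows) / len(rows)

def report(result):
    return f"average amount: {result:.2f}"

raw = ["alice, 10", "bob, 30"]
print(report(analyze(transform(ingest(raw)))))  # average amount: 20.00
```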

One of the interesting things Dip mentioned was that, because of the 
way the data lake works, they can feed all of the data from the source 
into the database for processing, including any false, corrupted, or 
incomplete entries. Provided the script or method performing the 
transform step is intelligent enough, it will still manipulate the data 
to fit the necessary schema, but it will mark it as tainted. The data 
analysts performing the analysis can then choose whether or not to 
utilize the tainted data and how much weight it should carry.
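That flag-but-keep behavior might look roughly like this; the `tainted` 
field and the record format are my own assumptions for illustration:

```python
# Hypothetical transform step that keeps malformed records but flags them
# as tainted, rather than dropping them -- analysts can later decide
# whether to use those rows and how much weight to give them.

def transform(raw_records):
    rows = []
    for rec in raw_records:
        parts = rec.split(",")
        try:
            row = {"name": parts[0].strip(),
                   "amount": float(parts[1]),
                   "tainted": False}
        except (IndexError, ValueError):
            # Incomplete or corrupted entry: fit it to the schema anyway,
            # but mark it so downstream analysis knows not to trust it.
            row = {"name": parts[0].strip() if parts else "",
                   "amount": 0.0,
                   "tainted": True}
        rows.append(row)
    return rows

rows = transform(["alice, 10", "bob", "carol, oops"])
clean = [r for r in rows if not r["tainted"]]
print(len(clean), "clean of", len(rows))  # 1 clean of 3
```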

The presentation then touched lightly on managing and securing the 
Hadoop cluster, essentially recommending the tools and features that 
Hadoop itself provides, such as HDFS (the Hadoop Distributed File 
System), for these tasks.

Meetup Link: 
http://www.meetup.com/Big-Data-Palooza-New-Jersey/events/220733080/

