[cs615asa] HW #N
dhupp
dhupp at stevens.edu
Sun May 17 22:44:49 EDT 2015
On March 11th I attended the Big Data Engineering & Development "Lessons
Learned from the Trenches" talk by Dip Kharod. I chose to attend this
talk for a couple of reasons. First, it was close to my home. Second,
the topic of big data would have major intersections with my job as a
database administrator after graduating from Stevens Institute.
The meetup listing stated that it would cover the following topics:
- Securing data in a multi-tenant environment
- Managing resources in a multi-tenant environment
- Data Wrangling and Preparation
- Value Based Project Planning and Execution
Throughout his presentation Dip talked about an example system and how
it was set up and worked. He broke it down into four main steps.
Ingest - Using the Hadoop data lake concept, they were able to import
data in its raw form without worrying about manipulating it to fit a
schema.
Transform - From there, a tool or script transforms the data into the
format needed to work with it.
Advanced Analysis - Once the data has been transformed to fit a schema
and placed into the database proper, it can be analyzed to provide
meaningful results.
Reporting - Display or produce the conclusions drawn from the analysis
phase.
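To make the four steps more concrete for myself, here is a minimal
Python sketch of the same flow. It is only an illustration and not the
system Dip described: the JSON input, the "city" and "temp" fields, and
all of the function names are my own assumptions.

```python
import json
from collections import defaultdict


def ingest(lines):
    """Ingest: keep every raw line as-is; no schema is enforced yet."""
    return list(lines)


def transform(raw_lines):
    """Transform: parse raw lines into (city, temp) rows (hypothetical schema)."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append((record["city"], float(record["temp"])))
    return rows


def analyze(rows):
    """Analysis: compute the average temperature per city."""
    sums = defaultdict(lambda: [0.0, 0])
    for city, temp in rows:
        sums[city][0] += temp
        sums[city][1] += 1
    return {city: total / count for city, (total, count) in sums.items()}


def report(averages):
    """Reporting: render the analysis results as plain text."""
    return "\n".join(f"{city}: {avg:.1f}" for city, avg in sorted(averages.items()))


raw = [
    '{"city": "Hoboken", "temp": "10"}',  # temperature stored as a string
    '{"city": "Hoboken", "temp": 20}',
    '{"city": "Newark", "temp": 5}',
]
print(report(analyze(transform(ingest(raw)))))
```

The point of the sketch is that ingest does no validation at all; the
schema only shows up in the transform step, which matches the "load it
raw, shape it later" idea behind the data lake.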
One of the interesting things that Dip mentioned was that because of the
way the data lake works, they can feed all of the data from the
source into the database for processing, including any false, corrupted,
or incomplete entries. Provided that the script or method performing the
transform step is intelligent enough, it will still manipulate the
data to fit the necessary schema, but it will mark it as tainted.
The data analysts performing the analysis can then choose
whether or not they want to use the tainted data and how much
weight the tainted data holds.
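The tainted-data idea above could look something like the following
Python sketch. This is my own guess at the approach, not code from the
talk: the Row schema, the "unknown" placeholder, and the tainted_weight
parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Row:
    # Hypothetical target schema for the transform step.
    user_id: str
    amount: float
    tainted: bool


def transform(raw):
    """Fit a raw record to the schema, flagging problems instead of dropping it."""
    tainted = False
    user_id = raw.get("user_id")
    if not user_id:
        user_id = "unknown"  # placeholder so the row still fits the schema
        tainted = True
    try:
        amount = float(raw.get("amount"))
    except (TypeError, ValueError):
        amount = 0.0  # placeholder value; the row is marked as tainted
        tainted = True
    return Row(str(user_id), amount, tainted)


def weighted_mean(rows, tainted_weight=0.5):
    """Average the amounts, letting the analyst choose how much tainted rows count."""
    total = 0.0
    weight_sum = 0.0
    for row in rows:
        weight = tainted_weight if row.tainted else 1.0
        total += weight * row.amount
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


rows = [
    transform({"user_id": "a", "amount": "10"}),
    transform({"amount": "20"}),  # missing user_id, so it is tainted
]
print(weighted_mean(rows, tainted_weight=0.0))  # ignore tainted rows entirely
print(weighted_mean(rows, tainted_weight=1.0))  # treat tainted rows as fully valid
```

Setting tainted_weight to 0 drops the tainted rows from the result, while
1 trusts them completely, so the decision stays with the analyst rather
than with the transform script.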
The presentation then touched lightly on managing and securing the
Hadoop cluster, basically saying to use the tools and features that
Hadoop provides, such as HDFS (the Hadoop Distributed File System), to
do these tasks.
Meetup Link:
http://www.meetup.com/Big-Data-Palooza-New-Jersey/events/220733080/