What Do You Feed a Big Yellow Elephant Named Hadoop? Datasets

15 Nov, 2015

I host a weekly podcast called the Nutanix Community Podcast with Dwayne Lessner, a colleague of mine. We talk about technology and the latest developments with Nutanix, and we chat with folks from the broader IT community.

This past week, Dwayne and I chatted with Andrew Nelson about Hadoop. I must admit, I have never had to deploy or support Hadoop, but speaking to Andrew and Dwayne got me interested in exploring what Hadoop is and how it’s applied.

Hadoop is software you will find at many web-scale companies like Facebook, Google, and Yahoo, and nowadays you will also find it in the cloud on Microsoft Azure and Amazon. The Wikipedia page for Apache Hadoop shares the following about running Hadoop on Amazon: The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs within 24 hours at a computation cost of about $240 (not including bandwidth).

What is Hadoop and why the yellow elephant?

If my understanding is correct, Hadoop is a way of storing large data sets across clusters of computers and of processing that data where it lives. You can think about it in two parts: storing the data, which uses the Hadoop Distributed File System (HDFS), and processing the data, which uses MapReduce.
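To make the MapReduce half a little more concrete, here is a minimal sketch in plain Python of the classic word-count example. It does not use Hadoop at all: the function names (map_phase, shuffle, reduce_phase, word_count) are my own, chosen for illustration, and everything runs in a single process. On a real cluster, as I understand it, the map and reduce steps would run in parallel on many machines against data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (hypothetical helper)."""
    pairs = []
    for line in lines:
        pairs.extend(map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    lines = ["big yellow elephant", "yellow elephant", "elephant"]
    print(word_count(lines))  # {'big': 1, 'yellow': 2, 'elephant': 3}
```

The point of splitting the work this way is that each map call only needs its own line, and each reduce call only needs the values for its own key, so both steps can be spread across many machines.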

Here is a side note: ever wonder why it’s called Hadoop and uses a yellow elephant as its logo? Well, Doug Cutting, one of the co-creators, named it “Hadoop” after his son’s yellow plush elephant toy. That’s one of the stories I heard, and I’m sticking with it.

This video from Mike Gualtieri, a Forrester Principal Analyst, is an excellent resource that simplifies what Hadoop is and how it works.

You can get Hadoop from the Apache Software Foundation, and several big-name companies, such as Cloudera and Hortonworks, have a Hadoop offering.

You can also find past episodes of the Nutanix Community Podcast here. Please let me know what you think of the podcast, and consider subscribing; you can find it on iTunes, Spotify or SoundCloud. If you are interested in being a guest, add your details in the comments below. If you’ve used Hadoop, I would love to hear about your experience as well.

Please let me know what you think on Twitter, and thank you for reading.