Are you interested in learning Hadoop but do not know where to start?
If the answer is yes, then this post will walk you step by step through the learning process.
Why did we decide to write this post?
The answer is quite simple. There are many EXCELLENT books, blogs, articles, documentation pages and websites on Hadoop. An abundance of good information is available on the internet free of cost, but it is all scattered. This leaves someone new to Hadoop overwhelmed and lost. Don’t feel bad if you feel lost trying to learn Hadoop; most new learners feel the same way. There is no step by step approach to guide new learners to learn and cherish this amazing technology. This post aims to break that road block. We will walk you through, step by step and day by day (for 7 days), what you will learn each day and how to understand the concepts better without getting overwhelmed.
Note to someone who is new to Hadoop
If you are new to Hadoop, don’t think of this post as read once and done. Bookmark this page, because you will be coming back to it over the coming days. Think of this as a course rather than an article or “just” another blog post. So from here on we are going to call this post a “course”. Before you decide to spend hundreds or thousands of dollars on Hadoop training, spend a few days on what is laid out in this post (course) and give it an honest shot. You will be amazed how much you can learn in just 7 days.
And the best part is we are only going to spend 3 to 4 hours a day for 7 days to learn this amazing technology.
We will also answer one of the most common questions from Hadoop learners: do I need to be a Java (or any language) programmer to learn Hadoop? The answer is, absolutely not. The Hadoop ecosystem is massive and many of its tools do not require Java or any programming background.
This post (course) is literally for anyone who wants and is willing to learn this AWESOME technology.
WHAT DO YOU NEED?
You will need 3 to 4 hours each day for 7 days. If you have the focus, you can cover several topics in a day and finish the course faster. However, if you are totally new to the technology, we suggest you take it slow so that you can fully understand the concepts. Give yourself enough time to do the hands-on work as well.
Hadoop – The Definitive Guide (3rd Edition) by Tom White
Hadoop – The Definitive Guide (3rd Edition) by Tom White is like the bible for Hadoop. Anyone who is experienced in Hadoop has read this book at least once. If you don’t have your copy, click on the link below to get it from Amazon. This is a must have for this course. We will be using this book to study the concepts and for practice. All the references made in this course are from the Hadoop Definitive Guide book.
Why buy a book when you have so many materials online and for free?
Here are a few reasons why –
- Yes, there is a lot of free information available online, but as we said it is scattered all over the place. No wonder new learners are getting lost.
- You don’t want to spend months and months trying to learn Hadoop. This book has a very good structure and will keep us focused throughout the learning process.
- It’s a great book and it is certainly worth the investment.
If I need the book, why do I have to follow this day by day course? Why can’t I learn it straight from the book?
Excellent question. Hadoop – The Definitive Guide is an excellent book; it is well structured and goes into great depth in explaining Hadoop. That is why it is rightfully named the “guide”, and it is one of the books you need to have on your bookshelf if you are into distributed systems.
In this course, we won’t be asking you to go through the book chapter by chapter. We will be skipping some topics and even some chapters. We will even cover topics in a different order from how they are listed in the book. Why? For new learners this book can be intimidating, and if you are just starting with the technology there is no need to study the book cover to cover. We will pick the topics that help us understand the technology without losing any key information. Once you are done with the course, you will be in a great position to pick up the topics where you need more help or want to know more, as the need arises.
P.S. All topics in the book are important and carefully structured. We will be skipping some topics from the book in this course and this is not to say that the topic is not important. The goal of this course is to give you a head start with Hadoop and the intention is not to restructure the book. You should come back to topics that we skipped on an as needed basis.
Cloudera (or any distribution) VM
Just as you cannot learn how to ride a bike just by reading a book, you cannot understand Hadoop fully just by reading a book or reading this course. You need to practise to get the most out of this course.
Don’t spend time setting up a Hadoop cluster by yourself or using Cygwin on your Windows laptop. We see a lot of new learners do this and it is a waste of time. Spend time learning and understanding the technology, and then you will be in a better position to set up a cluster of your own with very little effort.
If we are not setting up a cluster, how can we practise? Again, great question. Most companies that offer a Hadoop distribution also offer a Hadoop sandbox with Hadoop and all the necessary Hadoop components preinstalled. This is a great way to start because it gives you an excellent head start. You can download the sandbox from any of the vendors. Here is the Quick Start VM link for CDH 5.2 from Cloudera.
ABOUT THIS COURSE
Now you know what you need to start this course. This is a 7 day course, meaning you will be learning a topic each day. Before you start with Day 1, read through what is listed for each of the 7 days and plan ahead. We have suggested the number of hours needed for each day, which will help you plan in advance. Once you start with Day 1, stick with your plan and finish it in 7 days straight. Don’t prolong the course. Remember to practise. The course is not complete without practice. To make sure that you get the most out of the course, we have included questions that you should try to answer before moving on to the next day. Answering the questions satisfactorily will give you the confidence to move on to the next day and will also identify the areas for improvement.
Now let’s get started !!!
DAY 1 – WHY HADOOP? WHAT IS HADOOP?
No of hours needed – 1 hour
Go to Chapter 1: Meet Hadoop (Hadoop – The Definitive Guide book)
This chapter talks about Hadoop at a high level and explains the Big Data problem. The most valuable lesson from this chapter is the comparison of Hadoop with other systems like RDBMS or traditional distributed processing systems. Don’t expect to learn about HDFS and MapReduce yet.
Start by reading the “Data!” section. This will give you some examples of real world Big Data scenarios, e.g. Facebook hosts approximately 10 billion photos, taking up one petabyte of storage. You have probably heard those scenarios 100 times already. If you feel that way, don’t waste your time; move on to “A Brief History of Hadoop”.
“A Brief History of Hadoop” talks about the problems faced by Doug Cutting (cofounder of Hadoop) and Google’s solution to their Big Data problems, followed by a timeline of Hadoop and its evolution. This is a very quick read.
Go to “Data Storage and Analysis”. This section talks about the problems with traditional storage and how disk access speeds affect processing times. HDFS and MapReduce are mentioned in this section, but there are dedicated chapters on both.
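The arithmetic behind this section is worth doing once yourself. Here is a minimal sketch using the round figures the book quotes (a 1 TB drive read at roughly 100 MB/s). The function name is ours and purely illustrative, and real drives will vary.

```python
def read_time_seconds(data_mb, transfer_mb_per_s, drives=1):
    """Time to scan data_mb megabytes spread evenly over `drives` disks
    that are all read in parallel."""
    return (data_mb / drives) / transfer_mb_per_s


ONE_TB_IN_MB = 1_000_000    # decimal terabyte
TRANSFER_MB_PER_S = 100     # rough sequential read speed of one drive (assumption)

single = read_time_seconds(ONE_TB_IN_MB, TRANSFER_MB_PER_S)
parallel = read_time_seconds(ONE_TB_IN_MB, TRANSFER_MB_PER_S, drives=100)

print(f"1 drive:    {single / 3600:.1f} hours")     # 2.8 hours
print(f"100 drives: {parallel / 60:.1f} minutes")   # 1.7 minutes
```

Reading from many disks at once is the whole idea: the same scan drops from hours to under two minutes, which is the motivation for spreading data across a cluster.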
Now move on to the “Comparison with Other Systems” section. This is a very important section as it compares Hadoop with RDBMS, Grid and Volunteer computing. Every system has its pros and cons and Hadoop is no exception. Hadoop is good at some things and not so good at a few others. Try to understand those key aspects.
Ignore the “Hadoop Releases” section, as the book specifically calls out each case where it is referencing an older release.
That is it, you are done with Chapter 1. Yes, we skipped a few topics, and that is OK. This book is like a reference guide for Hadoop. We can always come back to those topics on an as needed basis.
TEST YOURSELF
- What is the need for Hadoop when you have traditional RDBMS?
- Compare Hadoop vs RDBMS
- What are the technical difficulties when you are dealing with BIG datasets?
- What are the areas where Hadoop is not so good?
Day 1 is light. If you have the energy, are intrigued by the technology and can’t wait, then move on to Day 2 right away.
DAY 2 – HDFS
No of hours needed – 3 hours
Go to Chapter 3: “The Hadoop Distributed Filesystem” (Hadoop – The Definitive Guide book)
We are not entirely skipping MapReduce (chapter 2), we have just switched the order of the chapters.
Start with the introduction and “The Design of HDFS” section. Then move on to “HDFS Concepts”
When you read about “Blocks”, make sure you understand why the block size is so large in Hadoop compared to the 4 KB or so cluster size in traditional file systems. Understand the significance of blocks and how they help with sequential disk reads.
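The reasoning for large blocks comes down to keeping seek time small relative to transfer time. A rough sketch, using the book's round numbers (about 10 ms per seek, 100 MB/s transfer, seek time held to 1% of transfer time). The function names are ours, and the 64 MB and 128 MB block sizes are just common configuration values used for illustration.

```python
def block_size_for_overhead(seek_time_s, transfer_mb_per_s, seek_fraction):
    """Block size in MB at which one seek costs `seek_fraction` of the transfer time.

    transfer_time = block_mb / rate, and we want seek_time = fraction * transfer_time,
    so block_mb = seek_time * rate / fraction.
    """
    return seek_time_s * transfer_mb_per_s / seek_fraction


def block_count(file_mb, block_mb):
    """Number of blocks a file occupies (the last block may be only partly full)."""
    return -(-file_mb // block_mb)   # ceiling division on integers


print(block_size_for_overhead(0.010, 100, 0.01))   # about 100 (MB)
print(block_count(1024, 64))                       # 16 blocks for a 1 GB file
print(block_count(1024, 128))                      # 8 blocks
print(block_count(1, 128))                         # a tiny file still needs 1 block
```

Fewer, larger blocks also mean less metadata for the Namenode to track, which is worth keeping in mind when you reach the small files question on Day 5.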
Move on to “Namenodes and Datanodes”. Understand the importance and functions of the Namenode and the Datanode. Skip “HDFS Federation” and “HDFS High-Availability”.
At this point you should know how HDFS offers fault tolerance and scalability. Move on to “The Command-Line Interface” section. Now it is time to get hands-on with some HDFS commands. Try HDFS commands in the Cloudera VM. The book lists only a few commands at this point. Try more commands from the documentation.
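If you like to keep your practice commands in a script, here is one possible sketch. It only builds and prints a few `hadoop fs` command lines (`-mkdir`, `-copyFromLocal`, `-ls`, `-cat`); the file and directory names are made up, and nothing is run against a cluster unless you call `run_all` yourself inside the VM.

```python
import shutil
import subprocess


def hdfs_command(*args):
    """Build the argument list for one `hadoop fs` invocation."""
    return ["hadoop", "fs", *args]


# Hypothetical paths: substitute a local file of your own. Relative HDFS paths
# are resolved against your HDFS home directory.
PRACTICE_COMMANDS = [
    hdfs_command("-mkdir", "practice"),
    hdfs_command("-copyFromLocal", "notes.txt", "practice/notes.txt"),
    hdfs_command("-ls", "practice"),
    hdfs_command("-cat", "practice/notes.txt"),
]


def hadoop_available():
    """True when a `hadoop` client is on the PATH (for example inside the sandbox VM)."""
    return shutil.which("hadoop") is not None


def run_all(commands):
    """Run each command in order, stopping at the first one that fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


for cmd in PRACTICE_COMMANDS:
    print(" ".join(cmd))
```

Typing the printed commands by hand in the VM terminal works just as well; the point is to try each one and look at what changes in HDFS.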
Now you understand what HDFS is and how to interact with it using shell commands. It is time to understand the internals of HDFS reads and writes. Go to the “Data Flow” section and read about “Anatomy of a File Read” and “Anatomy of a File Write”. Make sure you understand what happens behind the scenes when you read and write files to HDFS.
Go to Chapter 4: Hadoop I/O and read just the Data Integrity topic. Skip Compression and other topics under that chapter. We will get to those later.
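To get a feel for the idea before reading the topic, here is a small sketch of chunked checksums. The book describes a CRC-32 checksum kept for every 512 bytes of data by default. We use Python's `zlib.crc32` as a stand-in for whatever checksum your Hadoop version actually uses, and the function names are ours.

```python
import zlib

BYTES_PER_CHECKSUM = 512   # chunk size; the book gives 512 bytes as the default


def chunk_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """Return one CRC-32 value per chunk_size-byte chunk of data."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def find_corrupt_chunks(data, stored_checksums, chunk_size=BYTES_PER_CHECKSUM):
    """Recompute checksums on read and list the chunk indexes that no longer match."""
    fresh = chunk_checksums(data, chunk_size)
    return [i for i, (new, old) in enumerate(zip(fresh, stored_checksums))
            if new != old]


original = bytes(range(256)) * 8        # 2048 bytes, so 4 chunks
stored = chunk_checksums(original)      # computed once, at write time

damaged = bytearray(original)
damaged[700] ^= 0xFF                    # flip the bits of one byte inside chunk 1
print(find_corrupt_chunks(bytes(damaged), stored))   # [1]
```

Knowing which chunk is bad is what lets a client fall back to another replica of the block instead of returning corrupt data.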
That is it, you are done with the HDFS chapter (for now). We have skipped a lot of sections but we have learned what we need at this point about HDFS. We will be coming back to this chapter at a later stage.
- How is HDFS different from other File Systems?
- Why is the block size for HDFS huge compared to traditional file systems?
- How do you get fault tolerance and reliability with HDFS?
- What is the purpose of Namenode?
- Did you try out commands?
DAY 3 – MAPREDUCE PART-1
No of hours needed – 4 hours
Now that we know a lot about HDFS, we are in an excellent position to learn about MapReduce.
Go to Chapter 2: MapReduce (Hadoop – The Definitive Guide book)
Tom White has chosen to use the Weather Dataset to teach MapReduce, and we feel it is a great and easy way to illustrate the concepts compared to the Word Count example. Java knowledge is essential for this chapter. If you are not a programmer, still go over this chapter. Without it, it is very hard to understand most of what Hadoop can offer.
Go over all the following topics and all their corresponding sub topics – A Weather Dataset, Analyzing the Data with Unix Tools, Analyzing the Data with Hadoop and Scaling Out. Once you are done reading those topics, you should understand the basics of MapReduce. Don’t try to run a MapReduce job yet. Let’s first understand what goes on behind the scenes when you run a MapReduce job. Why? Because if the job fails, you will be in a better position to troubleshoot knowing the processes that are involved in a MapReduce execution.
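If the map, shuffle and reduce steps still feel abstract, a toy simulation can help. The sketch below is plain Python, not Hadoop code. It runs the maximum temperature idea on a handful of made-up "year temperature" records standing in for the book's weather data, and the function names are ours.

```python
from itertools import groupby
from operator import itemgetter

# Made-up records in a simplified "year temperature" format (an assumption;
# the real weather records are fixed-width lines).
RECORDS = ["1950 0", "1950 22", "1950 -11", "1949 111", "1949 78"]


def map_fn(record):
    """Map: pull out the (year, temperature) pair from one input record."""
    year, temp = record.split()
    yield year, int(temp)


def reduce_fn(year, temps):
    """Reduce: receives every temperature for one year and keeps the maximum."""
    yield year, max(temps)


def run_job(records):
    # Map phase: every record is processed independently.
    mapped = [pair for record in records for pair in map_fn(record)]
    # Shuffle and sort: bring all values for the same key together, in key order.
    mapped.sort(key=itemgetter(0))
    output = []
    for year, group in groupby(mapped, key=itemgetter(0)):
        temps = [temp for _, temp in group]
        # Reduce phase: one call per key.
        output.extend(reduce_fn(year, temps))
    return output


print(run_job(RECORDS))   # [('1949', 111), ('1950', 22)]
```

In a real job the map calls run on many machines at once and the shuffle moves data over the network, but the flow of keys and values is the same.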
Go to Chapter 6: How MapReduce Works in Hadoop – The Definitive Guide book.
Read and understand the topics and their corresponding subtopics – Anatomy of a MapReduce Job Run, Failures.
Now run the MapReduce job in the Cloudera VM. Follow the output, check the Application Master UI and make sure the job executed successfully.
Now we have covered the basics of both HDFS and MapReduce. At this stage you should be in a position to comprehend what Hadoop can offer.
- What is the input to Mapper?
- How does a Reducer get its input?
- What are the Job Tracker and the Task Tracker?
- What are the Resource Manager, Application Master and Node Manager?
- YARN vs. Classic MapReduce
- How are failures handled in both Classic MapReduce and YARN?
- Did you try to run a MapReduce program in the Sandbox?
DAY 4 – MAPREDUCE PART-2
No of hours needed – 3 hours
Now that we know the basics of MapReduce, it is time to go a little deeper and learn a few more concepts. As the book rightfully says, Shuffle and Sort is the heart of MapReduce and is where the “magic” happens. So let’s go to Chapter 6 and go over the topic Shuffle and Sort. In the same chapter, go over all the sub topics under Task Execution. Pay close attention to Speculative Execution and Task JVM Reuse as they are frequently asked about in interviews.
Go to Chapter 5: Developing a MapReduce Application and read the full chapter.
Go to Chapter 8: MapReduce Features and read all the topics. This chapter could be a little intense for some learners. Joins are a pain to implement in MapReduce, and most of the time joins are performed using tools like Pig and Hive. Make sure to understand Distributed Cache as it is used in optimizing joins.
- What optimizations can you do in the shuffle phase?
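To see why the Distributed Cache helps with joins, consider the case where one side of the join is small enough to fit in memory. A rough sketch in plain Python (not the Hadoop API): the dictionary stands in for a small file that would be shipped to every map task through the distributed cache, and the station ids, names and function names are all made up.

```python
def load_side_table(lines):
    """Parse 'station_id<TAB>name' lines into a dict (done once per map task)."""
    table = {}
    for line in lines:
        station_id, name = line.rstrip("\n").split("\t")
        table[station_id] = name
    return table


def join_mapper(record, side_table):
    """Join one 'station_id<TAB>temperature' record against the in-memory table."""
    station_id, temp = record.split("\t")
    name = side_table.get(station_id)
    if name is not None:               # inner join: drop records with no match
        yield station_id, name, int(temp)


# Hypothetical data for illustration only.
stations = load_side_table(["S1\tNorth Station", "S2\tSouth Station"])
records = ["S1\t22", "S2\t-11", "S9\t5"]

joined = [row for record in records for row in join_mapper(record, stations)]
print(joined)   # [('S1', 'North Station', 22), ('S2', 'South Station', -11)]
```

Because the lookup happens inside the mapper, the large dataset never has to be shuffled and sorted just to be joined, which is the main attraction of a map side join.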
- What is speculative execution?
- What is distributed cache and how can it be used with Joins?
- Difference between Map and Reduce side joins
DAY 5 – FILE FORMATS
No of hours needed – 4 hours
Go to Chapter 7: MapReduce Types and Formats. In this chapter, pay close attention to understanding the role of the Partitioner, Input Splits and Records. Hadoop by design can handle a lot of file formats. This chapter lists some of the most commonly used formats. Understand how each format is different and how it is used, and move on.
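The role of the Partitioner is easier to remember once you have seen how little it does. Hadoop's default partitioner hashes the key and takes the result modulo the number of reducers. Below is a sketch of that idea in Python; we use `zlib.crc32` as a stand-in for Java's `hashCode()` because Python's built-in string hash changes between runs, so the actual partition numbers will not match what Hadoop would pick.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key: non-negative hash, modulo the reducer count."""
    stable_hash = zlib.crc32(key.encode("utf-8"))   # stand-in for hashCode()
    return (stable_hash & 0x7FFFFFFF) % num_reducers


def assign(keys, num_reducers):
    """Group keys by the reducer (partition number) they would be sent to."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[partition(key, num_reducers)].append(key)
    return dict(buckets)


print(assign(["1949", "1950", "1951", "1949", "1950"], num_reducers=2))
```

Whatever the numbers turn out to be, every occurrence of the same key lands in the same partition, and that guarantee is what lets a reducer see all the values for its keys.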
Go to Chapter 4: Hadoop I/O. Go to Compression and understand the issues with Compression and MapReduce. Then go to File-Based Data Structures and read about SequenceFile and MapFile.
Don’t worry about writing custom formats, Writables and Avro. Come back to those topics on an as needed basis once you have covered and understood all the basics.
- How is the number of Mappers calculated for a job, and what about the number of Reducers?
- Input Splits vs Blocks
- Input Splits and Records
- How to control the size and contents of Input Splits?
- What is the small files problem in Hadoop and how do you tackle it?
- How to deal with multiple input files in a MapReduce job?
- Run a MapReduce program against a bunch of smaller files (less than block size) and check out the number of mappers invoked
- Compress all the smaller files into one big file and run a MapReduce program against it. Now check out the number of mappers invoked
- Convert the small files into one Sequence file with Block level compression and run a MapReduce program against it. Now check out the number of mappers invoked
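Before you run the three experiments above, it helps to predict the mapper counts. The sketch below is a simplified estimate, not Hadoop's actual code: it uses the split size formula from the book, max(minimum size, min(maximum size, block size)), treats every file as at least one split, and treats a non-splittable file (such as a gzip file) as exactly one split. It ignores the small slack Hadoop allows on the last split, and the file sizes and the 128 MB block size are assumptions.

```python
def split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size as described in the book: max(min, min(max, block size))."""
    return max(min_size, min(max_size, block_size))


def estimate_map_tasks(files, block_size_mb=128):
    """Estimate map tasks for a list of (size_mb, splittable) files.

    A splittable file yields one split per split-size chunk; a file that
    cannot be split is handled by a single mapper regardless of its size.
    """
    size = split_size(block_size_mb)
    total = 0
    for file_mb, splittable in files:
        if splittable:
            total += -(-file_mb // size)   # ceiling division
        else:
            total += 1
    return total


small_files = [(1, True)] * 1000     # 1000 files of 1 MB each
one_plain_file = [(1000, True)]      # the same data as one uncompressed file
one_gzip_file = [(1000, False)]      # one big gzip file cannot be split

print(estimate_map_tasks(small_files))      # 1000
print(estimate_map_tasks(one_plain_file))   # 8
print(estimate_map_tasks(one_gzip_file))    # 1
```

Compare these predictions with the number of mappers your jobs actually launch in the VM. A Sequence file with block level compression remains splittable, so expect it to behave like the uncompressed case for its (smaller) size rather than like the gzip case.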
DAY 6 – ADMINISTRATIVE ELEMENTS
No of hours needed – 3 hours
The Namenode is the most important node in your Hadoop cluster. So let’s go over some key configuration topics to help us deal with Namenode failure.
Go to Chapter 10: Administering Hadoop and cover the HDFS topic. This will explain all about Secondary Namenode and its functions.
Go back to Chapter 3: The Hadoop Distributed Filesystem and read HDFS Federation and HDFS High-Availability topics. These topics are primarily to protect your cluster during Namenode failures.
Go to Chapter 9: Setting Up a Hadoop Cluster and read the following 2 topics – Security and Benchmarking a Hadoop Cluster.
Go to Chapter 6: How MapReduce Works and read Job Scheduling and different types of scheduling options available in the recent versions of Hadoop.
There are several ways to configure and stand up a Hadoop cluster. If you are configuring a CDH cluster, the easiest way is to use Cloudera Manager on AWS. We will write a separate step by step guide to configure an AWS cluster in the coming days.
- What is the difference between Fair and Capacity scheduler?
- What is the function of the secondary Namenode?
- What can you do to protect yourself from a Secondary Namenode failure?
- Explain the checkpointing process done by the Secondary Namenode
- Benchmark the Namenode
- Configure the Capacity scheduler
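For the checkpointing question above, a toy model may help you explain it in your own words. The idea, as the book describes it, is that the namespace image (fsimage) is periodically merged with the edit log so that the edit log does not grow without bound. The sketch below is a deliberately simplified illustration: the namespace is just a set of paths, the three edit operations are ones we picked, and none of this is the real on-disk format.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace (a set of paths)."""
    op = edit[0]
    if op == "create":
        namespace.add(edit[1])
    elif op == "delete":
        namespace.discard(edit[1])
    elif op == "rename":
        namespace.discard(edit[1])
        namespace.add(edit[2])
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edits):
    """Merge the edit log into the image; return the new image and an empty log."""
    merged = set(fsimage)          # work on a copy, as the merge happens off to the side
    for edit in edits:
        apply_edit(merged, edit)
    return merged, []


# Hypothetical namespace and edit log.
fsimage = {"/user/a.txt", "/user/b.txt"}
edits = [
    ("create", "/user/c.txt"),
    ("delete", "/user/a.txt"),
    ("rename", "/user/b.txt", "/user/d.txt"),
]

new_image, new_edits = checkpoint(fsimage, edits)
print(sorted(new_image))   # ['/user/c.txt', '/user/d.txt']
print(new_edits)           # []
```

After a checkpoint, a restarting Namenode only has to load the fresh image and replay the few edits made since, instead of replaying a very long log.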
DAY 7 – HADOOP FOR NON JAVA PROGRAMMERS
No of hours needed – 3 hours
Hadoop is not only for Java programmers. Non Java programmers can also use Hadoop, and the topics below describe tools designed for non Java programmers and for interoperability between programs. Still go through this topic even if you are a Java programmer.
Go to Chapter 2: MapReduce and start with Hadoop Streaming and the sub topics underneath it.
Go to Chapter 4: Hadoop I/O and start with Avro and the sub topics underneath it.
- Run a MapReduce program using Streaming
- Create an Avro file using Java and read the file using Python
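For the Streaming exercise above, here is one possible starting point. Streaming feeds input records to your mapper on standard input, sorts the mapper's output by key, and feeds that to your reducer on standard input. The sketch below puts both roles in one script; the simplified "year,temperature" input format and the function names are our assumptions, so adapt the parsing to whatever data you use.

```python
import sys


def map_lines(lines):
    """Mapper: turn 'year,temperature' lines into 'year<TAB>temperature' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        year, temp = line.split(",")
        yield f"{year}\t{int(temp)}"


def reduce_lines(lines):
    """Reducer: input arrives sorted by key; emit the maximum temperature per year."""
    current_year, current_max = None, None
    for line in lines:
        year, temp = line.strip().split("\t")
        temp = int(temp)
        if year != current_year:
            if current_year is not None:
                yield f"{current_year}\t{current_max}"
            current_year, current_max = year, temp
        else:
            current_max = max(current_max, temp)
    if current_year is not None:
        yield f"{current_year}\t{current_max}"


# Run as "map" or "reduce" so one file can serve as both Streaming scripts.
if len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

You can test it without Hadoop by piping a sample file through the map step, then `sort`, then the reduce step, for example `cat sample.txt | python max_temp.py map | sort | python max_temp.py reduce` (both file names are hypothetical). Once that works, pass the same two commands to the Streaming jar as the mapper and reducer.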
That’s it. If you have finished this course, give yourself a pat on the back. GREAT JOB !!! Leave us a comment about your experience and how we can improve the course for others who are just starting with Hadoop.
Learning Hadoop does not stop here. The next step is to learn some key tools from the Hadoop ecosystem.
WHAT’S NEXT? – TOOLS FROM HADOOP ECOSYSTEM
Here are the important tools from the Hadoop ecosystem that you should learn next. There are books available for each topic. Hadoop – The Definitive Guide has good chapters to start with on Pig, Hive, Sqoop, HBase. If you wish us to write a similar “course” for any of the tools you are interested in, please leave us a comment.
Concentrate on the tools in the order they are listed. Please note that this list is not comprehensive, but the tools below are must know tools.
- Hive – Pay close attention to optimizing Hive queries, both in the query itself and through structuring the data. Focus on differences such as Partitions vs Buckets. Also learn to use Hive with different file formats like Avro, RCFile and Sequence File. If you are a programmer, you should learn more about SerDes and UDFs.
- Impala – Once you have learned Hive, Impala is a walk in the park.
- Pig – Everything that is mentioned for Hive applies here.
- Sqoop – Sqoop is important because databases are unavoidable in an IT organization, and any company that wants to invest in Hadoop will go to its databases as the golden source of data to begin with. Spend time understanding incremental imports and their modes.
The next logical progression is to learn about NoSQL databases. Each database is a big topic in itself, but any open Hadoop developer position will ask for some flavour of NoSQL experience. Here is our ordering of NoSQL in terms of popularity when it comes to job openings.