Are you interested in learning Hadoop but did not know where to start?
If the answer is yes, then this post will walk you step by step through the learning process.
Why we decided to write this post?
The answer to the question is quite simple. There are so many EXCELLENT books, blog sites, documentation, articles and websites on Hadoop. There is abundance of good information available in the internet free of cost but they are all scattered. This makes someone new to Hadoop overwhelmed and they get lost. Don’t feel bad that you feel lost trying to learn Hadoop, most new learners feel the same way. There is no step by step approach to guide new learners to learn and cherish this amazing technology. This post is aimed to break that road block. In this post we will walk you through step by step, day by day (for 7 days) and what you will learn each day and how to do understand concepts better without getting overwhelmed.
Note to someone who is new to Hadoop
If you are new to Hadoop, don’t think of this post as read once and done. Bookmark this page. Because you will be coming back to this page again in the next coming days. Think of this as a course rather than a article or “just” another blog post. So from here on we are going to call this post a “course”. Before you decide to spend 100s or 1000s of dollars in Hadoop training spend few days on what is laid out in this post (course) and give it a honest shot. You will be amazed how much you can learn in just 7 days.
And the best part is we are only going to spend 3 to 4 hours in a day for 7 days to learn this amazing technology.
We will answer one of the foremost common questions from Hadoop learners – Do I need to be a Java (or any language) programmer to learn Hadoop? The answer is, absolutely not. Hadoop ecosystem is massive and there are so many tools in the ecosystem which does not require Java or any programming background.
This post (course) is literally for anyone who wants and willing to learn this AWESOME technology.
WHAT YOU NEED?
You would need 3 to 4 hours each day for 7 days. If you have the focus you can do a lot of topics in a day and finish the course faster. However if you are totally new to the technology we suggest you to take it slow so that you can fully understand the concepts. Give yourself enough time to do the hands-on as well.
Hadoop – The Definitive Guide (3rd Edition) by Tom White
Hadoop – The Definitive Guide (3rd Edition) by Tom White is like the bible for Hadoop. Anyone who is experienced in Hadoop have read this book at least once. If you don’t have your copy, click on the below link to get it from Amazon. This a must have for this course. We will be using this book to study the concepts and for practice. All the references made on this course is from the Hadoop Definitive Guide book.
Why buy a book when you have so many materials online and for free?
Here are few reasons why –
- Yes, there is a lot of free Information available online but as we said they are scattered all over the place. No wonder new learners are getting lost.
- You don’t want to spend months and months trying to learn Hadoop. This book has a very good structure and will keep us focus throughout the learning process.
- It’s a great book and it is certainly worth the investment.
If I need the book, why do I have to follow this day by day course? Why can’t I learn it straight from the book?
Excellent question. Hadoop – The Definitive Guide is an excellent book, it is well structured and goes to great depth in explaining Hadoop. That is why it is named rightfully as the “guide” and it is one of the books you need to have in your bookshelf if you are into distributed systems.
In this course, we won’t be asking you to go through the book chapter by chapter. We will skipping some topics and even some chapters. We will even cover topics in a different order than how it is listed in the book. Why? For new learners this book can be intimidating and if you are just starting with the technology there is no need to study the book cover to cover. We will pick the topics that helps us understand the technology without losing any key information. Once you are done with the course you will be in a great position to pick up topics where you need more help and would like to know based on the need.
P.S. All topics in the book are important and carefully structured. We will be skipping some topics from the book in this course and this is not to say that the topic is not important. The goal of this course is to give you a head start with Hadoop and the intention is not to restructure the book. You should come back to topics that we skipped on an as needed basis.
Cloudera (or any distribution) VM
Just like you can not learn how to ride a bike just by reading a book, you can not understand Hadoop fully just by reading a book or reading this course. You to need to practise to get the most out of this course.
Don’t spend time setting up a Hadoop cluster by yourself or using cygwin in your windows laptop. We see lot of new learners do this and it is a waste of time. Spend time in learning and understanding the technology and then you will be in a better position to set up a cluster of your own with very little effort.
If we are not setting up a cluster, how can we practise? Again, great question. Most companies offer Hadoop distribution offers a Hadoop Sandbox with Hadoop and all necessary Hadoop components pre installed. This is a great way to start because it will give you an excellent head start. You can download the sandbox for any of the vendors. Here is the Quick Start VM link for CDH 5.2 from Cloudera.
ABOUT THIS COURSE
Now you know what you need to start with this course. This is 7 day course. Meaning you will be learning a topic each day. So before you start with Day 1 read out what is listed in each day for 7 days and plan ahead. We have suggested the number of hours needed for each day and this will help you to plan in advance. Once you start with Day 1, stick with your plan and finish it in 7 days straight. Don’t prolong the course. Remember to practise. The course is not complete without practise. To help you to make sure that you get the most out of the course, we have included questions that you should try to answer before moving on to the next day. Answering the questions satisfactorily will give you the confidence to move on to the next day and will also identify the areas for improvement.
Now lets gets started !!!
DAY 1 – WHY HADOOP? WHY HADOOP? WHAT IS HADOOP?
No of hours needed – 1 hour
Go to Chapter 1: Meet Hadoop (Hadoop – The Definitive Guide book)
This chapter talks about Hadoop in a high level and explains the Big Data problem. The most valuable lesson from this chapter is the comparison of Hadoop with other systems like RDBMS or traditional distributed processing systems. Don’t expect to learn about HDFS and MapReduce yet.
Start reading “Data!” section. This will give you some examples of real world Big Data scenarios. for eg. Facebook hosts approximately 10 billion photos, talking up one petabyte of storage. You probably heard those scenarios 100 times already. If you feel that way, don’t waste your time move on to “A Brief History of Hadoop”.
“A Brief History of Hadoop” talks about the problems faced by Doug Cutting (cofounder of Hadoop) and Google’s solution to their Big Data problems. Timeline of Hadoop and its evolution. This is a very quick read.
Go to “Data Storage and Analysis”. This section talks about the problems with traditional storage and how the disk access speeds affects processing times. There are mentions of HDFS and MapReduce in this section but there are dedicated chapters on HDFS and MapReduce.
Now move on to “Comparison with Other Systems” section. This is a very important section as it compares Hadoop with RDBMS, Grid and Volunteer computing. Every system has its pros and cons and Hadoop is no exception. Hadoop is good at doing something and not so good in few things. Try to understand those key aspects.
Ignore the “Hadoop Releases” as the book calls out specifically in each case when it is referencing an older release.
That is it, you are done with Chapter 1 and yes we skipped few topics and that is OK. This book is like a reference guide for Hadoop. We can always come back to it on topics on an as needed basis.
TEST YOUR SELF
- What is the need for Hadoop when you have traditional RDBMS?
- Compare Hadoop vs RDBMS
- What are the technical difficulties when you are dealing with BIG datasets?
- What are the areas where Hadoop not so good at?
Day 1 is light if you have the energy and intrigued by the technology and can’t wait then move on to Day 2 right away.
Day 2 – HDFS
No of hours needed – 3 hours
Go to Chapter 3: “The Hadoop Distributed Filesystem” (Hadoop – The Definitive Guide book)
We are not entirely skipping MapReduce (chapter 2), we have just switched the order of the chapters.
Start with the introduction and “The Design of HDFS” section. Then move on to “HDFS Concepts”
When you read about “Blocks” make sure why the Block size is so large in Hadoop as compared to the 4 KB or so cluster size in traditional file systems. Understand the significance of blocks. Understand how blocks help with sequential disk reads.
Move on to “Namenodes and Datanodes”. Understand the importance and functionalities of the Namenode and the Datanode. Skip “HDFS Federation” and “HDFS High-Availability”
At this point you should know how HDFS offers fault tolerance and scalability. Move on to “The Command-Line Interface” section. Now it is time to get hands-on with some HDFS commands. Try HDFS commands in Cloudera VM. The book has listed only few commands at this point. Try more commands from the documentation
Now you understand what is HDFS and we know how to interact with HDFS using shell commands. Now it is time to understand the internals of HDFS read and write. Go to “Data Flow” section and read about “Anatomy of a File Read” and “Anatomy of a File Write”. Make sure you understand what happens behind the scenes when you read and write files to HDFS.
Go to Chapter 4: Hadoop I/O and read just the Data Integrity topic. Skip Compression and other topics under that chapter. We will get to those later.
That is it, you are done with HDFS chapter (for now). We have skipped a lot of sections but we have learned what we need at this point about HDFS. We will be coming back to this chapter at a later stage.
- How is HDFS different from other File Systems?
- Why the block size for HDFS is huge compared to traditional file systems?
- How do you get fault tolerance and reliability with HDFS?
- What is the purpose of Namenode?
- Did you try out commands?
DAY 3 – MAPREDUCE PART-1
No of hours needed – 4 hours
Now we know a lot about HDFS, we are in an excellent position to learn about MapReduce.
Go to Chapter 2: MapReduce (Hadoop – The Definitive Guide book)
Tom White has chosen to use the Weather Dataset to teach MapReduce and we feel it is a great and easy way to understand and illustrate as compared to the Word Count example. Java knowledge is essential for this chapter. If you are not a programmer, still go over this chapter. Without this chapter it is very hard to understand most of what Hadoop can offer.
Go over all the following topics and all their corresponding sub topics – A Weather Dataset, Analyzing the Data with Unix Tools, Analyzing the Data with Hadoop and Scaling Out. Once you are done with reading those topics you should understand the basics of MapReduce. Don’t try to run a MapReduce job yet. Lets understand what’s goes on behind the scenes when you run a MapReduce job first. Why? Because if the job fails you will be in a better position to troubleshoot knowing the processes that are involved in a MapReduce execution.
Go to Chapter 6: How MapReduce Works in Hadoop – The Definitive Guide book.
Read and understand the topics and theirs corresponding subtopics – Anatomy of a MapReduce Job Run, Failures.
Now run the MapReduce job in the Cloudera VM. Follow the output, check the Application Master UI and make sure the job executed successfully.
Now we have covered the basics of both HDFS and MapReduce. At this stage you should be in a position to comprehend what Hadoop can offer.
- What is the input to Mapper?
- How does a Reducer gets its input?
- What is Job Tracker & Task Tracker?
- What is Resource Manager, Application Master and Node Manager?
- YARN vs. Classic MapReduce
- How failures are handled in both Classic MapReduce and YARN
- Did you try to run a MapReduce program in Sandbox?
DAY 4 – MAPREDUCE PART-2
No of hours needed – 3 hours
Now we know the basics of MapReduce, it is time to go a little deep and learn a few more concepts. As rightfully said in the book, Shuffle and Sort is the heart of MapReduce and is where the “magic” happens. So lets go to Chapter 6 and go over the topic Shuffle and Sort. In the same chapter go over the all the sub topics under Task Execution. Pay close attention to Speculative Execution and Task JVM Reuse as they are usually asked about topic in the interviews.
Go to Chapter 5: Developing a MapReduce Application and read the full chapter.
Go to Chapter 8: MapReduce Features and read all the topics. This chapter could be a little intense for some learners. Join is a pain to implement in MapReduce and most of the time Joins are performed using tools like Pig and Hive. Make sure to understand Distributed Cache as it is used in optimizing joins.
- What are optimizations you can do in the shuffle phase?
- What is speculative execution?
- What is distributed cache and how can it be used with Joins?
- Difference between Map and Reduce side joins
DAY 5 – File Formats
No of hours needed – 4 hours
Go to Chapter 7: MapReduce Types and Formats. In this chapter pay more close attention in understanding the role of Partitioner, Input Splits and Records. Hadoop by design can handle a lot of file formats. This chapter will list out some of the most commonly used formats. Understand how each format is different and how each format will be used and move on.
Go to Chapter 4: Hadoop I/O. Go to Compression and understand the issues with Compression and MapReduce. Then go to File-Based Data Structures and read about Sequence and Map file.
Don’t worry about writing custom formats, Writables and Avro. Come back to those topics once you covered and understand all the basics and on an as needed basis.
- How are number of Mapper calculated for a job and what about number of reducers?
- Input Splits vs Blocks
- Input Splits and Records
- How to control the size and contents of Input Splits?
- What is the problem with small files problem in Hadoop and how to tackle it?
- How to deal with multiple input files in a MapReduce job?
- Run a MapReduce program against a bunch of smaller files (less than block size) and check out the number of mappers invoked
- Compress all the smaller files into one big file and run a MapReduce program against it. Now check out the number of mappers invoked
- Convert the small files into one Sequence file with Block level compression and run a MapReduce program against it. Now check out the number of mappers invoked
DAY 6 – ADMINISTRATIVE ELEMENTS
No of hours needed – 3 hours
Namenode is the most important node in your Hadoop cluster. So lets go some key configuration topics to help us deal with Namenode failure.
Go to Chapter 10: Administering Hadoop and cover the HDFS topic. This will explain all about Secondary Namenode and its functions.
Go back to Chapter 3: The Hadoop Distributed Filesystem and read HDFS Federation and HDFS High-Availability topics. These topics are primarily to protect your cluster during Namenode failures.
Go to Chapter 9: Setting Up a Hadoop Cluster and read the following 2 topics – Security and Benchmarking a Hadoop Cluster.
Go to Chapter 6: How MapReduce Works and read Job Scheduling and different types of scheduling options available in the recent versions of Hadoop.
There are several ways to configure and stand up a Hadoop cluster. If you are configuring a CDH cluster the easiest way to use Cloudera Manager on AWS. We will write a separate step by step guide to configure a AWS cluster in the coming days.
- What is the difference between Fair and Capacity scheduler?
- What is the function of the secondary Namenode?
- What can you do to protect yourself from a Secondary Namenode failure?
- Explain checkpointing process done by Secondary Namenode
- Benchmark Namenode
- Configure Capacity scheduler
DAY 7 – HADOOP FOR NON JAVA PROGRAMMERS
No of hours needed – 3 hours
Hadoop is not only for Java programmers. Non Java programmers can also use Hadoop and below topics will describe tools designed non Java programmers and inter program operability. Still go through this topic even if you are a Java programmer.
Go to Chapter 2: MapReduce and start with Hadoop Streaming and the sub topics underneath it
Go to Chapter 4. Hadoop I/O and start with Avro and the sub topics underneath it
- Run a MapReduce program using Streaming
- Create an Avro file using Java and read the file using Python
That’s it. If you have finished this course, give yourself a pat in the back and GREAT JOB !!! Comments us your experience and how we can improve the course for others who are just starting with Hadoop.
Learning Hadoop does not stop here. Next step is to learn some key tools from the Hadoop ecosystem.
WHAT’s NEXT? – TOOLS FROM HADOOP ECOSYSTEM
Here are the important tools from the Hadoop ecosystem that you should learn next. There are books available for each topic. Hadoop – The Definitive Guide has good chapters to start with on Pig, Hive, Sqoop, HBase. If you wish us to write a similar “course” for any of the tools you are interested in, please leave us a comment.
Concentrate on the tools in the order it is listed. Please note that this list is not comprehensive. But below tools are must know tools.
- Hive – Pay more close attention to optimizations to Hive queries and also through structuring the data. Focus on differences between Partitions vs Buckets etc. Also learn to use Hive with different file formats like Avro, RCFile, Sequence File. If you are a programmer you should learn more about SerDe and UDFs.
- Impala – When you learn Hive, Impala is a walk in the park.
- Pig – Everything that is mentioned for Hive applies here.
- Sqoop – Sqoop is important because database in an IT organization is unavoidable and any company who wants to invest in Hadoop will go to their database as their golden source of data to begin with. Spend time understanding incremental imports and its modes.
Next logical progression is to learn about NoSQL databases. Each database is big topic in itself but any open Hadoop developer positions will ask for some flavour of NoSQL experience. Here is our ordering of NoSQL in terms of popularity when it comes to job openings.
Finally, share the joy !!! Share it with your friends with the below social icons.
In Part-1 of How to prepare for Hadoop Interview? series we talked about what you will need to know and what are the areas you would get questions from in a Hadoop interview. In this post we are going to dive deeper and talk about each specific area and what to focus on each area.
This is absolutely the post you want to read if you have a Hadoop interview in couple of days and you are not sure where to start. Focus on the topics in the order given below. At the end we have also given weightage points for each topic so that you understand which topic you should pay more attention to. So here we go…
In any interview you should be able to answer the basic questions correctly. Difficult questions are by nature difficult to answer so you don’t get points when you don’t answer a difficult question but you certainly won’t LOSE points when you fail to answer a difficult question. But you will sure LOSE points when you fail to answer a basic question. The point we are trying to make is don’t screw up on basic conceptual questions. It will for sure have adverse effect on your interview so make sure you are strong in your basics and conceptual understanding of the various tools and technologies you used.
By the way, before you proceed make sure you have a stellar resume. Check out this post
1. YOUR PROJECT
“Tell me about your most recent Hadoop project” is the first question you will be asked. How you are going to answer this questions is going to steer the rest of the interview. We can not stress this enough. Prepare and practise a very good answer for this question. We know what you are thinking, “I work in a Hadoop environment day in and day out, why do I have to prepare an answer in advance for this question?” The honest answer is most people don’t give a very good answer to this question and struggle to give show the interviewer a big picture on why they are using Hadoop and how Hadoop is solving their problem. Don’t just say we are using Hadoop because we have to deal with terabytes of data. A well designed Oracle database can handle terabytes of data very well. Explain a very good use case and walk the interviewer through the use case showing him the challenges and how you solved them using Hadoop.
When you explain a use case from your project make sure to point out the tools used and how it was used. You can break down the tools in terms of data ingestion, data analysis etc. You can expect some follow up questions regarding your environment, architecture, tool choices etc. So know a lot about your environment. For eg. YARN setup, HA configuration, security setup etc.
Now you have made an excellent first impression by answering the “intro” question – “Tell me about your most recent Hadoop project”. Interviewer is impressed with your answer and would like to check out your technical skills. So here come the technical questions and how to prepare for them.
When you prepare for the interview, focus on the most basic questions first on any given topic. You most likely will not hear the question “What is HDFS?” it is a given that you know answer to this question. So start focussing on the important concepts. Here are some of the most frequently asked questions in HDFS. These questions will give you an idea of what to prepare when it comes to HDFS.
- How HDFS is different from other file systems?
- Why HDFS has a huge block size as opposed to a block size of 4 KB as seen in many other traditional file systems?
- What happens when one of the replica nodes failed during a write operation?
- How does HDFS deal with corruption?
- What is data locality, network topology considerations etc.?
Now you refreshed on the basic concepts. It is now time to refresh on the basic HDFS commands and also look at admin commands like fsck, dfsadmin etc. Now you have prepared for all the basic questions in HDFS it is time to prepare for the advanced concepts like
- Internal workings of the checkpointing process
- Need for Secondary Namenode
- Namenode Single Point Of Failure (SPOF)
- High Availability etc.
If you followed the above you should have a very good idea of what to prepare with respect to HDFS. After completing the preparation on HDFS move on to MapReduce.
Interviewers love MapReduce because it gives them an opportunity to ask lot of tough questions to validate the interviewee’s knowledge of Hadoop framework. Why so much emphasis on MapReduce you ask? Because it is the heart of Hadoop and this is where the MAGIC happens.
Explain Shuffle Phase. We often wonder why 8 out of 10 interviewees struggle to answer this question. Either they don’t understand the what happens during Shuffle or they know it but can’t explain it very well. Either reasons are bad. As simple and obvious as it sounds, start your MapReduce preparation with a clear conceptual understanding of the different phases in MapReduce. By clear conceptual understanding we mean “know inside and out”.
We often hear interviewees who struggle to explain MapReduce phases say something like “I work on Pig and Hive and so I am not that familiar with the internal workings of MapReduce”. Not a good answer and completely unacceptable. Our response would be something like “How can you write a efficient and optimized Pig instructions or Hive queries if you don’t understand the phases or what is going on behind the scenes in MapReduce?
So make sure you understand what goes on in Map, Reduce, Shuffle and Combiner phases. Make sure to understand how output (key value pairs) from the Mapper are passed to Reducer. There are several classes involved in a MapReduce program – InputFormat, RecordReader, OutputFormat, Partitioner etc. Know the functions and details of each of those classes and the role they play in a MapReduce program.
Some interviewers would give you a problem and ask you to explain how you would solve the problem using MapReduce. For some developers visualizing a problem in MapReduce is difficult. Try this simple problem – Explain how would you solve a classic GROUP BY sql operation in MapReduce. To answer this you have to explain what would be key and value that would be sent to the Mapper and will be outputted from the Mapper. How the Key Value pairs will be end up in its corresponding reducers and finally what happens in the Reducer.
Pay very close attention to the shuffle process and make sure you understand step by step on what happens to the Key Value pairs that are emitted from Mapper and how it gets to its corresponding Reducers. There is so much detail in the Shuffle process. Shuffle process is where the “magic” happens in MapReduce and and there a lot of ways to optimize it. Read about compression with MapReduce, tuning and optimizations to MapReduce, development and debugging process when you write a MapReduce program.
If there is only one book you want to read in Hadoop, read Hadoop: The Definitive Guide by Tom White. It has absolutely everything that is listed above with respect to basic and advanced concepts of HDFS and MapReduce.
4. PIG & HIVE
Most of the companies use Pig or Hive or other similar tools for their MapReduce processing in the cluster. Some interviewer will ask you to write Pig Latin instruction or Hive QL queries. So go over the syntax for Pig and Hive. Hive query comes very natural to most of us because it is SQL based. Where as a little more effort is involved in writing Pig instructions. Pig instructions are very simple and it is just a matter of remembering the syntax so when you are asked to write you are able to do so. Pay close attention to instructions that look alike but different in its functions for eg. SORT BY vs ORDER BY in Hive. Check out our sample Hadoop Interview questions for answer.
Hive is rich in its functionalities when compared to Pig. Make sure to understand the key concepts in Hive like Managed table vs External table, Partitions, Buckets, SerDe etc.You might get questions to design Hive tables with performance and efficiency in mind.
Give more emphasis to JOIN operations in both Pig and Hive. Join is a difficult to implement operation in plain MapReduce but Pig and Hive makes it easier to join datasets. Focus a lot in understanding the optimization techniques and hints that comes with JOIN operations.
Understand and read about different file formats and how you would use Pig and Hive to work with those file formats. Most common file formats include Sequence File, Map File, AVRO and RCFile.
Both Pig and Hive give the ability for programmer to extend functionalities using User Defined Functions (UDFs). If you are Java programmer and if you are applying for a position which require Java programming skills, you might be asked to write User Defined Functions or LOAD or STORE functions.
Here are the 2 books to start with Pig and Hive that we recommend. Programming Pig by Alan Gates and Programming Hive by Edward Capriolo
5. TROUBLESHOOTING & DEBUGGING
A good Hadoop interview is not over until you hear troubleshooting and debugging related questions. Hadoop cluster is a complicated environment and involves lot of nodes working together and this means lot of potential for things to go wrong and this mean as a developer you should be prepared for it. Consider the below two questions.
- How do you troubleshoot a failed or slow running job?
- How do you optimize a slow running job?
Both are open ended questions and there is no straight correct answer for both. They are more of a scenario based and the answer depends on the scenario. As a developer you could be working on plain MapReduce programming or Pig Instructions or Hive queries on a day to day basis and there are several procedures or steps you can implement to optimize performance with each tool.
- Learn about compression and the places where you can enable compression to optimize performance
- Learn about how the number of mappers and reducers might affect performance
- Learn about how you can change the memory settings of Mappers and Reducers can affect the performance in a positive way or in a negative way
- Learn about how you can change the memory settings for Mappers and Reducers in a MapReduce program, Pig and Hive
- Learn and understand how common instructions in Pig and queries in Hive gets translated in to MapReduce programs. Knowing this will help you in optimizing performance and fix failures.
- Learn about join optimizations in Pig and Hive
- Knowing your data is key and learn about how the file formats, structure of data can affect performance
- Learn about the shuffle process and learn about the memory settings that affects performance with in the shuffle process
Take a look at How do you debug a performance issue or a long running job? question to get an idea on how to answer the question
Your cluster can be maintained in house or with one of the cloud service provider like Amazon Web Services (AWS), Google cloud etc. It is great if you are already using AWS and familiar with it. But if you are not familiar with AWS, it is important that you make yourself familiar with AWS. At minimal be familiar with the following.
- Elastic Compute Cloud (EC2)
- Elastic Block Store (EBS)
- Elastic Map Reduce (EMR)
- Identity and Access Management (IAM)
Why knowing about AWS is important? Not all companies, especially startups have the financial bandwidth to maintain a cluster in house. So they take advantage of the cost effective solutions offered by cloud service provides like Amazon, Microsoft etc. If you are not familiar with AWS, don’t worry. AWS is very easy to learn and try it out. Go to aws.amazon.com and go over EC2 and EMR tutorials. You will be surprised to see how running a MapReduce program or setting up a cluster with EC2 instances is super simple.
Please note there will some cost involved in using AWS or any cloud services. Also note there are many cloud service providers but we stress on AWS is because, AWS is widely used in the industry for Hadoop implementations.
7. NoSQL DATABASES
A good knowledge and understanding about NoSQL databases like HBase, Cassandra, MongoDB is definitely a plus. It doesn’t matter which NoSQL databases you are trying to learn, use below steps to guide your learning process
- Understand CAP theorem. By far this article is the best there is in explaining CAP theorem so start with that.
- Check which principles from the CAP theorem your database conforms to.
- What is the level of consistency your database offers.
- Learn about column oriented vs row oriented designs.
- Understand the pros and cons of both column and row oriented designs.
- Check what is the design (row vs column) adopted by the database you are learning.
- Learn about the database architecture. The type of nodes involved, how data is partitioned etc.
- Learn about the read path.
- Learn about the write path.
- Learn about the underlying file formats like HFile in case of HBase and temporary memstores or cache involved and how they are being used in read or write path.
- Learn about compaction or asynchronous checkpointing process.
- Focus on single points of failure.
- Finally, try it out. Download a virtual image or set up a cluster to try out commands, or use the API to perform create, insert and select operations in the database.
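The row-versus-column distinction in the steps above can be made concrete with a small sketch. The example below is not tied to any particular database; the records and field names are made up purely for illustration:

```python
# Contrast row-oriented and column-oriented layouts of the same records.
# (Illustrative data only; real databases add encoding, compression, etc.)

records = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob",   "age": 25},
    {"id": 3, "name": "carol", "age": 35},
]

# Row-oriented: all fields of one record are stored together.
row_store = [(r["id"], r["name"], r["age"]) for r in records]

# Column-oriented: all values of one column are stored together.
col_store = {
    "id":   [r["id"] for r in records],
    "name": [r["name"] for r in records],
    "age":  [r["age"] for r in records],
}

# An analytical query such as "average age" touches only one column
# in the column store, but has to read through every row in the row store.
avg_age_row = sum(row[2] for row in row_store) / len(row_store)
avg_age_col = sum(col_store["age"]) / len(col_store["age"])

assert avg_age_row == avg_age_col == 30.0
```

This is why column-oriented designs tend to favor analytical scans over a few columns, while row-oriented designs favor reading or writing whole records at a time.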
8. OTHER TOOLS
The Hadoop ecosystem is constantly evolving, and at a rapid pace. There are so many tools out there that it is very unrealistic for anyone (an interviewer included) to expect someone to know about all the tools available. If time permits, check out the tools below.
- Cloudera Hue
- Cloudera Impala
- Apache Spark
- Apache Storm
- Apache Kafka
- Amazon Kinesis
This is by no means a comprehensive list but it gives you an idea.
To give you an idea of which topics deserve more attention, time and effort, here is our weighting of each topic.
- Your Project – 10%
- HDFS & MapReduce – 25%
- Pig & Hive – 25%
- Troubleshooting & Debugging – 15%
- AWS – 10%
- NoSQL Databases – 10%
- Other Tools – 5%
Need more help? Our Hadoop Developer Interview Guide has over 100 REAL questions from REAL interviews, and you can get the guide for $29.99.
First impressions are everything. Making a good first impression is key for interviews, and Big Data or Hadoop interviews are no exception to this rule. For example, let's say you come across an open position for a Hadoop Developer at Apparel Inc. You are super excited about the position, and you feel that the roles and responsibilities in the job requirement match exactly what you are doing on a day to day basis. You feel confident, you apply to the job, and the Hadoop project in your resume looks something like the below.
Client: Apparel Inc., Dec 2013 – Till Date
Hadoop Developer
Responsibilities:
- Developed multiple MapReduce jobs in java for data cleaning and preprocessing.
- Moving data from HDFS to RDBMS and vice-versa using SQOOP.
- Collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis
- Installed and configured Hadoop cluster in Test and Production environments
- Performed both major and minor upgrades to the existing CDH cluster
- Implemented Commissioning and Decommissioning of new nodes to existing cluster.
- Analyzing/Transforming data with Hive and Pig.
Now you are anxiously waiting for a call from Apparel Inc., or at least from a vendor. One day, two days, then two weeks pass by, and still no calls. Feeling rejected already, you post your Hadoop resume on major job sites, still get no interviews, and wonder why.
A resume with the above project description lacks quality and looks sloppy and, honestly, boring. Understand that the vendor or client doesn't know anything about you. Their decision to consider you for the Hadoop position, or even call you, is based solely on your resume and nothing else. Over the course of time we have seen several resumes, and believe it or not, about 8 out of 10 resumes are of poor quality and look something like the above.
Hadoop is so HOT in the market, and resume preparation is a one-time process; done right, your Hadoop resume will get noticed and picked up, and you will get calls non-stop. We don't want to leave you hanging any longer. The remainder of this post lists 5 actionable tips for things your Hadoop resume should have that will help you land the dream job, with the dream company, and the dream career you always wanted.
1. ABOUT CLIENT & USE CASE
An interviewer would like to know who you are working for and what you do, and more importantly, why you do what you do. Explain who your employer is. You may think everyone knows about your company, client or employer, but that may not be the case. This is true for any resume, not just a Hadoop resume. In a couple of lines, tell about your employer.
People who look at your resume need to know how and why you are using Hadoop in your project. Describe the project in detail and explain how Hadoop helps you solve your use case or the problem at hand, so that anyone who looks at your resume understands exactly why you are using Hadoop. Doing this is very important, and we cannot stress it enough. If you do this correctly, you are halfway to getting a phone call from a recruiter.
2. YOUR HADOOP ENVIRONMENT
A Hadoop environment is one in which hundreds of nodes work together in harmony; it is like a well-organized orchestra if you think about it. It is very important to give details about your cluster: number of nodes, data volume, distribution used, tools used and their versions, node configuration, special configurations like High Availability (HA), details about cloud services like AWS, etc.
Why is this important? It shows the interviewer or hiring manager the expertise you have, and from the size of the cluster and the volume of data you dealt with, they can gauge your experience in debugging and troubleshooting issues. It also shows that you were more involved in the project than just working on a single tool like Pig or Hive in a Hadoop cluster. It's all about how well you can communicate what you know.
3. ROLE YOU PLAYED
If you were a hiring manager, would you hire a Hadoop tester for a Hadoop developer position? You wouldn't. Look at this post to get an idea of what managers want to know about the role you played in your Hadoop project. Now imagine that you answer all those questions (at least in brief) in your Hadoop resume; every interviewer, recruiter, hiring manager and vendor would want to discuss the Hadoop openings they have, because you have demonstrated what you know very clearly in your resume.
Keeping your day to day activities in mind, you may think of yourself as just a developer. But you might have been involved in designing your cluster or installing and configuring it, or you might have done a Namenode restore once. Don't be shy or too modest. List all the things that you did and improve your chances of getting picked up by a recruiter. Many times we see resumes from referrals which are lousy and boring, but when we speak to the candidates we see their full potential: what they know about Hadoop and what they accomplish day to day. Their resumes clearly don't do justice to what they are doing and what they know. If they had not come through referrals, there is no way we would have given them a call for an interview, because of the lack of information in their resumes.
4. TOOLS YOU USED
If you are working in a Hadoop environment, you are most likely working on, or familiar with, more than one tool from the Hadoop ecosystem. List all the tools that you are familiar and knowledgeable with. We are not suggesting that you add every tool in and around the Hadoop ecosystem to make your resume a powerhouse; keep in mind that you will get questions about the tools you mention in your resume.
Tip: when you list the tools, order them in the natural order of their function and usage.
- Start with data ingestion tools like Flume, Sqoop, etc.
- Then list the data transformation and analysis tools like Pig or Hive.
- Be sure to mention file formats like SequenceFile, RCFile, etc. at this point.
- Next come the coordination or orchestration tools like Oozie.
- Mention the tools used for troubleshooting and debugging.
- Then list the tools used for cluster management, like Cloudera Manager or Apache Ambari.
- List the tools used for security, e.g. Kerberos or Apache Sentry.
- BI tools like Tableau and Kibana come next, to give a perfect wrap to the list of tools you are familiar with.
- Don't forget to mention NoSQL databases like HBase, Cassandra or MongoDB.
- Also mention your experience with cloud services like EC2, EMR, etc.
We are not suggesting you mention every single tool listed above. No one is, or is expected to be, an expert in all the tools in the Hadoop ecosystem. Specify only the tools you work with on a day to day basis and are very familiar with.
5. DEBUGGING, TROUBLESHOOTING & OTHER IMPORTANT STUFF
As a Hadoop developer, administrator or architect, you know managing a Hadoop cluster is no easy task. More nodes and processes mean more chances for failure. When we interview candidates for key positions, we are most interested in the issues they ran into in the past and how they addressed them. The answer to this question says a lot about their expertise. Quite honestly, the more issues you have seen, the more of an expert you become in a specific area. So make sure to point out the troubleshooting and debugging you have done on your Hadoop cluster.
Finally, if you have done one-off work that is key, make sure to call it out in your resume. Here are some examples. You might have:
- Configured High Availability in your cluster
- Configured security with Kerberos
- Restored a failed Namenode
- Migrated your cluster from one datacenter to another
- Been involved in a version upgrade
- Implemented and configured the FAIR scheduler
At the beginning of the post we showed a section of a resume which we said was of poor quality. This post is not complete without showing a good-quality one. Again, this is just to give you an idea; there is no one-size-fits-all approach to creating a Hadoop resume.
Client: Apparel Inc., Dec 2013 – Till Date
Hadoop Consultant
Apparel Inc. is a high-end fashion online retailer with a global online presence, currently holding about 20% of the market share in the high-end fashion sector. Apparel Inc.'s website and mobile application see close to a million hits every day. One of the key things we focus on at Apparel Inc. is providing a unique and personalized customer experience when the user shops on our website or mobile application, which means understanding the customer's likes and dislikes and shopping patterns is key. At Apparel Inc., we collect and analyze large amounts of customer data 24×7 from several data points: the website, mobile apps, the credit card program, the loyalty program, social media and coupon redemptions. Data from these data points can be structured, semi-structured and, in a few cases, unstructured. All this data is collected, aggregated and analyzed in the Hadoop cluster to find shopping patterns, learn customer preferences, identify cross-sell or upsell opportunities and devise targeted marketing strategies, as a result improving the overall user experience on the website.
Responsibilities:
- Worked on a live 60-node Hadoop cluster running CDH 4.4
- Worked with highly unstructured and semi-structured data, 90 TB in size (270 TB with a replication factor of 3)
- Extracted the data from Teradata into HDFS using Sqoop.
- Created and ran Sqoop (version 1.4.3) jobs with incremental load to populate Hive external tables.
- Extensive experience in writing Pig (version 0.11) scripts to transform raw data from several data sources into forming baseline data.
- Developed Hive (version 0.10) scripts for end user / analyst requirements to perform ad hoc analysis
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping and aggregation and how they translate to MapReduce jobs.
- Developed UDFs in Java as and when necessary to use in PIG and HIVE queries
- Experience in using Sequence files, RCFile, AVRO and HAR file formats.
- Developed Oozie workflow for scheduling and orchestrating the ETL process
- Implemented authentication using Kerberos and authorization using Apache Sentry.
- Worked with the admin team in designing and upgrading CDH 3 to CDH 4
- Good working knowledge of Amazon Web Service components like EC2, EMR, S3 etc
- Very good experience with both MapReduce 1 (Job Tracker) and MapReduce 2 (YARN) setups
- Very good experience in monitoring and managing the Hadoop cluster using Cloudera Manager.
- Good working knowledge of Cassandra
- Good Working knowledge of Tableau
Think of preparing a good Hadoop resume as a one-time investment. The time and effort you put into creating a quality resume will definitely pay off in landing the dream career in the Big Data field that you always wanted.
A couple of months ago, one of our friends applied for a Hadoop Developer consulting position at a famous pharmaceutical company. We will let you guess the company :-). The hiring manager did not even look at the resume (we think), but our friend received the below questionnaire by email. The hiring manager wanted the applicant to answer the questions before he would consider the candidate for the position.
SMART MANAGERS ASK SMART QUESTIONS
It's quite a unique request, but we see the hiring manager's point. He wanted to filter out as many candidates as possible and was only willing to speak with candidates qualified for the position. It saves both his time and the candidates' time.
Very Smart !!!
The other hidden expectation is that very good answers from a candidate not only show that the candidate is qualified but also that he is really interested in the position and in working with the company.
Here are the Hadoop questions asked by the hiring manager:
- What is the biggest Hadoop cluster you have worked on?
- Give us the details of your cluster (node configuration, etc.).
- List your involvement in modeling, ingestion, transformations, aggregation and data access layer.
- What are the biggest challenges you have faced during implementation of Hadoop projects?
- What kind of HDFS file formats have you used?
- What were the selection criteria in deciding on the file formats, and what alternatives were considered?
- What is your experience with traditional data warehouse/mart development? Be specific about the activities you were involved in.
- Why did your organization choose to use Hadoop?
- Describe a complex problem you have solved in Hive, Pig or Java MapReduce?
- What is your selection process in selecting a tool from the Hadoop ecosystem?
- Describe one or two performance issues you have resolved in Hadoop?
- Explain your troubleshooting process.
That was very smart of the hiring manager and an excellent way to filter out candidates. This practice is not uncommon in the Hadoop world these days, so be prepared and don't take it lightly. The answers to such questions not only demonstrate your understanding of the technology but also show your genuine interest in the position, so spend some time giving good answers. Most candidates answer the questions very lightly and don't even get past this point.
P.S. Our friend answered all the questions very carefully, got an interview and was hired the following week. He is now happily working full time for this company and says it is the most rewarding job he has ever had.
Three years ago, only a handful of companies were using Hadoop. Now Hadoop has grown leaps and bounds, and so has its user base. Companies from almost all domains have started investing in Big Data technologies. Because of this, we see a steep increase in the number of openings in the Big Data or Hadoop space. With this increase in demand for Hadoop professionals, we also see Hadoop interviews maturing a lot: many complex, scenario-based, analytical questions are now asked in Hadoop interviews. It has become absolutely necessary for a beginner or an experienced Hadoop professional to be technically prepared before showing up to an interview.
HADOOP INTERVIEWS vs. OTHER INTERVIEWS
Hadoop interviews are very different from other technical interviews. If you are interviewing for a Java developer position, you will come across a lot of technical questions on Java and very little about the use case of your project and the infrastructure. In Hadoop interviews, equal emphasis is given to the use case, the infrastructure and the technical aspects of the Hadoop-related tools you used.
TELL US ABOUT YOUR RECENT HADOOP PROJECT – GO…
The first question you will most likely hear in a Hadoop interview asks you to talk about how you used Hadoop in your most recent project. How you answer this question says a lot about your understanding of the Hadoop framework. Have good answers ready to explain why your client or company decided to use Hadoop to solve key issues or use cases. Give a concrete use case, explain why it was harder to solve the problem using a traditional approach like an RDBMS, and explain how you solved it using Hadoop.
KNOW ABOUT YOUR CLUSTER
Know about your cluster. Interviewers are very interested in your cluster and how it was managed. You will also hear questions about cloud services like AWS and your familiarity with them. It is a good idea to prepare for this question in advance: make sure you know the number of nodes, data volume, replication factor, block size, number of mappers and reducers per node, memory allocated to the mappers and reducers, distribution used, etc.
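As a sketch of the kind of back-of-the-envelope numbers worth having ready, here is a small example. The data volume, replication factor and block size below are illustrative (borrowed from the sample resume earlier in the post), not prescriptive:

```python
# Back-of-the-envelope cluster numbers an interviewer may ask about.
# All figures are illustrative examples, not recommendations.

raw_data_tb = 90          # logical data volume stored in HDFS
replication_factor = 3    # the HDFS default
block_size_mb = 128       # a common HDFS block size

# Raw disk consumed, including all replicas.
storage_with_replication_tb = raw_data_tb * replication_factor

# Approximate number of HDFS blocks (1 TB = 1024 * 1024 MB).
num_blocks = raw_data_tb * 1024 * 1024 // block_size_mb

print(storage_with_replication_tb)  # 270
print(num_blocks)                   # 737280
```

Being able to walk through arithmetic like this (90 TB of data becomes 270 TB on disk at replication factor 3, spread over roughly 737 thousand blocks) signals that you actually understand how your cluster stores data, not just which tools run on it.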
TOOLS & TRICKS
Next comes your role in the project. Questions like "Were you involved in the initial design and sizing of your Hadoop cluster?" are common even if you are not applying for an architect position. Then comes the list of Hadoop ecosystem tools you are familiar with; at this point, technical questions start to trickle in. Scenario-based questions are very common here: often a question is based on a scenario or problem the interviewer faced in the past, and they will be interested to see how you solve it.
TROUBLESHOOTING & OPTIMIZATIONS
Troubleshooting and optimization questions are very common in Hadoop interviews. Most interviewees unfortunately fail at this point, not because they are unfamiliar with or have never worked on troubleshooting and optimization problems, but because of their lack of preparation on the topic. Take a look at the sample questions and answers from our Hadoop Developer Interview Guide to get an idea of what is asked in Hadoop developer interviews.
From our experience, when interviewees try to come up with answers on the fly, it usually doesn't work out. Practicing and preparing your answers to challenging questions before your Hadoop interviews is key to success.
Good Luck !!!
This is a two-post series. Now that you have an idea of how to approach a Hadoop interview, go to How to prepare for a Hadoop Interview (Part 2), which will walk you step by step through what to prepare and how to structure your interview preparation process.