In Part 1 of the “How to prepare for a Hadoop interview?” series we talked about what you need to know and which areas the questions in a Hadoop interview come from. In this post we dive deeper into each area and what to focus on within it.
This is absolutely the post you want to read if you have a Hadoop interview in a couple of days and you are not sure where to start. Focus on the topics in the order given below. At the end we have also given a weightage for each topic so that you understand which topics deserve more attention. So here we go…
In any interview you should be able to answer the basic questions correctly. You don’t gain points when you miss a difficult question, but you certainly won’t LOSE points for it either. You will, however, LOSE points when you fail to answer a basic question. The point we are trying to make is: don’t slip up on basic conceptual questions. It will hurt your interview, so make sure you are strong in the basics and in your conceptual understanding of the tools and technologies you have used.
By the way, before you proceed, make sure you have a stellar resume. Check out this post.
1. YOUR PROJECT
“Tell me about your most recent Hadoop project” is the first question you will be asked. How you answer this question will steer the rest of the interview. We cannot stress this enough. Prepare and practise a very good answer for this question. We know what you are thinking: “I work in a Hadoop environment day in and day out, why do I have to prepare an answer in advance for this question?” The honest answer is that most people don’t give a very good answer and struggle to show the interviewer the big picture: why they are using Hadoop and how Hadoop is solving their problem. Don’t just say “we are using Hadoop because we have to deal with terabytes of data”. A well designed Oracle database can handle terabytes of data very well. Pick a strong use case and walk the interviewer through it, showing the challenges you faced and how you solved them using Hadoop.
When you explain a use case from your project, make sure to point out the tools used and how they were used. You can break down the tools by purpose, such as data ingestion and data analysis. You can expect follow up questions about your environment, architecture and tool choices, so know your environment well: for example, the YARN setup, the HA configuration and the security setup.
Now you have made an excellent first impression by answering the “intro” question, “Tell me about your most recent Hadoop project”. The interviewer is impressed with your answer and would like to check out your technical skills. So here come the technical questions and how to prepare for them.
2. HDFS
When you prepare for the interview, focus first on the most basic questions on any given topic. You most likely will not hear the question “What is HDFS?” because it is a given that you know the answer. So start by focusing on the important concepts. Here are some of the most frequently asked questions on HDFS. They will give you an idea of what to prepare.
- How is HDFS different from other file systems?
- Why does HDFS have a huge block size, as opposed to the 4 KB block size seen in many traditional file systems?
- What happens when one of the replica nodes fails during a write operation?
- How does HDFS deal with corruption?
- What are data locality and network topology considerations?
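The block size question in the list above becomes easier to reason about with some quick arithmetic. The sketch below counts how many blocks a 1 TB file needs at a 4 KB block size versus 128 MB. The 128 MB figure is an assumption here (it is a commonly used HDFS block size, and it is configurable). Since the Namenode tracks every block, fewer blocks means less metadata to manage.

```python
def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to store one file (the last block may be partial)."""
    return -(-file_size_bytes // block_size_bytes)  # integer ceiling division

KB = 1024
MB = 1024 * KB
TB = 1024 * 1024 * MB

file_size = 1 * TB
small_blocks = block_count(file_size, 4 * KB)    # traditional file system block size
large_blocks = block_count(file_size, 128 * MB)  # assumed HDFS block size

print(small_blocks)  # 268435456
print(large_blocks)  # 8192
```

In an interview you could pair this with the second half of the argument: large blocks also let a reader spend most of its time transferring data rather than seeking.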
Now that you have refreshed the basic concepts, it is time to refresh the basic HDFS commands and also look at admin commands like fsck and dfsadmin. Once you have prepared for all the basic questions on HDFS, it is time to prepare for the advanced concepts such as:
- Internal workings of the checkpointing process
- Need for Secondary Namenode
- Namenode Single Point Of Failure (SPOF)
- High Availability etc.
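To make the checkpointing topic above concrete, here is a toy model, not the real implementation. It treats the fsimage as a snapshot of the namespace and the edit log as a list of operations recorded since that snapshot; a checkpoint replays the log onto the snapshot and starts a fresh log. The operation names and the dictionary layout are hypothetical, chosen only for illustration; the real files are binary and far richer.

```python
def apply_edits(fsimage, edits):
    """Replay edit-log operations on a copy of the namespace snapshot."""
    namespace = dict(fsimage)
    for op, path, *args in edits:
        if op == "create":
            namespace[path] = args[0]          # args[0]: replication factor
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)  # args[0]: new path
    return namespace

# Snapshot taken at the last checkpoint (hypothetical layout: path -> replication).
fsimage = {"/data/a.txt": 3}

# Operations recorded since then.
edits = [
    ("create", "/data/b.txt", 3),
    ("rename", "/data/a.txt", "/data/c.txt"),
    ("delete", "/data/b.txt"),
]

new_fsimage = apply_edits(fsimage, edits)
edits = []  # after a checkpoint the log starts fresh

print(new_fsimage)  # {'/data/c.txt': 3}
```

The longer the edit log grows without a checkpoint, the longer a restart takes, because every operation has to be replayed. That is the usual motivation given for having a separate node perform this merge periodically.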
If you followed the above you should have a very good idea of what to prepare with respect to HDFS. After completing the preparation on HDFS move on to MapReduce.
3. MAPREDUCE
Interviewers love MapReduce because it gives them an opportunity to ask a lot of tough questions that validate the interviewee’s knowledge of the Hadoop framework. Why so much emphasis on MapReduce, you ask? Because it is the heart of Hadoop and this is where the MAGIC happens.
“Explain the Shuffle phase.” We often wonder why 8 out of 10 interviewees struggle to answer this question. Either they don’t understand what happens during the Shuffle or they know it but can’t explain it very well. Neither is good. As simple and obvious as it sounds, start your MapReduce preparation with a clear conceptual understanding of the different phases in MapReduce. By clear conceptual understanding we mean “know it inside and out”.
We often hear interviewees who struggle to explain the MapReduce phases say something like “I work on Pig and Hive and so I am not that familiar with the internal workings of MapReduce”. That is not a good answer and it is completely unacceptable. Our response would be something like “How can you write efficient and optimized Pig instructions or Hive queries if you don’t understand the phases or what is going on behind the scenes in MapReduce?”
So make sure you understand what goes on in the Map, Reduce, Shuffle and Combiner phases. Make sure you understand how the output (key value pairs) from the Mapper is passed to the Reducer. There are several classes involved in a MapReduce program, such as InputFormat, RecordReader, OutputFormat and Partitioner. Know the function and details of each of those classes and the role they play in a MapReduce program.
Some interviewers will give you a problem and ask you to explain how you would solve it using MapReduce. For some developers, visualizing a problem in MapReduce is difficult. Try this simple problem: explain how you would implement a classic SQL GROUP BY operation in MapReduce. To answer this you have to explain what key and value are sent to the Mapper, what key and value the Mapper emits, how the key value pairs end up in their corresponding Reducers and, finally, what happens in the Reducer.
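One way to practise the GROUP BY exercise is to simulate the three stages in plain Python before thinking about the Hadoop API. The sketch below mimics a query along the lines of SELECT department, SUM(salary) ... GROUP BY department. The table, column names and the functions mapper, shuffle and reducer are all hypothetical stand-ins; in a real job the framework performs the shuffle for you.

```python
from collections import defaultdict

def mapper(record):
    """Emit (group-by column, value to aggregate) for one input row."""
    department, salary = record
    yield department, salary

def shuffle(mapped_pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))  # reducers see keys in sorted order

def reducer(key, values):
    """Aggregate every value that shares a key."""
    return key, sum(values)

rows = [("sales", 100), ("eng", 250), ("sales", 50), ("eng", 150)]

mapped = [pair for row in rows for pair in mapper(row)]
result = [reducer(key, values) for key, values in shuffle(mapped).items()]

print(result)  # [('eng', 400), ('sales', 150)]
```

The key point to say out loud in the interview: the GROUP BY column becomes the Mapper's output key, which guarantees that all rows of one group meet in the same reduce call.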
Pay very close attention to the shuffle process and make sure you understand, step by step, what happens to the key value pairs emitted from the Mapper and how they get to their corresponding Reducers. There is a lot of detail in the shuffle process. It is where the “magic” happens in MapReduce, and there are a lot of ways to optimize it. Read about compression with MapReduce, tuning and optimization of MapReduce jobs, and the development and debugging process for a MapReduce program.
If there is only one book you want to read on Hadoop, read Hadoop: The Definitive Guide by Tom White. It has absolutely everything that is listed above with respect to basic and advanced concepts of HDFS and MapReduce.
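A small simulation can help when explaining how a key finds its Reducer. Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The sketch below reproduces that formula in Python under one simplifying assumption: keys hash like Java String objects made of ASCII characters (Hadoop's Text type hashes its bytes differently, so real partition numbers may not match). Each partition is then sorted by key, which stands in for the sort that happens before the Reducer runs.

```python
def java_string_hash(s):
    """Reproduce Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF       # keep 32 bits, like Java int overflow
    return h - 0x100000000 if h >= 0x80000000 else h

def partition(key, num_reducers):
    """Same formula as the default hash partitioner: clear sign bit, then modulo."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

map_output = [("b", 1), ("a", 1), ("c", 1), ("a", 1)]
num_reducers = 2

# Route every pair to a reducer, then sort each reducer's input by key.
partitions = {r: [] for r in range(num_reducers)}
for key, value in map_output:
    partitions[partition(key, num_reducers)].append((key, value))
for r in partitions:
    partitions[r].sort()

print(partitions)  # {0: [('b', 1)], 1: [('a', 1), ('a', 1), ('c', 1)]}
```

Note that both ("a", 1) pairs land in the same partition. That property, equal keys always go to the same Reducer, is what makes aggregation in the Reducer correct.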
4. PIG & HIVE
Most companies use Pig, Hive or similar tools for the MapReduce processing in their cluster. Some interviewers will ask you to write Pig Latin instructions or HiveQL queries, so go over the syntax for Pig and Hive. Hive queries come naturally to most of us because HiveQL is SQL based, whereas a little more effort is involved in writing Pig instructions. Pig instructions are simple, and it is just a matter of remembering the syntax so that you can write them when asked. Pay close attention to instructions that look alike but differ in function, for example SORT BY vs ORDER BY in Hive. Check out our sample Hadoop Interview questions for the answer.
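The SORT BY vs ORDER BY distinction is easy to demonstrate with a toy model. ORDER BY produces one total ordering over the whole result, while SORT BY only orders the rows within each reducer. In the sketch below, the round-robin assignment of rows to reducers is purely a stand-in chosen for illustration; it is not how Hive distributes rows.

```python
def order_by(rows):
    """ORDER BY: one total ordering over all rows (a single final reducer)."""
    return sorted(rows)

def sort_by(rows, num_reducers):
    """SORT BY: each reducer sorts only the rows it received."""
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        buckets[i % num_reducers].append(row)  # stand-in for the real distribution
    return [sorted(bucket) for bucket in buckets]

rows = [5, 1, 4, 2, 3, 6]

print(order_by(rows))    # [1, 2, 3, 4, 5, 6]
print(sort_by(rows, 2))  # [[3, 4, 5], [1, 2, 6]]
```

Each inner list is sorted, but reading the reducer outputs one after another does not give a globally sorted result. The trade-off to mention is that the total ordering funnels everything through one reducer, which can be slow on large outputs.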
Hive is richer in functionality than Pig. Make sure you understand the key concepts in Hive like managed vs external tables, partitions, buckets and SerDes. You might get questions that ask you to design Hive tables with performance and efficiency in mind.
Give more emphasis to JOIN operations in both Pig and Hive. A join is difficult to implement in plain MapReduce, but Pig and Hive make it easier to join datasets. Spend time understanding the optimization techniques and hints that come with JOIN operations.
Read about the different file formats and how you would use Pig and Hive to work with them. The most common file formats include SequenceFile, MapFile, Avro and RCFile.
Both Pig and Hive let the programmer extend their functionality using User Defined Functions (UDFs). If you are a Java programmer and you are applying for a position which requires Java programming skills, you might be asked to write User Defined Functions or LOAD or STORE functions.
Here are the two books we recommend for getting started with Pig and Hive: Programming Pig by Alan Gates and Programming Hive by Edward Capriolo.
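To see why join optimizations matter, it helps to compare the two basic strategies side by side. The sketch below is a simplified model with hypothetical users and orders datasets. The first function imitates a reduce-side join, where both inputs are grouped by the join key (which requires a shuffle). The second imitates a map-side join, where the small dataset is held in memory and the large one streams past it. Pig exposes the second idea as a replicated join and Hive as a map join.

```python
from collections import defaultdict

def reduce_side_join(users, orders):
    """Tag records by source, group them by join key, then pair them up."""
    grouped = defaultdict(lambda: {"users": [], "orders": []})
    for user_id, name in users:
        grouped[user_id]["users"].append(name)
    for user_id, amount in orders:
        grouped[user_id]["orders"].append(amount)
    return [
        (key, name, amount)
        for key in sorted(grouped)
        for name in grouped[key]["users"]
        for amount in grouped[key]["orders"]
    ]

def map_side_join(users, orders):
    """Keep the small table in memory and stream the big table past it."""
    lookup = dict(users)
    return [(uid, lookup[uid], amount) for uid, amount in orders if uid in lookup]

users = [(1, "ann"), (2, "bob")]                  # small dataset
orders = [(2, 30), (1, 10), (1, 20), (3, 99)]     # large dataset

print(reduce_side_join(users, orders))  # [(1, 'ann', 10), (1, 'ann', 20), (2, 'bob', 30)]
print(map_side_join(users, orders))     # [(2, 'bob', 30), (1, 'ann', 10), (1, 'ann', 20)]
```

Both produce the same inner join rows, only in a different order. The map-side version avoids the shuffle entirely, which is why it is usually the first optimization to consider when one side of the join fits in memory.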
5. TROUBLESHOOTING & DEBUGGING
A good Hadoop interview is not over until you hear troubleshooting and debugging questions. A Hadoop cluster is a complicated environment with a lot of nodes working together, which means there is a lot of potential for things to go wrong, and as a developer you should be prepared for it. Consider the two questions below.
- How do you troubleshoot a failed or slow running job?
- How do you optimize a slow running job?
Both are open ended questions and neither has a single correct answer. They are scenario based, and the answer depends on the scenario. As a developer you could be working with plain MapReduce programs, Pig instructions or Hive queries on a day to day basis, and there are several steps you can take to optimize performance with each tool.
- Learn about compression and the places where you can enable compression to optimize performance
- Learn about how the number of mappers and reducers might affect performance
- Learn how changing the memory settings of Mappers and Reducers can affect performance, positively or negatively
- Learn how to change the memory settings for Mappers and Reducers in a MapReduce program, in Pig and in Hive
- Learn and understand how common instructions in Pig and queries in Hive get translated into MapReduce programs. Knowing this will help you optimize performance and fix failures.
- Learn about join optimizations in Pig and Hive
- Know your data: learn how file formats and the structure of the data can affect performance
- Learn about the shuffle process and the memory settings that affect performance within it
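As one worked illustration of the shuffle related tuning in the list above, the sketch below counts how many key value pairs cross the shuffle in a word count, with and without a combiner. It is a toy model: the two strings stand in for input splits, and it assumes the combiner runs exactly once per mapper, which a real framework does not guarantee.

```python
from collections import Counter

def map_words(split):
    """Emit (word, 1) for each word in one input split."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Pre-aggregate one mapper's output locally before it is shuffled."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

splits = ["a b a a", "b b a b"]  # hypothetical input splits, one per mapper

without_combiner = [pair for split in splits for pair in map_words(split)]
with_combiner = [pair for split in splits for pair in combine(map_words(split))]

print(len(without_combiner))  # 8
print(len(with_combiner))     # 4
```

Half as many pairs are shuffled here, and the final counts are unchanged because addition can be applied in stages. A combiner is only safe when the reduce operation has that property.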
Take a look at the “How do you debug a performance issue or a long running job?” question to get an idea of how to answer this kind of question.
6. AWS
Your cluster can be maintained in house or with a cloud service provider like Amazon Web Services (AWS) or Google Cloud. It is great if you are already using AWS and are familiar with it. If you are not, it is important that you make yourself familiar with AWS. At a minimum, be familiar with the following.
- Elastic Compute Cloud (EC2)
- Elastic Block Store (EBS)
- Elastic MapReduce (EMR)
- Identity and Access Management (IAM)
Why is knowing about AWS important? Not all companies, especially startups, have the financial bandwidth to maintain a cluster in house. So they take advantage of the cost effective solutions offered by cloud service providers like Amazon and Microsoft. If you are not familiar with AWS, don’t worry. AWS is easy to learn and try out. Go to aws.amazon.com and go over the EC2 and EMR tutorials. You will be surprised how simple it is to run a MapReduce program or set up a cluster with EC2 instances.
Please note that there will be some cost involved in using AWS or any cloud service. Also note that there are many cloud service providers; we stress AWS because it is widely used in the industry for Hadoop implementations.
7. NoSQL DATABASES
A good knowledge and understanding of NoSQL databases like HBase, Cassandra and MongoDB is definitely a plus. No matter which NoSQL database you are trying to learn, use the steps below to guide your learning process.
- Understand CAP theorem. By far this article is the best there is in explaining CAP theorem so start with that.
- Check which principles from the CAP theorem your database conforms to.
- Check what level of consistency your database offers.
- Learn about column oriented vs row oriented designs.
- Understand the pros and cons of both column and row oriented designs.
- Check which design (row vs column) is adopted by the database you are learning.
- Learn about the database architecture. The type of nodes involved, how data is partitioned etc.
- Learn about the read path.
- Learn about the write path.
- Learn about the underlying file formats, like HFile in the case of HBase, and the temporary memstores or caches involved, and how they are used in the read and write paths.
- Learn about compaction or the asynchronous checkpointing process.
- Focus on single points of failure.
- Finally, try it out. Download a virtual image or set up a cluster to try out commands, or use the API to perform create, insert and select operations in the database.
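The write path, read path and compaction steps in the list above follow a pattern that is easier to remember once you have seen it in miniature. The class below is a hypothetical toy, not any database's real API. Writes go to an in-memory buffer (think of a memstore), the buffer is flushed to an immutable sorted file once it fills up, reads check memory first and then files from newest to oldest, and compaction merges the files. It deliberately leaves out the write-ahead log, deletes and on-disk formats.

```python
class TinyStore:
    """Toy write/read path: in-memory buffer flushed to immutable sorted files."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}                 # recent writes, held in memory
        self.files = []                    # flushed files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        """Write path: buffer in memory, flush when the buffer is full."""
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Persist the buffer as one sorted, immutable file."""
        self.files.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):
        """Read path: memory first, then files from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for data_file in reversed(self.files):
            if key in data_file:
                return data_file[key]
        return None

    def compact(self):
        """Merge all files into one; newer values overwrite older ones."""
        merged = {}
        for data_file in self.files:
            merged.update(data_file)
        self.files = [dict(sorted(merged.items()))]


store = TinyStore(flush_threshold=2)
store.put("r1", "v1")
store.put("r2", "v1")   # buffer full: first flush
store.put("r1", "v2")
store.put("r3", "v1")   # buffer full: second flush

print(len(store.files))  # 2
print(store.get("r1"))   # v2
store.compact()
print(len(store.files))  # 1
```

Notice that before compaction a read may have to look in several files, which is the usual argument for why compaction improves read performance.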
8. OTHER TOOLS
The Hadoop ecosystem is evolving constantly and at a rapid pace. There are so many tools out there that it is unrealistic for any interviewer to expect someone to know all of them. If time permits, check out the tools below.
- Cloudera Hue
- Cloudera Impala
- Apache Spark
- Apache Storm
- Apache Kafka
- Amazon Kinesis
This is by no means a comprehensive list but it gives you an idea.
To give you an idea of which topics deserve more attention, time and effort, here is the weightage we give to each topic.
- Your Project – 10%
- HDFS & MapReduce – 25%
- Pig & Hive – 25%
- Troubleshooting & Debugging – 15%
- AWS – 10%
- NoSQL Databases – 10%
- Other Tools – 5%
Need more help? Our Hadoop Developer Interview guide has over 100 REAL questions from REAL interviews and you can get the guide for $29.99.