Hadoop - HDFS Interview Questions Part-3

Questions are tagged by level:

  • B: Basic Level Interview Questions
  • I: Intermediate Level Interview Questions
  • A: Advanced Level Interview Questions

A What is throughput? How does HDFS achieve good throughput?

Throughput is the amount of work done per unit of time. It describes how fast data can be accessed from the system and is commonly used to measure system performance. In HDFS, when we want to perform a task or an action, the work is divided and shared among different nodes, so all the nodes execute their assigned tasks independently and in parallel and the work completes in a very short period of time. By reading data in parallel from many DataNodes, HDFS reduces the actual time to read data tremendously, and this is how it achieves good throughput.
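
You can see this block-level parallelism from the client side by listing where each block of a file lives. Below is a minimal Java sketch; the path /data/input.txt is a placeholder, and the default Configuration is assumed to point at your cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml/hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/input.txt"); // placeholder path
            FileStatus status = fs.getFileStatus(file);

            // Each block reports the DataNodes holding a replica; readers can
            // fetch different blocks from different nodes at the same time.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + " length " + block.getLength()
                        + " hosts " + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }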

A What is streaming access?

As HDFS works on the principle of ‘Write Once, Read Many’, the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data as on retrieving it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data matters more than the time taken to fetch a single record from the data.

I Do we need to place the 2nd and 3rd replicas in rack 2 only?

Yes. Under the default replica placement policy, the second and third replicas are stored on two different DataNodes in a rack other than the first replica's rack, so the data survives both a DataNode failure and the failure of an entire rack.

A What if rack 2 and the DataNode in rack 1 both fail?

If both rack 2 and the DataNode in rack 1 fail, there is no remaining copy to read the data from. To avoid such situations, we need to replicate the data more times instead of replicating it only thrice. This can be done by changing the replication factor, which is set to 3 by default.

A What is a ‘key-value pair’ in HDFS?

A key-value pair is the intermediate data generated by the map tasks and sent to the reduce tasks for generating the final output. (Strictly speaking, key-value pairs are a MapReduce concept; they are the form in which data stored in HDFS is processed.)
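
For illustration, here is a minimal word-count style mapper that emits (word, 1) pairs; the class and variable names are illustrative, but the Mapper API shown is the standard org.apache.hadoop.mapreduce one:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits one (word, 1) key-value pair per token; the framework sorts and
    // groups these intermediate pairs by key before calling the reducer.
    public class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // the intermediate key-value pair
                }
            }
        }
    }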

A What happens when two clients try to access the same file on the HDFS?

HDFS supports exclusive writes only. When the first client contacts the NameNode to open the file for writing, the NameNode grants that client a lease to create the file. When a second client tries to open the same file for writing, the NameNode sees that the lease is already granted to another client and rejects the second client's open request.
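
A sketch of what the second writer sees is below. The exact exception type can vary by Hadoop version, but the concurrent create attempt typically surfaces as an IOException from the NameNode (e.g. wrapping AlreadyBeingCreatedException); the path is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ExclusiveWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path file = new Path("/tmp/lease-demo.txt"); // placeholder path

            // First client: the NameNode grants it the lease on the file.
            FileSystem client1 = FileSystem.newInstance(conf);
            FSDataOutputStream out = client1.create(file);
            out.writeBytes("writer one holds the lease\n");

            // Second client: opening the same file for write is rejected
            // while the lease is held by the first client.
            FileSystem client2 = FileSystem.newInstance(conf);
            try {
                client2.create(file);
            } catch (java.io.IOException expected) {
                System.out.println("second writer rejected: " + expected.getMessage());
            }

            out.close(); // closing the stream releases the lease
        }
    }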

A Why do we sometimes get a “file could only be replicated to 0 nodes, instead of 1” error?

This happens because the NameNode has no DataNodes available to place the block on (for example, because they are all down, full, or unreachable from the client).

A How does one switch off "safe mode" in HDFS?

You use the command: hadoop dfsadmin -safemode leave
(You can check the current state with hadoop dfsadmin -safemode get.)

A Why is ‘reading’ done in parallel in HDFS while ‘writing’ is not?

Using a MapReduce program, a file can be read in parallel by splitting it into blocks and processing the blocks independently. While writing, however, the incoming values are not yet known to the system, so the data cannot be split up front; a file is written sequentially by a single writer, and no parallel writing is possible. (Different files can still be written concurrently; it is a single file that has only one writer at a time.)

A Copy a directory from one node in the cluster to another.

Use the distcp command to copy, for example: hadoop distcp /source/dir /destination/dir (the paths shown are placeholders).

A Is there an HDFS command to see the available free space in HDFS?

Yes: hadoop dfsadmin -report shows the configured capacity, used space, and remaining free space of the cluster. (hadoop fs -df -h gives a shorter summary.)

A What does "file could only be replicated to 0 nodes, instead of 1" mean?

The namenode does not have any available DataNodes.

A What are the problems with small files in HDFS?

HDFS is not good at handling a large number of small files, because every file, directory, and block in HDFS is represented as an object in the NameNode's memory, each occupying approximately 150 bytes. So 10 million files, each using one block, would use about 3 gigabytes of memory (each file contributes a file object plus a block object: 20 million objects × 150 bytes ≈ 3 GB). At a billion files, the NameNode's memory requirement can no longer be met.

A How can you overwrite the replication factor in HDFS?

The replication factor in HDFS can be modified in 2 ways:

  • Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the command below:
    $ hadoop fs -setrep -w 2 /my/test_file

    Note: test_file is the file whose replication factor will be set to 2.

  • Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified using the command below:
    $ hadoop fs -setrep -w 5 /my/test_dir

    Note: test_dir is the directory; all files under it will have their replication factor set to 5.
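
The same change can also be made programmatically through the FileSystem API. A minimal Java sketch, assuming the default Configuration points at your cluster (the path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Ask the NameNode to re-replicate this file to 2 copies;
            // returns true if the request was accepted.
            boolean ok = fs.setReplication(new Path("/my/test_file"), (short) 2);
            System.out.println("replication change accepted: " + ok);
            fs.close();
        }
    }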

A Explain what happens if, during a PUT operation, an HDFS block is assigned a replication factor of 1 instead of the default value 3.

The replication factor is an HDFS property that can be set for the entire cluster to adjust the number of times blocks are replicated, ensuring high data availability. For a replication factor of n, the cluster stores n copies of every block, i.e. n-1 duplicates of the original. So if the replication factor during the PUT operation is set to 1 instead of the default 3, only a single copy of the data exists, and if the DataNode holding it crashes for any reason, that sole copy of the data is lost.
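
For illustration, this is what a write with an explicit replication factor of 1 looks like through the Java API (the path and buffer size are placeholders); losing the single DataNode holding the block would then lose the data:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutWithSingleReplica {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/single-copy.txt"); // placeholder path

            // Explicit replication factor of 1: only one DataNode will
            // ever hold this block, so its failure loses the data.
            FSDataOutputStream out = fs.create(
                    file, true, 4096, (short) 1, fs.getDefaultBlockSize(file));
            out.writeBytes("this data exists on exactly one DataNode\n");
            out.close();
            fs.close();
        }
    }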

I What is the process to change files at arbitrary locations in HDFS?

There is none: HDFS does not support modifications at arbitrary offsets in a file, nor multiple concurrent writers. Files are written by a single writer in append-only fashion, i.e. writes to a file in HDFS are always made at the end of the file.
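
The append-only model is visible in the Java API: a file can be reopened to add bytes at its end, but there is no call that writes at an arbitrary offset. A minimal sketch (the path is a placeholder; assumes append support is enabled, as it is on modern clusters):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendOnly {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/logs/events.log"); // placeholder path

            // append() positions the writer at the current end of file;
            // there is no API to overwrite bytes in the middle.
            FSDataOutputStream out = fs.append(file);
            out.writeBytes("new record at the end of the file\n");
            out.close();
            fs.close();
        }
    }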

I What happens to a NameNode that has no data?

There is no such thing as a NameNode without data: if a node is running as a NameNode, it holds at least the filesystem metadata, i.e. the namespace and the block-to-DataNode mapping.

I Explain the difference between an input split and an HDFS block.

The logical division of the data, used to assign work to the mappers, is known as a split, while the physical division of the data as stored on the DataNodes is known as an HDFS block. A single split can span block boundaries.

I What is the best way to copy files between HDFS clusters?

The best way to copy files between HDFS clusters is the distcp command, which runs as a MapReduce job across multiple nodes so that the copy workload is shared.

A How do you overwrite the replication factor?

There are a few ways to do this. Look at the illustrations below.

hadoop fs -setrep -R -w 5 hadoop-test
hadoop fs -Ddfs.replication=5 -cp hadoop-test/test.csv hadoop-test/test_with_rep5.csv

The first command recursively sets the replication factor of everything under hadoop-test to 5 and waits for the change to complete; the second makes a copy of the file that is written with a replication factor of 5.

A How do you configure the replication factor in HDFS?

hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml changes the default replication factor for all files placed in HDFS. You can also modify the replication factor on a per-file basis using the Hadoop FS shell:

$ hadoop fs -setrep -w 3 /my/file

Conversely, you can also change the replication factor of all the files under a directory:

$ hadoop fs -setrep -R -w 3 /my/dir
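
For reference, the cluster-wide default mentioned above is the dfs.replication property in hdfs-site.xml; a typical entry (3 being the usual default) looks like:

    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>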