Throughput is the amount of work done per unit of time (throughput = amount of data processed / time taken). It describes how fast data can be accessed from the system and is commonly used to measure system performance. In HDFS, when we want to perform a task or an action, the work is divided and shared among different nodes, which execute their assigned portions independently and in parallel, so the overall job completes in a much shorter period of time. By reading data in parallel, HDFS drastically reduces the time needed to read the data, and this is how it delivers high throughput.
Because HDFS works on the principle of ‘Write Once, Read Many’, streaming access is extremely important in HDFS. HDFS focuses less on how the data is stored and more on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete dataset sequentially matters more than the latency of fetching a single record from the data.
Yes, this is to avoid DataNode failure. If both rack 2 and the DataNode present in rack 1 fail, then there is no way to get the data back. To avoid such situations, the data needs to be replicated more times than the default. This can be done by increasing the replication factor, which is set to 3 by default.
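For example, a file that needs higher availability could be given a larger replication factor with setrep (a minimal sketch; the path is only illustrative):
$ hadoop fs -setrep -w 4 /data/critical_file
Note: /data/critical_file is a placeholder path; -w waits until the new replication factor is achieved.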
A key-value pair is the intermediate data generated by the mappers and sent to the reducers to produce the final output. For example, in a word-count job the mappers emit pairs such as (word, 1), and the reducers sum the values for each key to produce the final counts.
HDFS supports exclusive writes only. When the first client contacts the NameNode to open a file for writing, the NameNode grants a lease to that client to create the file. When a second client tries to open the same file for writing, the NameNode notices that the lease for the file has already been granted to another client and rejects the second client's open request.
This happens because the NameNode does not have any available DataNodes.
You use the command: hadoop dfsadmin -safemode leave
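If you want to check the current state before (or after) leaving safe mode, the same dfsadmin tool can report it:
hadoop dfsadmin -safemode get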
Using a MapReduce program, a file can be read in parallel by splitting it into blocks. But while writing, the incoming values are not yet known to the system, so MapReduce cannot be applied and no parallel writing is possible.
Use the ‘distcp’ command to copy data from one cluster to another.
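A minimal sketch of a distcp invocation between two clusters (the host names, port, and paths are placeholders):
hadoop distcp hdfs://namenode1:8020/source/dir hdfs://namenode2:8020/dest/dir
Note: namenode1 and namenode2 stand for the NameNodes of the source and destination clusters.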
hadoop dfsadmin -report
The NameNode does not have any available DataNodes.
HDFS is not good at handling a large number of small files, because every file, directory and block in HDFS is represented as an object in the NameNode's memory, each of which occupies roughly 150 bytes. A file that fits in a single block therefore needs at least two such objects (one for the file, one for the block), or about 300 bytes, so 10 million files, each using a block, would use about 3 gigabytes of memory. When we go to a billion files, the memory requirement on the NameNode cannot be met.
The replication factor in HDFS can be modified in two ways:
1. On a per-file basis:
$ hadoop fs -setrep -w 2 /my/test_file
Note: test_file is the name of the file whose replication factor will be set to 2.
2. On a per-directory basis:
$ hadoop fs -setrep -w 5 /my/test_dir
Note: test_dir is the name of the directory; all the files in this directory will have their replication factor set to 5.
Replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times blocks are replicated, in order to ensure high data availability. For every block stored in HDFS, the cluster keeps n-1 additional copies besides the original. So, if the replication factor during the PUT operation is set to 1 instead of the default value of 3, there will be only a single copy of the data. In that case, if the DataNode holding it crashes for any reason, that single copy of the data is lost.
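As an illustrative sketch, the replication factor can be overridden for a single PUT operation (the file name and target path are placeholders):
hadoop fs -Ddfs.replication=1 -put data.txt /user/hadoop/data.txt
Note: data.txt and /user/hadoop/data.txt are example paths only.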
HDFS does not support modifications at arbitrary offsets in a file or multiple concurrent writers; files are written by a single writer in append-only fashion, i.e. writes to a file in HDFS are always made at the end of the file.
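For instance, new data can only be added at the end of an existing file, e.g. with appendToFile (both paths are placeholders):
hadoop fs -appendToFile local_log.txt /user/hadoop/existing_log.txt
Note: local_log.txt is a local file and /user/hadoop/existing_log.txt an existing HDFS file; both are example names.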
There is no such thing as a NameNode without data. If it is a NameNode, then it must have some sort of data in it.
The logical division of data is known as a split, while the physical division of data is known as an HDFS block. By default an input split corresponds to one block, but the split size can be configured independently of the block size.
The best way to copy files between HDFS clusters is the distcp command, which runs the copy as multiple map tasks across multiple nodes so the workload is shared.
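A sketch of such a copy that spreads the work over up to 20 map tasks via the -m option (cluster names and paths are placeholders):
hadoop distcp -m 20 hdfs://cluster1/source/dir hdfs://cluster2/dest/dir
Note: cluster1 and cluster2 stand for the source and destination NameNodes.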
There are a few ways to do this; the commands below illustrate them.
hadoop fs -setrep -w 5 -R hadoop-test
hadoop fs -Ddfs.replication=5 -cp hadoop-test/test.csv hadoop-test/test_with_rep5.csv
hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication factor for all files placed in HDFS. You can also modify the replication factor on a per-file basis using the Hadoop FS shell:
$ hadoop fs -setrep -w 3 /my/file
Conversely, you can also change the replication factor of all the files under a directory:
$ hadoop fs -setrep -w 3 -R /my/dir
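For reference, a minimal sketch of the corresponding hdfs-site.xml entry (the value shown is simply the default of 3):
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>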