Reputation: 21
I have 4 files on HDFS: 1.txt, 2.txt, 3.txt, and 4.txt. Of these 4 files, the first 3 have the contents shown below and 4.txt is empty. How many mappers are executed?
Number of mappers = number of input splits.
My question is: are all these files stored in one 64 MB block or in 4 different blocks, since each file's data is less than 64 MB?
1.txt This is text file 1
2.txt This is text file 2
3.txt This is text file 3
4.txt "Empty"
Upvotes: 0
Views: 162
Reputation: 4751
By default, HDFS does not combine small files into a single block. HDFS stores each file in its own block, so your 4 files (each smaller than dfs.block.size) will use 4 blocks. This does not mean that HDFS will occupy 4 * 64 MB of space, since a block only consumes as much disk as the data written to it. Hence your MR job will spawn 4 mappers to read all the files.
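If you want to check this yourself, here is a minimal sketch (not part of the original answer) that uses the HDFS Java API to print each file's length and block count; the directory /user/me/input is a hypothetical placeholder for wherever the 4 files were uploaded:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical directory holding 1.txt ... 4.txt
        for (FileStatus status : fs.listStatus(new Path("/user/me/input"))) {
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            System.out.println(status.getPath().getName()
                    + "  length=" + status.getLen()
                    + "  blocks=" + blocks.length);
        }
    }
}
```

Each non-empty small file should report a single block; the empty 4.txt reports zero blocks and uses no disk space at all.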
Ideally, you should not store small files on HDFS, as they increase the load on the NameNode.
You can combine the files with a Unix utility before uploading them to HDFS, convert them to sequence files, or write a Pig script/Hive script/MapReduce job to combine all the small files into bigger files. Small files on HDFS are described very well here: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
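As an illustration of the sequence-file option (a rough sketch only, with hypothetical paths /user/me/input and /user/me/packed.seq), the following packs the small files into one SequenceFile keyed by the original file name:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path("/user/me/input");       // hypothetical input dir
        Path out = new Path("/user/me/packed.seq"); // hypothetical output file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(in)) {
                byte[] data = new byte[(int) status.getLen()];
                try (FSDataInputStream stream = fs.open(status.getPath())) {
                    stream.readFully(0, data);
                }
                // Key = original file name, value = raw file contents
                writer.append(new Text(status.getPath().getName()),
                        new BytesWritable(data));
            }
        }
    }
}
```

One bigger SequenceFile like this is read with far fewer splits than many tiny text files, which is the point of the small-files advice above.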
Upvotes: 0
Reputation: 2231
The files would be stored in 4 different blocks unless you wrap them up and store them in a HAR file. The concept is: if a file is larger than the block size, that single file is split and stored across different blocks; if it is smaller than the block size, each file is stored independently in its own block. However, a file will not use more space than its actual size, even if the block size is 64 MB or more. Quoting from The Definitive Guide:
HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode.
So in your case it would still use 4 mappers as we have 4 blocks.
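A rough way to confirm the mapper count yourself (sketch only; the input directory /user/me/input is hypothetical) is to ask TextInputFormat for its input splits, since one map task is launched per split:

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        // Hypothetical directory containing 1.txt ... 4.txt
        FileInputFormat.addInputPath(job, new Path("/user/me/input"));

        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("Input splits (= map tasks): " + splits.size());
    }
}
```

For the 4 files above this should report 4 splits, since each file smaller than the block size gets its own split.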
Upvotes: 2