Reputation: 11
I am having an issue with MapReduce. I have to read multiple CSV files, and each CSV file produces exactly one output row.
I cannot let my custom input format split the CSV files, because the rows within a file are not in the same format. For example:
row 1 contains A, B, C
row 2 contains D, E, F
and my output value should be something like A, B, D, F, so a single mapper has to see the whole file.
I have 1100 CSV files, so 1100 splits and hence 1100 mappers are created. The mappers are very simple and shouldn't take much time to process their input, yet the job takes a long time to get through the 1100 files.
Can anyone please guide me on what to look at, or point out whether I am doing anything wrong in this approach?
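For context, a minimal sketch of such a mapper, assuming the custom input format delivers each whole file to map() as a single BytesWritable value (the types and field positions here are illustrative, not the actual code):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes the custom input format hands map() the complete CSV file as one value,
// so both rows of a file are visible in a single call.
public class CsvFileMapper extends Mapper<NullWritable, BytesWritable, NullWritable, Text> {

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        String[] rows = new String(value.copyBytes(), StandardCharsets.UTF_8).split("\r?\n");
        String[] row1 = rows[0].split(",");   // A, B, C
        String[] row2 = rows[1].split(",");   // D, E, F
        // Pick A, B from row 1 and D, F from row 2 -> the single output row A, B, D, F.
        String out = row1[0].trim() + ", " + row1[1].trim() + ", "
                   + row2[0].trim() + ", " + row2[2].trim();
        context.write(NullWritable.get(), new Text(out));
    }
}
```

With this shape, each map() call only does a couple of string splits, which is why the per-mapper work itself is cheap.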
Upvotes: 1
Views: 367
Reputation: 790
Hadoop performs better with a small number of large files, as opposed to a huge number of small files. ("Small" here means significantly smaller than a Hadoop Distributed File System (HDFS) block.) The technical reasons for this are well explained in this Cloudera blog post:
Map tasks usually process a block of input at a time (using the default FileInputFormat). If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
You can refer to this link for methods of solving this issue; one common method is sketched below.
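For illustration, one such method is to pack many small files into each split with Hadoop's CombineFileInputFormat while still handing the mapper one whole file per record, which matches the one-file-one-row requirement above. A minimal sketch under that assumption (class names are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Packs many small CSV files into each split while still delivering one whole
// file per record, so far fewer map tasks are launched.
public class CombineWholeFileInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

    public CombineWholeFileInputFormat() {
        setMaxSplitSize(128 * 1024 * 1024); // cap the data per combined split (~1 HDFS block)
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // keep each CSV file intact
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        return new CombineFileRecordReader<>((CombineFileSplit) split, context, WholeFileReader.class);
    }

    // Reads the idx-th file of the combined split as a single record:
    // key = file path, value = full file contents.
    public static class WholeFileReader extends RecordReader<Text, BytesWritable> {
        private final CombineFileSplit split;
        private final TaskAttemptContext context;
        private final int idx;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        public WholeFileReader(CombineFileSplit split, TaskAttemptContext context, Integer idx) {
            this.split = split;
            this.context = context;
            this.idx = idx;
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) { }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path path = split.getPath(idx);
            byte[] contents = new byte[(int) split.getLength(idx)];
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            FSDataInputStream in = null;
            try {
                in = fs.open(path);
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(path.toString());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}
```

The driver would then use job.setInputFormatClass(CombineWholeFileInputFormat.class), and the mapper's input key becomes the file path as Text. With 1100 tiny files this typically collapses the job from 1100 map tasks to a handful, each of which processes many files.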
Upvotes: 1