Reputation: 45
I need some clarification about using Hadoop with a large number of documents, around 2 million. I have a file of 2 million lines, and I want to split each line into its own file, copy the files into the Hadoop file system, and compute term frequencies with Mahout, which runs its computation as distributed map-reduce jobs. In other words, I want to treat each of the 2 million lines as a separate document for the term-frequency calculation, so I would end up with one directory containing 2 million single-line documents. Will this create n map tasks for n files, i.e. 2 million maps for the job? That would take a very long time to compute. Is there an alternative way of representing the documents that allows faster computation?
Upvotes: 0
Views: 182
Reputation: 566
Mahout has implementations for calculating TF and IDF for text; check the Mahout library for them. Splitting each line into its own file is not a good idea in the Hadoop map-reduce framework.
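If the goal is to feed Mahout's TF/TF-IDF tooling (seq2sparse), one common alternative is to pack all the lines into a single SequenceFile of (document id, document text) pairs instead of writing 2 million small files. Below is a minimal sketch of that idea, assuming Hadoop 2.x; the class name, the synthetic "/doc-N" keys, and the input/output paths passed as arguments are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LinesToSequenceFile {
    public static void main(String[] args) throws Exception {
        // args[0]: local input file with one document per line (hypothetical path)
        // args[1]: HDFS output path for the SequenceFile (hypothetical path)
        Configuration conf = new Configuration();
        Path output = new Path(args[1]);

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class));
             BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {

            String line;
            long docId = 0;
            while ((line = reader.readLine()) != null) {
                // Key is a synthetic document id, value is the line itself,
                // so each line becomes one "document" without creating a file for it.
                writer.append(new Text("/doc-" + docId++), new Text(line));
            }
        }
    }
}
```

The resulting SequenceFile can then be handed to something like `mahout seq2sparse -i <seqfile dir> -o <vectors>` to produce TF or TF-IDF vectors, rather than running one map task per tiny file.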
Upvotes: 0
Reputation: 8088
2 million files is a lot for Hadoop. More than that, running 2 million tasks will carry roughly 2M seconds of overhead, which means a few days of work for a small cluster. I think the problem is algorithmic in nature: how to map your computation onto the map-reduce paradigm so that you end up with a modest number of mappers. Please drop a few lines about the task you need, and I might suggest an algorithm.
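To make the "modest number of mappers" point concrete, here is a minimal sketch (my own illustration, not a confirmed design) of one way to do it: run a single MapReduce job over the original 2-million-line file, use each line's byte offset as its document id, and count (document, term) pairs. The number of map tasks is then governed by the input splits of one large file rather than by 2 million separate files. Assumes Hadoop 2.x; class names and paths are hypothetical.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineTermFrequency {

    // Each input record is one line of the big file; its byte offset serves as a document id.
    public static class TFMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                // Composite key "docId \t term" so counts are aggregated per document.
                outKey.set(offset.get() + "\t" + tokens.nextToken().toLowerCase());
                context.write(outKey, ONE);
            }
        }
    }

    // Sums the 1s for each (document, term) pair, giving the raw term frequency.
    public static class TFReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "per-line term frequency");
        job.setJarByClass(LineTermFrequency.class);
        job.setMapperClass(TFMapper.class);
        job.setCombinerClass(TFReducer.class);
        job.setReducerClass(TFReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With a 2-million-line input of, say, a few GB, this runs as a handful of map tasks (one per HDFS block) instead of 2 million, which is where the multi-day overhead disappears.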
Upvotes: 1