Reputation: 39
I have to load a lot of files onto my cluster (around 500,000), and it takes a very long time. Each file is gzip compressed and is about 80 MB.
For the moment I load the files in a while loop with a put, but maybe you have a better solution...
Thanks for your help.
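For reference, here is a minimal sketch of the kind of loop I mean (the paths are just placeholders):

# each file is pushed one at a time, so the HDFS round trip is paid ~500,000 times
for f in /local/data/*.gz; do
    hadoop fs -put "$f" /data/in/
done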
Upvotes: 1
Views: 489
Reputation: 695
Maybe you can look into the DataLoader of PivotalHD, which loads data in parallel using map jobs and is therefore faster. Check this link: PivotalHD DataLoader.
Upvotes: 1
Reputation: 685
You can use BuildSequenceFileFromDir from Binarypig, available at https://github.com/endgameinc/binarypig
Upvotes: 0
Reputation: 2345
It's hard to understand the problem the way you explain it.
HDFS supports gzip compression, but gzip files are not splittable. Since your files are ~80 MB each, splitting is not a big problem for you; just make sure to use a block size of 128 MB or larger.
Concerning file uploading, why don't you simply upload the whole directory with the -put command?
hadoop fs -put local/path/to/dir path/in/hdfs
will do the trick.
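If you don't want to change the cluster-wide default, the block size can also be set per upload. A sketch (the property is dfs.block.size on older Hadoop releases; paths are placeholders):

# 134217728 bytes = 128 MB block size for this upload only
hadoop fs -D dfs.blocksize=134217728 -put local/path/to/dir path/in/hdfs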
Upvotes: 3