Reputation: 39
I have to load a lot of files onto my cluster (around 500,000), and it takes a very long time. Each file is gzip compressed and is about 80 MB.
For the moment I load the files in a while loop with a put, but maybe you have a better solution...
Thanks for your help.
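For reference, here is a minimal sketch of the kind of loop I mean (the paths are just placeholders):

# each file is pushed one at a time, so the HDFS round trip is paid ~500,000 times
for f in /local/data/*.gz; do
    hadoop fs -put "$f" /data/in/
done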
Upvotes: 1
Views: 489
Reputation: 695
Maybe you can look into the DataLoader of PivotalHD, which loads data in parallel using map jobs and is therefore faster. Check this link: PivotalHD DataLoader.
Upvotes: 1
Reputation: 685
You can use BuildSequenceFileFromDir from Binarypig, available at https://github.com/endgameinc/binarypig
Upvotes: 0
Reputation: 2345
It's hard to understand the problem the way you explain it.
HDFS supports gzip compression, but gzip files are not splittable. Since your files are ~80 MB each, splitting is not a big problem for you; just make sure to use a block size of 128 MB or larger.
Concerning file uploading, why don't you simply upload the whole directory with the -put command?
hadoop fs -put local/path/to/dir path/in/hdfs
will do the trick.
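If you don't want to change the cluster-wide default, the block size can also be set per upload. A sketch (the property is dfs.block.size on older Hadoop releases; paths are placeholders):

# 134217728 bytes = 128 MB block size for this upload only
hadoop fs -D dfs.blocksize=134217728 -put local/path/to/dir path/in/hdfs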
Upvotes: 3