Reputation: 4338
My application needs to process a couple of TB worth of tabular data. At the moment, the data is saved as several huge CSV files. I can control how the files are provided to my M/R job, and I am wondering what the preferred file format is to make the job run faster. For instance, is there any point in saving the input data as sequence files instead of the text files I am using now? Will that make my M/R job run noticeably faster?
Upvotes: 1
Views: 145
Reputation: 66709
From the perspective of "file format", I don't think using SequenceFile will be a great improvement over a text file for CSV data. If each CSV record were a single (key, value) pair, using SequenceFile over a text file would have made sense.
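To illustrate why: with the default TextInputFormat the mapper already receives each CSV line as the value, so a SequenceFile that stores the same line would not save any parsing work. Below is a minimal sketch of such a mapper; the class name and the column layout (first column as key, rest as value) are purely illustrative assumptions, not something from the question.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a mapper over CSV text input: the framework hands us
// (byte offset, line), and the CSV parsing still has to happen here,
// whether the line came from a plain text file or a SequenceFile.
public class CsvLineMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] cols = line.toString().split(",");
        if (cols.length > 1) {
            // Hypothetical layout: emit the first column as the key
            // and the remainder of the row as the value.
            context.write(new Text(cols[0]),
                          new Text(line.toString().substring(cols[0].length() + 1)));
        }
    }
}
```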
However, I am intrigued by RCFile (Record Columnar File), which should lend itself well to CSV-like data. I have used it with Hive tables and achieved a significant improvement in execution time for Hive queries. I am assuming that was due to execution efficiency in M/R, since Hive queries get translated to M/R programs.
Ref: http://www.ixwebhosting.mobi/2011/10/06/4823.html
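If you want to experiment with the RCFile route, one way is to load the CSV data into a Hive table and copy it into an RCFile-backed table. The sketch below does this through Hive's JDBC interface (assuming a HiveServer2 endpoint); the JDBC URL, table names (`csv_events`, `rc_events`) and columns are hypothetical and will need to be adapted to your setup.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: rewrite a row-oriented CSV-backed Hive table into an
// RCFile-backed table, so later M/R-backed queries read columnar data.
public class CsvToRcFile {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
            // Hypothetical columnar copy of an existing csv_events table.
            stmt.execute("CREATE TABLE IF NOT EXISTS rc_events (id STRING, payload STRING) "
                       + "STORED AS RCFILE");
            // Rewrite the CSV rows into the columnar layout.
            stmt.execute("INSERT OVERWRITE TABLE rc_events "
                       + "SELECT id, payload FROM csv_events");
        }
    }
}
```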
Upvotes: 1