Reputation: 4338
My application needs to process a couple of TB worth of tabular data. At the moment, the data is saved as several huge CSV files. I can control how the files are provided to my M/R job, and I am wondering what the preferred file format is to make the job run faster. For instance, is there any point in saving the input data as sequence files instead of the text files I am using now? Will that make my M/R job run noticeably faster?
Upvotes: 1
Views: 145
Reputation: 66709
From the perspective of "file format", I don't think using SequenceFile will be a great improvement over a text file for CSV data. If each CSV record were a single (key, value) pair, using SequenceFile over a text file would have made sense.
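To illustrate why: with the default TextInputFormat the mapper already receives each CSV line as the value, so a SequenceFile that stores the same line would not save any parsing work. Below is a minimal sketch of such a mapper; the class name and the column layout (first column as key, rest as value) are purely illustrative assumptions, not something from the question.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a mapper over CSV text input: the framework hands us
// (byte offset, line), and the CSV parsing still has to happen here,
// whether the line came from a plain text file or a SequenceFile.
public class CsvLineMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] cols = line.toString().split(",");
        if (cols.length > 1) {
            // Hypothetical layout: emit the first column as the key
            // and the remainder of the row as the value.
            context.write(new Text(cols[0]),
                          new Text(line.toString().substring(cols[0].length() + 1)));
        }
    }
}
```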
However, I am intrigued by RCFile (Record Columnar File), which should lend itself well to CSV-like data. I have used it with Hive tables and achieved a significant improvement in execution time for Hive queries. I am assuming that was due to execution efficiency in M/R, since Hive queries get translated to M/R programs.
Ref: http://www.ixwebhosting.mobi/2011/10/06/4823.html
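If you want to experiment with the RCFile route, one way is to load the CSV data into a Hive table and copy it into an RCFile-backed table. The sketch below does this through Hive's JDBC interface (assuming a HiveServer2 endpoint); the JDBC URL, table names (`csv_events`, `rc_events`) and columns are hypothetical and will need to be adapted to your setup.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: rewrite a row-oriented CSV-backed Hive table into an
// RCFile-backed table, so later M/R-backed queries read columnar data.
public class CsvToRcFile {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
            // Hypothetical columnar copy of an existing csv_events table.
            stmt.execute("CREATE TABLE IF NOT EXISTS rc_events (id STRING, payload STRING) "
                       + "STORED AS RCFILE");
            // Rewrite the CSV rows into the columnar layout.
            stmt.execute("INSERT OVERWRITE TABLE rc_events "
                       + "SELECT id, payload FROM csv_events");
        }
    }
}
```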
Upvotes: 1