DarqMoth

Reputation: 603

Hadoop: tools to concatenate CSV files in HDFS?

I have several huge CSV files with the same structure stored in HDFS. Are there any tools to concatenate these files into a single CSV file?

Upvotes: 4

Views: 4320

Answers (2)

Manoj S

Reputation: 66

You can use a very simple Pig job to do this.

-- Load every CSV under the input directory; replace SCHEMA with the actual column list.
A = LOAD '/path/to/csv/files/*.csv' AS (SCHEMA);
STORE A INTO '/path/to/output';

But remember, the output of any MapReduce job (including Pig) will be in the form of part files, not a single file.
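If those part files then need to be collapsed into a single CSV, hadoop fs -getmerge will concatenate them into one local file (a minimal sketch; the paths are illustrative, matching the example above). Note that, like the cat approach discussed below, this pulls all the data through the client node:

# Concatenate every part file under the Pig output directory into one
# local file, then push the result back into HDFS.
hadoop fs -getmerge /path/to/output /tmp/merged.csv
hadoop fs -put /tmp/merged.csv /path/to/merged.csv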

@Donald: I agree with your second option (using the identity mapper and reducer). The only catch is that the output will be sorted by key, and we have no control over this sorting.

But I don't agree with this:

hadoop fs -cat myfiles/*.csv | hadoop fs -put - myfiles_together.csv

The requirement is to concatenate several huge CSV files. Doing an HDFS cat and then putting the result back into HDFS means all of this data passes through the client node, which will definitely choke the network and the client node itself.

Upvotes: 2

Donald Miner

Reputation: 39903

hadoop fs -cat myfiles/*.csv | hadoop fs -put - myfiles_together.csv

This concatenates the file contents and then passes them back up into HDFS via put. The - tells put to use standard input as the file contents rather than a local file. This is better than pulling the data down locally and then pushing it back up, because it never touches local disk.

So, you might say "Hey! That's not scalable!" Well, unfortunately for you, there is no scalable way to write out one large file in HDFS. You have to write that single file sequentially, in a single thread. My basic argument is that you are going to be bottlenecked by writing the single new file, so distributing the reading of the data or anything tricky like that is going to be moot.


There is another way:

Write a MapReduce job that uses the identity mapper and identity reducer (the defaults). Set the number of reducers to 1. This will funnel all the data down into one reducer, which will then write out one file.
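A minimal sketch of such a job using Hadoop Streaming, which avoids writing any Java (the streaming jar path varies by distribution, and cat as the mapper/reducer is one way to get identity behavior; the input/output paths are illustrative):

# Identity map and reduce; a single reducer funnels everything into
# one output file (part-00000) under myfiles_merged.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=1 \
  -input myfiles \
  -output myfiles_merged \
  -mapper cat \
  -reducer cat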

This has the downside of shuffling the records and not preserving record order... which probably doesn't matter.

It also has the downside of being a full MapReduce job: there will be significant overhead compared to the simpler one-liner above.

Upvotes: 4
