syko

Reputation: 3637

Concatenating multiple text files into one very large file in HDFS

I have multiple text files. Their total size exceeds the largest disk size available to me (~1.5 TB).

A Spark program reads a single input text file from HDFS, so I need to combine those files into one. (I cannot rewrite the program code; I am given only the *.jar file for execution.)

Does HDFS have such a capability? How can I achieve this?

Upvotes: 1

Views: 2360

Answers (3)

jedijs

Reputation: 563

You can do it with a Pig job:

A = LOAD '/path/to/inputFiles' as (SCHEMA);
STORE A INTO '/path/to/outputFile';

Doing an hdfs cat and then putting the result back into HDFS means all of this data is processed on the client node and will degrade your network.

Upvotes: 1

maxteneff

Reputation: 1531

HDFS by itself does not provide such a capability. All out-of-the-box approaches (like hdfs dfs -text * with pipes, or FileUtil's copy methods) transfer all the data through your client machine.

In my experience, we have always used our own MapReduce jobs to merge many small files in HDFS in a distributed way.

So you have two solutions:

  1. Write your own simple MapReduce/Spark job that combines text files in your format.
  2. Find an already implemented solution for this kind of purpose.

Regarding solution #2: there is a simple project, FileCrush, for combining text or sequence files in HDFS. It might be suitable for you; check it out.

Example of usage:

hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728  \
  --input-format=text  \
  --output-format=text \
  --compress=none \
  /input/dir /output/dir 20161228161647 

I had problems running it without these options (especially -Ddfs.block.size and the output file date prefix 20161228161647), so make sure you supply them.

Upvotes: 1

piyush pankaj

Reputation: 755

What I understood from your question is that you want to concatenate multiple files into one. Here is a solution which might not be the most efficient way of doing it, but it works. Suppose you have two files, file1 and file2, and you want to combine them into a single file, ConcatenatedFile. Here is the script for that:

hadoop fs -cat /hadoop/path/to/file/file1.txt /hadoop/path/to/file/file2.txt | hadoop fs -put - /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt
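This works around the disk limit because the data is streamed: hadoop fs -cat writes to stdout and hadoop fs -put - reads from stdin, so nothing is staged on the client's disk (it does, however, all flow through the client's network link, as the other answers note). A minimal local sketch of the same streaming idea, using plain Python file objects instead of HDFS:

```python
import shutil


def concatenate(sources, destination, chunk_size=1024 * 1024):
    """Stream several files into one, chunk by chunk.

    Only chunk_size bytes are held in memory at a time, so the total
    size of the inputs can exceed both RAM and any single staging disk.
    """
    with open(destination, "wb") as out:
        for path in sources:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out, chunk_size)
```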

Hope this helps.

Upvotes: 1
