earl

Reputation: 768

Merge multiple files recursively in HDFS

My directory structure in HDFS looks like this:

/data/topicname/year=2017/month=02/day=28/hour=00
/data/topicname/year=2017/month=02/day=28/hour=01
/data/topicname/year=2017/month=02/day=28/hour=02
/data/topicname/year=2017/month=02/day=28/hour=03

Inside these paths I have many small JSON files. I am writing a shell script that merges all the files inside each of these directories into a single file, with one merged file per directory, named according to the path.

Example:

All JSONs inside /data/topicname/year=2017/month=02/day=28/hour=00 into one merged file full_2017_02_28_00.json

All JSONs inside /data/topicname/year=2017/month=02/day=28/hour=01 into one merged file full_2017_02_28_01.json

All JSONs inside /data/topicname/year=2017/month=02/day=28/hour=02 into one merged file full_2017_02_28_02.json and so on.

Keeping the file names in the above pattern is a secondary goal that I will tackle later; for now I can hard-code the filenames.

But recursive concatenation across the directory structure is not working.

So far, I have tried the following:

hadoop fs -cat /data/topicname/year=2017/* | hadoop fs -put - /merged/test1.json

Error:

cat: `/data/topicname/year=2017/month=02/day=28/hour=00': Is a directory
cat: `/data/topicname/year=2017/month=02/day=28/hour=01': Is a directory
cat: `/data/topicname/year=2017/month=02/day=28/hour=02': Is a directory

The cat is not applied recursively in the attempt above.
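
(A file-level glob for a single hour directory, as in the sketch below with a hard-coded output name, would presumably avoid the "Is a directory" error, but it would still have to be repeated for every hour directory.)

hadoop fs -cat /data/topicname/year=2017/month=02/day=28/hour=00/* | hadoop fs -put - /merged/full_2017_02_28_00.json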

hadoop fs -ls /data/topicname/year=2017/month=02 | find /data/topicname/year=2017/month=02/day=28 -name '*.json' -exec cat {} \; > output.json

Error:

find: ‘/data/topicname/year=2017/month=02/day=28’: No such file or directory

In this attempt, find runs against the local file system instead of HDFS.
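
(Listing paths in HDFS needs hadoop fs -ls -R rather than a local find. A minimal sketch that extracts only the .json file paths, assuming the path is the last field of the listing output:)

hadoop fs -ls -R /data/topicname/year=2017/month=02/day=28 | awk '{print $NF}' | grep '\.json$'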

for i in `hadoop fs -ls -R /data/topicname/year=2017/ | cut -d' ' -f19` ;do `hadoop fs -cat $i/* |hadoop fs -put - /merged/output.json`; done

Error:

The message "cannot write output to stream" is repeated multiple times, and the file name /merged/output.json appears a few times.

How is this achievable? I do not want to use Spark.

Upvotes: 2

Views: 3088

Answers (2)

earl

Reputation: 768

I was able to achieve my goal with the script below:

#!/bin/bash

# Loop over every month, day and hour partition of 2017.
for k in 01 02 03 04 05 06 07 08 09 10 11 12
do
        for j in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
        do
                for i in 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
                do
                        # Concatenate all files in the hour directory into one merged file in HDFS.
                        hadoop fs -cat /data/topicname/year=2017/month=$k/day=$j/hour=$i/* | hadoop fs -put - /merged/TEST1/2017_${k}_${j}_${i}.json

                        # Delete the merged file if it turned out empty (e.g. because the partition does not exist).
                        hadoop fs -du -s /merged/TEST1/2017_${k}_${j}_${i}.json > /home/test/sizetest.txt
                        x=$(awk '{ print $1 }' /home/test/sizetest.txt)
                        echo "$x"
                        if [ "$x" -eq 0 ]
                        then
                                hadoop fs -rm /merged/TEST1/2017_${k}_${j}_${i}.json
                        else
                                echo "MERGE DONE!!! All files generated at hour $i of $j-$k-2017 merged into one"
                                echo "DELETED 0 SIZED FILES!!!!"
                        fi
                done
        done
done

# Remove the temporary size file and the original source directory.
rm -f /home/test/sizetest.txt
hadoop fs -rm -r /data/topicname
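
As a side note (not part of the original script), the zero-size check could presumably be done without the temporary file by using hadoop fs -test -z, which returns success when a file is zero length. A sketch with a hypothetical example path:

f=/merged/TEST1/2017_02_28_00.json
if hadoop fs -test -z "$f"
then
        hadoop fs -rm "$f"
fi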

Upvotes: -1

franklinsijo

Reputation: 18290

Use -appendToFile:

for file in `hdfs dfs -ls -R /src_folder | awk '$2!="-" {print $8}'`; do hdfs dfs -cat $file | hdfs dfs -appendToFile - /target_folder/filename;done

The time taken will depend on the number and size of the files, since the process is sequential.
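
If there is enough local disk space for one directory's worth of data, hdfs dfs -getmerge is another option: it concatenates all files under a source directory into a single local file, which can then be put back into HDFS. A sketch with hard-coded paths:

hdfs dfs -getmerge /data/topicname/year=2017/month=02/day=28/hour=00 /tmp/full_2017_02_28_00.json
hdfs dfs -put /tmp/full_2017_02_28_00.json /merged/full_2017_02_28_00.json
rm -f /tmp/full_2017_02_28_00.json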

Upvotes: 1
