Reputation: 57
In my HDFS folder I am receiving input files continuously. I want to merge the multiple CSV files (all with the same header) from the last 15 minutes into one CSV file with a single header. I tried -getmerge
but it did not work. Any pointers please?
Upvotes: 1
Views: 1122
Reputation: 4674
I am referring to the link below to get the list of files which were processed in the last '5 minutes':
Get the list of files processed in last 5 minutes
Since you want to skip the individual headers and merge all the listed files under a single header, you can first fetch those files to the local Unix filesystem as shown below:
#!/bin/bash
# List HDFS files modified within the last MIN minutes (set MIN=15 for the
# question's window). Fields 6-8 of 'hdfs dfs -ls' are date, time and path.
filenames=$(hdfs dfs -ls /user/vikct001/dev/hadoop/external/csvfiles/part* | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=5; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if (DIFF < LAST) { print $3 } }')
for file in $filenames
do
    # Copy each recent file from HDFS to the local staging directory
    hdfs dfs -get ${file} /home/vikct001/user/vikrant/shellscript/testfiles
done
Once you have the listed files on your local filesystem, you can use the command below to merge all of them with a single header:
awk '(NR == 1) || (FNR > 1)' /home/vikct001/user/vikrant/shellscript/testfiles/part*.csv > bigfile.csv
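This works because NR is awk's global line counter across all input files while FNR resets to 1 for each file, so only the very first line of the first file (one header) matches NR == 1, and every later file contributes only its lines after the header. A minimal sketch to verify the behavior locally, using two hypothetical sample files:

# Hypothetical sample files to demonstrate the header handling
printf 'id,name\n1,foo\n' > part-001.csv
printf 'id,name\n2,bar\n' > part-002.csv
awk '(NR == 1) || (FNR > 1)' part-*.csv
# Output:
# id,name
# 1,foo
# 2,bar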
Here's a link with more details on this: Merge csv with a single header
There are a couple of other commands mentioned in the above link, but I found this one the most suitable.
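If you need the merged result back in HDFS (which is what the question asks for), you can upload the merged file after this step. A minimal sketch, assuming the same local paths as above and a hypothetical target directory /user/vikct001/dev/hadoop/external/merged:

# Merge locally, then upload the single-header file back to HDFS
# (-f overwrites the target if it already exists)
awk '(NR == 1) || (FNR > 1)' /home/vikct001/user/vikrant/shellscript/testfiles/part*.csv > bigfile.csv
hdfs dfs -put -f bigfile.csv /user/vikct001/dev/hadoop/external/merged/bigfile.csv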
Upvotes: 2