Reputation: 357
I'm trying to load gzipped files from a directory on a remote machine onto the HDFS of my local machine. I want to be able to read the gzipped files from the remote machine and pipe them directly into the HDFS on my local machine. This is what I've got on the local machine:
ssh remote-host "cd /files/wanted; tar -cf - *.gz" | tar -xf - | hadoop fs -put - "/files/hadoop"
This apparently copies all of the gzipped files from the specified remote path to the path where I execute the command, and loads an empty file named - into the HDFS. The same thing happens if I try it without tar:
ssh remote-host "cd /files/wanted; cat *.gz" | hadoop fs -put - "/files/hadoop"
Just for shits and giggles to see if I was maybe missing something simple, I tried the following on my local machine:
tar -cf - *.gz | tar -xf - -C tmp
This did what I expected: it took all of the gzipped files in the current directory and put them in the existing directory tmp.
Then with the Hadoop part on the local machine:
cat my_file.gz | hadoop fs -put - "/files/hadoop"
This also did what I expected: it put my gzipped file into /files/hadoop on the HDFS.
Is it not possible to pipe multiple files into HDFS?
Upvotes: 2
Views: 4479
Reputation: 357
For whatever reason, I can't seem to pipe multiple files into HDFS. What I ended up doing was creating a background SSH session so that I don't have to create a new one for every single file I want to load:
ssh -fNn remote-host
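For that background session to actually be reused by the later per-file ssh calls (and for the -O exit at the end to have a control socket to talk to), connection multiplexing has to be enabled. A minimal sketch of a ~/.ssh/config entry, assuming OpenSSH; the ControlPath and ControlPersist values here are just example choices:
# connection multiplexing so later ssh calls reuse the background session
Host remote-host
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist yes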
I then iterate over the list of files I need to load into HDFS and pipe each one in:
for file in /files/wanted/*; do
  # stream each remote file straight into HDFS; basename keeps the
  # destination under /files/hadoop instead of nesting the source path
  ssh -n remote-host "cat $file" | hadoop fs -put - "/files/hadoop/$(basename "$file")"
done
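If /files/wanted only exists on the remote machine, the glob in the loop above won't expand locally. A variant that builds the file list on the remote side could look like this; just a sketch, assuming the filenames contain no spaces:
# expand the glob on the remote host instead of on the local machine
for file in $(ssh -n remote-host 'ls /files/wanted/*.gz'); do
  ssh -n remote-host "cat $file" | hadoop fs -put - "/files/hadoop/$(basename "$file")"
done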
Also make sure to close the SSH session:
ssh -O exit remote-host
Upvotes: 3