Reputation: 1338
I need to write a script to copy content from a remote server to a local machine. There are many ways to do this, but because the servers belong to different clusters (one Kerberized, the other not), my colleagues advised using a simple bash script that streams data from the remote location to the local machine, where it is piped into an hdfs put command to store it on the local cluster.
The code below works for the first iteration: it creates a directory and a correctly named file with the correct contents. However, it stops after the first file. In this test scenario there are three files named a, b and c; file a is created on the destination server with the correct contents, but files b and c are not.
Can anyone help me understand what is going wrong in this script? I think the issue is in my bash syntax rather than in the more complex HDFS infrastructure behind it.
Here is the script:
lastIterationDir=""
ssh -i <private_key> <user>@<source_ip> "hdfs dfs -ls -R <sourceDr>" |
awk ' $0 !~ /^d/ {print $NF}' |
while read fileToTransfer
do
thisIterationDir=$(dirname "$fileToTransfer")
if [ "$thisIterationDir" != "$lastIterationDir" -a "$thisIterationDir" != "/" ]; then
hdfs dfs -mkdir -p "/path/to/save/transfered/files${thisIterationDir}"
fi
ssh -i <private_key> <user>@<source_ip> "hdfs dfs -cat ${fileToTransfer}" |
hdfs dfs -put - "${destinationFolder}${fileToTransfer}"
lastIterationDir=$thisIterationDir
done
Upvotes: 0
Views: 142
Reputation: 1338
@pynexj pointed to a link with the root cause of the problem.
When reading a file line by line, if a command inside the loop also reads stdin, it can exhaust the input file. In this example, the line:
ssh -i <private_key> <user>@<source_ip> "hdfs dfs -cat ${fileToTransfer}" |
hdfs dfs -put - "${destinationFolder}${fileToTransfer}"
was reading from the same stdin that the enclosing while read loop reads from. ssh reads stdin by default, so the first ssh call consumed the remaining file names and the loop ended after one iteration.
The solution is to be explicit about where stdin comes from inside the loop. For ssh, this can be done with the -n switch, which the man page describes as follows:
-n Redirects stdin from /dev/null (actually, prevents reading from stdin). This must be used when ssh is run in the background. A common trick is to use this to run X11 programs on a remote machine. For example, ssh -n shadows.cs.hut.fi emacs & will start an emacs on shadows.cs.hut.fi, and the X11 connection will be automatically forwarded over an encrypted channel. The ssh program will be put in the background. (This does not work if ssh needs to ask for a password or passphrase; see also the -f option.)
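The effect can be reproduced without ssh or HDFS. The sketch below is only an illustration: the function names broken_loop and fixed_loop are made up for this example, and cat stands in for ssh as "a command that reads stdin". In the first loop cat swallows the rest of the list; in the second its stdin is redirected from /dev/null, which is roughly what ssh -n does.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real script: "cat" plays the role of ssh,
# i.e. a command inside the loop that reads from stdin.

broken_loop() {
    printf 'a\nb\nc\n' | while read -r name; do
        echo "processing $name"
        # Inherits the loop's stdin and consumes the remaining lines (b, c).
        cat > /dev/null
    done
}

fixed_loop() {
    printf 'a\nb\nc\n' | while read -r name; do
        echo "processing $name"
        # stdin comes from /dev/null, so the list is left for "read".
        cat < /dev/null > /dev/null
    done
}

echo "without redirect:"
broken_loop
echo "with stdin from /dev/null:"
fixed_loop
```

The first loop reports only a, while the second reports a, b and c.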
The script works with the -n flag added to the inner ssh command, leaving the final script as:
lastIterationDir=""
ssh -i <private_key> <user>@<source_ip> "hdfs dfs -ls -R <sourceDr>" |
awk ' $0 !~ /^d/ {print $NF}' |
while read fileToTransfer
do
thisIterationDir=$(dirname "$fileToTransfer")
if [ "$thisIterationDir" != "$lastIterationDir" -a "$thisIterationDir" != "/" ]; then
hdfs dfs -mkdir -p "/path/to/save/transfered/files${thisIterationDir}"
fi
ssh -i <private_key> <user>@<source_ip> -n "hdfs dfs -cat ${fileToTransfer}" |
hdfs dfs -put - "${destinationFolder}${fileToTransfer}"
lastIterationDir=$thisIterationDir
done
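For commands that have no equivalent of -n, another common approach is to feed the loop from a separate file descriptor so that nothing in the loop body can touch the list. This is a minimal sketch under assumptions, not the script above: the function name fd3_loop, the temporary file and the use of descriptor 3 are all illustrative choices, and cat again stands in for a stdin-reading command.

```shell
#!/usr/bin/env bash
# Hypothetical variant: the list of names is read from file descriptor 3,
# so commands in the loop body can read stdin without eating the list.

fd3_loop() {
    list=$(mktemp)                    # temporary file holding the names
    printf 'a\nb\nc\n' > "$list"
    while IFS= read -r name <&3; do
        echo "processing $name"
        # Reads whatever is on the function's stdin, not the list on fd 3.
        cat > /dev/null
    done 3< "$list"
    rm -f "$list"
}

# Unrelated data on stdin is consumed by cat; all three names are still seen.
echo "unrelated stdin data" | fd3_loop
```

In the real script the same idea would mean writing the remote file listing somewhere the loop can open on fd 3, at the cost of not streaming the listing directly through the pipe.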
Upvotes: 3