Reputation: 800
I want to encrypt and decrypt big files (think 20M lines) of text. The encryption service I am using can only encrypt a maximum of 64KB at a time. For the purposes of this question, assume we are stuck with this service.
My solution is to split the huge file into 64KB chunks, encrypt all of them in parallel, and put the encrypted parts in a tar.gz. Each part is numbered as part-xxx so I can restore the original file. At decryption time I unzip, decrypt each part in parallel and concatenate the results in order.
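For reference, the encryption side looks roughly like this, assuming GNU split; $INPUT is a placeholder for the big file, enc/ is a scratch directory, and the gcloud kms key/keyring/location flags are omitted:
mkdir -p enc
# 64KB chunks with 3-digit numeric suffixes: part-000, part-001, ...
split -b 64k -a 3 --numeric-suffixes "$INPUT" part-
# Encrypt all chunks in parallel (key flags omitted for brevity).
ls -1 -f part-* | xargs -I % -P 32 bash -c "gcloud kms encrypt --plaintext-file % --ciphertext-file enc/%"
tar -czf parts.tar.gz -C enc .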
The fun part: when I run that last step on a big enough file, one of the following happens:
The tmux session dies, and I get logged out. No logs, no nothing.
I get this:
/home/estergiadis/kms/decrypt.sh: line 45: /usr/bin/find: Argument list too long
/home/estergiadis/kms/decrypt.sh: line 46: /bin/rm: Argument list too long
I tried several solutions based on xargs with no luck. Here is the interesting code:
echo "Decrypting chunks in parallel."
# -1 -f in ls helped me go from scenario 1 to scenario 2 above.
# Makes sense since I don't need sorting at this stage.
ls -1 -f part-* | xargs -I % -P 32 bash -c "gcloud kms decrypt --ciphertext-file % --plaintext-file ${OUTPUT}.%"
# Best case scenario, we die here
find $OUTPUT.part-* | xargs cat > $OUTPUT
rm $OUTPUT.part-*
Even more interesting: when find and rm report a problem, I can go to the temp folder with all the parts, run the exact same commands myself, and everything works.
In case it matters, all of this takes place in a RAM-mounted filesystem. However, RAM cannot possibly be the issue: I am on a machine with 256GB of RAM, the files involved take up 1-2GB, and htop never shows more than 10% usage.
Upvotes: 1
Views: 95
Reputation: 16819
Your problem is with these:
ls -1 -f part-* | ...
find $OUTPUT.part-* | ...
rm $OUTPUT.part-*
If you have too many parts (part-*, etc.), the filename expansion done by the shell will result in a command with too many arguments, or you may exceed the maximum command-line length.
find + xargs allows you to overcome this. You can replace any command that uses a glob to list files in the current directory with, for example:
find . -name GLOB -print -o ! -path . -prune | xargs CMD
The -o ! -path . -prune tells find not to descend into subdirectories. xargs ensures the generated command lines do not exceed the maximum argument or line limits.
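If you want to see the limits GNU xargs will respect on your system, it can report them directly (requires GNU findutils; output varies by machine):
xargs --show-limits </dev/null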
So for the three lines you could do:
globwrap(){
    glob="$1"
    shift
    find . -name "$glob" -print -o ! -path . -prune |
        sed 's/^..//' |
        xargs "$@"    # defaults to echo if no command given
}
globwrap 'part-*' | ...
globwrap "$OUTPUT"'.part-*' | ...
globwrap "$OUTPUT"'.part-*' rm
Single-quotes prevent the shell expanding the glob we are passing to find. The sed strips the ./ that would otherwise be prepended to each filename.
Note that the original ls and find are no longer needed in the first two cases.
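One caveat for the cat line: find prints names in directory order, not sorted. Since the parts must be concatenated in order, and the fixed-width part-xxx numbering makes lexicographic order the right one, add a sort:
globwrap "$OUTPUT"'.part-*' | sort | xargs cat > "$OUTPUT"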
Upvotes: 2