Reputation: 704
I have a fairly big text file (around 10GB); it fits into memory without any trouble. My goal is to convert every line to a base64 string. Currently my method takes forever and never seems to complete because it is single-threaded.
while read line; do echo -n "$line" | base64 >> outputfile.txt; done < inputfile.txt
Can someone give me a hint on how to do this faster? This solution produces around 100MB per hour (so the finishing time would be about 100 hours), CPU usage is at 5%, and disk usage is also very low.
It seems I was misunderstood about the control characters... So I have included a sample input file and the output it should produce (chepner was correct about the chomp):
Sample Input:
Банд`Эрос
testè!?£$
``
▒``▒`
Sample Output:
user@monster ~ # head -n 5 bash-script-output.txt
0JHQsNC90LRg0K3RgNC+0YE=
dGVzdMOoIT/CoyQ=
YGA=
4paSYGDilpJg
user@monster ~ # head -n 5 perl-without-chomp.txt
0JHQsNC90LRg0K3RgNC+0YEK
dGVzdMOoIT/CoyQK
YGAK
4paSYGDilpJgCg==
user@monster ~ # head -n 5 perl-chomp.txt
0JHQsNC90LRg0K3RgNC+0YE=
dGVzdMOoIT/CoyQ=
YGA=
4paSYGDilpJg
So samples are always better than human descriptions ;=)
Upvotes: 3
Views: 10473
Reputation: 1
Don't use Perl, or any of the other dynamically typed languages, to process 10GB of text, especially if you are constrained to serial processing, expect the source payload to grow over time, and/or have an SLA around processing time.
If order doesn't matter, then definitely bypass the high-level language approach, because you can process in parallel for free using nothing but the shell and POSIX components.
$ printf "%s\n" one two three
one
two
three
$ printf "%s\n" one two three \
> | xargs \
> -P3 `# three parallel processes` \
> -L1 `# use one line from stdin` \
> -- sh -c 'echo $@ | base64' _
b25lCg==
dHdvCg==
dGhyZWUK
Even if order (as read, as processed, as written) is a constraint, I would still take advantage of the available cores, fan the work out to multiple handlers, and then fan it back in to a single, reducer-like process.
# add line number to each line
$ printf "%s\n" one two three | nl
1 one
2 two
3 three
# base64 encode second column
$ printf "%s\n" one two three \
> | nl \
> | xargs -P3 -L1 sh -c \
> 'echo $2 | base64 | xargs printf "%s %s\n" "$1"' _
2 dHdvCg==
1 b25lCg==
3 dGhyZWUK
# sort based on numeric value of first col
$ printf "%s\n" one two three \
> | nl \
> | xargs -P3 -L1 sh -c \
> 'echo $2 | base64 | xargs printf "%s %s\n" "$1"' _ \
> | sort -k1 -n
1 b25lCg==
2 dHdvCg==
3 dGhyZWUK
All of these approaches will scale to the number of available cores, and all of the heavy lifting, in terms of text processing, is done by ancient C binaries, which will outperform anything else.
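As a rough sketch only, the same pattern applied to the question's inputfile.txt might look like the following (assuming GNU xargs and nproc; as in the toy examples above, only the first whitespace-separated word of each line reaches base64, and echo's trailing newline is still encoded, so treat it as a starting point rather than a drop-in):
# number every line (including blanks), encode in parallel, restore order, drop the numbers
$ nl -ba inputfile.txt \
> | xargs -P"$(nproc)" -L1 sh -c \
> 'echo $2 | base64 | xargs printf "%s %s\n" "$1"' _ \
> | sort -k1 -n \
> | cut -d' ' -f2- > outputfile.txt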
If you are a sadist, do the whole thing in C, but I can promise that the above will outperform anything written in Perl, Python, Ruby, et al. The kernel will manage the buffers between the pipes, which means most of the esoteric, awful work is already done.
Upvotes: 0
Reputation: 531848
It might help a little to open the output file only once:
while IFS= read -r line; do echo -n "$line" | base64; done < inputfile.txt > outputfile.txt
bash is not a good choice here, however, for two reasons: iterating over a file is slow to begin with, and you are starting a new process for each line. A better idea is to use a language that has a library for computing base64 values, so that everything is handled in one process. An example using Perl:
perl -MMIME::Base64 -ne 'print encode_base64($_)' inputfile.txt > outputfile.txt
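If the trailing newline should be stripped before encoding, as in the question's perl-chomp.txt sample, adding a chomp to the same one-liner should produce that output:
perl -MMIME::Base64 -ne 'chomp; print encode_base64($_)' inputfile.txt > outputfile.txt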
Upvotes: 4