Reputation: 286
I want to use a bash script to process one input file into two output files, each containing the same number of lines as the input file but with different parts of each input line. In particular, one of the output files has to contain an MD5 hash of a selection of the input line (the hash is calculated per line, not per file!).
So, given Input_file.txt (3 fields, separated by spaces):
12347654 abcdfg 1verylongalpha1234numeric1
34543673 nvjfur 2verylongalpha1234numeric2
75868643 vbdhde 3verylongalpha1234numeric3
output file_1.txt would have to look like this: (left field is MD5sum, right field is field3 from input file which is also contained in the MD5hash):
12df5j754G75f738fjk3483df3fdf9 1verylongalpha1234numeric1
3jf75j47fh4G84ka9J884hs355jhd8 2verylongalpha1234numeric2
4hf7dn46chG4875ldgkk348fk345d9 3verylongalpha1234numeric3
output file_2.txt would have to look like this: (field1 and field2 from input file + MD5HASH)
12347654 abcdfg 12df5j754G75f738fjk3483df3fdf9
34543673 nvjfur 3jf75j47fh4G84ka9J884hs355jhd8
75868643 vbdhde 4hf7dn46chG4875ldgkk348fk345d9
I already have a script that does the job, but it performs very badly (the script below may not work as written; this is off the top of my head, with no Linux machine at hand where I am writing this, sorry):
#!/bin/bash
while read -r line
do
    # Extract the third field with sed; keep only the hash from md5sum's "<hash>  -" output
    MD5_HASH=$(sed -nr 's/^[[:digit:]]*\s[[:alpha:]]*\s([[:alnum:]]*)/\1/p' <<<"$line" | md5sum | cut -d' ' -f1)
    read -r DATA_PART1 DATA_PART2 DATA_PART3 <<<"$line"
    echo "$MD5_HASH $DATA_PART3" >> file_1.txt ## append to file_1.txt in loop: THIS IS WHERE IT GETS HORRIBLY SLOW!
    echo "$DATA_PART1 $DATA_PART2 $MD5_HASH"
done < input_file.txt > file_2.txt
exit 0
I think that the append redirection '>>' is responsible for the slow performance, but I can't think of another way. It's in the loop because I have to calculate the MD5 hash per line.
(And oh, the sed command is necessary because in reality the part that goes into md5sum can only be captured with a regex and a quite complex pattern.)
Does anyone have a suggestion?
Upvotes: 1
Views: 978
Reputation: 246847
Your bash script can be tidied up a bit. Note that the read command can read the 3 fields into separate variables:
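#!/bin/bash
rm -f file_1.txt file_2.txt
while read -r f1 f2 f3; do
    # md5sum prints "<hash>  -"; keep only the hash.
    # Note: the here-string appends a newline, so the hash covers "$f3" plus "\n".
    hash=$(md5sum <<< "$f3")
    hash=${hash%% *}
    printf "%s %s\n" "$hash" "$f3" >> file_1.txt
    printf "%s %s %s\n" "$f1" "$f2" "$hash" >> file_2.txt
done < input_file.txt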
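One detail worth checking before comparing results between the variants in this thread: a here-string (`<<<`) or a plain `echo` feeds md5sum the field followed by a newline, while `echo -n` or a hash computed on a chomped field covers the bare field only. The two digests differ. A minimal Python sketch of the effect, using "abc" as a made-up stand-in for field 3 (the helper name `md5_hex` is only for illustration):

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the hex MD5 digest of text encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


field = "abc"  # illustrative value, not taken from the input file

bare = md5_hex(field)                 # what `echo -n "$f3" | md5sum` hashes
with_newline = md5_hex(field + "\n")  # what `md5sum <<< "$f3"` hashes

print(bare)                  # 900150983cd24fb0d6963f7d28e17f72
print(bare == with_newline)  # False
```

Whichever convention you pick, use the same one wherever the hashes will be compared later.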
Upvotes: 2
Reputation: 47109
You may be able to increase the efficiency with pipes and parallel.
According to your pseudo-code, you want the md5 sum of the last element:
paste -d ' ' \
input_file.txt \
<(cut -d' ' -f3 input_file.txt | parallel echo '{}' \| md5sum | cut -d' ' -f1) |
awk '{ print $4, $3 > "file_1.txt"; print $1, $2, $4 > "file_2.txt" }'
The md5 sum is calculated in parallel in the process substitution, the output from here is "pasted" onto the original file. Finally awk takes care of placing the output into the correct files.
I agree with redShadow that this will never be very efficient in shell, as you need to sub-shell a lot. Here's an alternative in perl:
split.pl
use Digest::MD5 qw(md5_hex);
use v5.10;
open O1, ">file_1.txt" or die $!; open O2, ">file_2.txt" or die $!;
$, = " ";
while(<>) { chomp;
@F = split / +/;
$md5 = md5_hex $F[2];
say O1 $md5, $F[2];
say O2 @F[0,1], $md5;
}
close O1; close O2;
Run like this:
<input_file.txt perl split.pl
Output in both cases:
file_1.txt
765ac5d0002aed1141a6a4e7b90e4ac9 1verylongalpha1234numeric1
b31901def07d436aed2c8028b2efa4ec 2verylongalpha1234numeric2
0722a6e50f6f8726f9754e7f71f9ad2c 3verylongalpha1234numeric3
file_2.txt
12347654 abcdfg 765ac5d0002aed1141a6a4e7b90e4ac9
34543673 nvjfur b31901def07d436aed2c8028b2efa4ec
75868643 vbdhde 0722a6e50f6f8726f9754e7f71f9ad2c
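If you want to sanity-check the two output files against the input, something along these lines may help. It is only a sketch under the assumptions of the example (three whitespace-separated fields, hash taken over field 3); the function name `check_outputs` is made up for illustration. It accepts the hash of the bare field as well as the hash of the field plus a trailing newline, since the shell and Perl variants may differ on that point.

```python
import hashlib


def check_outputs(input_lines, file1_lines, file2_lines):
    """Return a list of problems found; an empty list means the files agree."""
    if not (len(input_lines) == len(file1_lines) == len(file2_lines)):
        return ["line counts differ"]

    problems = []
    rows = zip(input_lines, file1_lines, file2_lines)
    for n, (src, line1, line2) in enumerate(rows, start=1):
        f1, f2, f3 = src.split()
        # Accept both conventions: bare field, or field followed by "\n".
        accepted = {
            hashlib.md5(f3.encode("utf-8")).hexdigest(),
            hashlib.md5((f3 + "\n").encode("utf-8")).hexdigest(),
        }
        hash1, field = line1.split()
        a, b, hash2 = line2.split()
        if hash1 != hash2:
            problems.append(f"line {n}: hashes differ between the two files")
        if hash1 not in accepted:
            problems.append(f"line {n}: hash does not match field 3")
        if field != f3 or (a, b) != (f1, f2):
            problems.append(f"line {n}: fields do not match the input")
    return problems
```

Call it with the three files read into lists of lines, for example via `open(path).read().splitlines()`.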
Upvotes: 0
Reputation: 95267
You can write both files at the same time from bash, like this:
# function to remove extraneous filename output from md5sum. omit on
# OS X, which has 'md5' command that already works this way.
md5() { set -- $(md5sum "$@"); echo "$1"; }
exec 3>file_1.txt 4>file_2.txt
while read left middle right; do
md5="$(echo -n "$right" | md5)"
echo >&3 "$md5 $right"
echo >&4 "$left $middle $md5"
done <input_file.txt
exec 3>&- 4>&-
That assumes the simple whitespace-separated fields of your example; you would of course still have to do whatever sed magic is required to get the actual target for the MD5 sum.
It won't be very efficient, though. For better performance, you should use something like Perl or Python, which can do both the field extraction you're using sed for and the MD5 calculation all within a single process that is also much faster than the shell at looping over lines of input. Perl example:
perl -MDigest::MD5=md5_hex -lane '
BEGIN { open $f1, ">file_1.txt"; open $f2, ">file_2.txt" }
$md5 = md5_hex $F[2];
print $f1 "$md5 $F[2]";
print $f2 "$F[0] $F[1] $md5";
' input_file.txt
Upvotes: 0
Reputation: 6777
This is one case in which I'd use a fully-featured language, such as Python.
Although you might find a way to do this by using only the standard gnu tools, you'd very likely end up with a solution that will be:
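# First script: prints the file_1.txt content (hash, field 3) to stdout
from hashlib import md5
with open('input.txt', 'r') as infile:
    for l in infile:
        if not l.strip(): continue
        parts = l.strip().split()
        # hashlib needs bytes in Python 3, hence .encode()
        print(md5(parts[2].encode()).hexdigest(), parts[2])

# Second script: prints the file_2.txt content (field 1, field 2, hash) to stdout
from hashlib import md5
with open('input.txt', 'r') as infile:
    for l in infile:
        if not l.strip(): continue
        parts = l.strip().split()
        print(parts[0], parts[1], md5(parts[2].encode()).hexdigest())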
I'm not sure which fields you calculated the checksum on; of course, you can calculate it on whichever field(s) you want, and you could also perform more complex regexp-based matching on the lines. You can also speed things up by writing the two files at once, thus avoiding calculating the MD5 twice.
from hashlib import md5
with open('infile.txt','r') as infile, open('out1.txt','w') as out1, open('out2.txt','w') as out2:
    for l in infile:
        if not l.strip(): continue
        parts = l.strip().split()
        # compute the checksum once and reuse it for both output files
        _checksum = md5(parts[2].encode()).hexdigest()
        out1.write("%s\n" % " ".join([ _checksum, parts[2] ]))
        out2.write("%s\n" % " ".join([ parts[0], parts[1], _checksum ]))
# Variant of the first script that reads from standard input instead of a named file
import sys
from hashlib import md5
for l in sys.stdin:
    if not l.strip(): continue
    parts = l.strip().split()
    print(md5(parts[2].encode()).hexdigest(), parts[2])
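The regexp-based matching mentioned above could look roughly like this in Python 3. This is only a sketch: the pattern is a guess modelled on the sed expression in the question (digits, letters, then an alphanumeric field), and the names `PATTERN`, `split_line` and `process` are made up for illustration. Swap in whatever pattern your real data needs.

```python
import hashlib
import re

# Hypothetical pattern modelled on the sed expression from the question:
# digits, whitespace, letters, whitespace, then the alphanumeric part to hash.
PATTERN = re.compile(r"^(\d+)\s([A-Za-z]+)\s([A-Za-z0-9]+)$")


def split_line(line):
    """Return (file_1 line, file_2 line) for one input line, or None if it does not match."""
    match = PATTERN.match(line.strip())
    if match is None:
        return None
    f1, f2, f3 = match.groups()
    digest = hashlib.md5(f3.encode("utf-8")).hexdigest()  # hash of the bare field, no newline
    return f"{digest} {f3}", f"{f1} {f2} {digest}"


def process(in_path, out1_path, out2_path):
    """Write both output files in a single pass, hashing each line only once."""
    with open(in_path) as infile, \
         open(out1_path, "w") as out1, \
         open(out2_path, "w") as out2:
        for line in infile:
            result = split_line(line)
            if result is None:
                # Skipping changes the line count; handle non-matching lines
                # differently if the outputs must stay aligned with the input.
                continue
            out1.write(result[0] + "\n")
            out2.write(result[1] + "\n")
```

You would run it as `process("input_file.txt", "file_1.txt", "file_2.txt")`. Everything happens in one process, so there is no per-line sed or md5sum to start.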
Upvotes: 1
Reputation: 31
I could not figure out which string you want to compute the MD5 for, so this one-liner computes it on the whole line and writes the processed 'input_file' to 'file1' and 'file2' in the layout you asked for:
awk '{ cmd = "md5 -q -s \"" $0 "\"";
cmd | getline md5;
close(cmd);  # close the pipe so awk does not run out of open file descriptors
print md5" "$3 > "file1";
print $1" "$2" "md5 > "file2" }' input_file
Hope it helps..
Upvotes: 1