Reputation: 286

Bash script writing 2 files with different output elements per input line

I want to use a bash script to process one input file into two output files, each containing the same number of lines as the input file but with different parts of each input line. In particular, one of the output files has to contain an MD5 hash of a selection of the input line (hash calculated per line, not per file!):

input_file.txt (3 fields, separated by spaces):

12347654 abcdfg 1verylongalpha1234numeric1

34543673 nvjfur 2verylongalpha1234numeric2

75868643 vbdhde 3verylongalpha1234numeric3

output file_1.txt would have to look like this: (left field is MD5sum, right field is field3 from input file which is also contained in the MD5hash):

12df5j754G75f738fjk3483df3fdf9 1verylongalpha1234numeric1

3jf75j47fh4G84ka9J884hs355jhd8 2verylongalpha1234numeric2

4hf7dn46chG4875ldgkk348fk345d9 3verylongalpha1234numeric3

output file_2.txt would have to look like this: (field1 and field2 from input file + MD5HASH)

12347654 abcdfg 12df5j754G75f738fjk3483df3fdf9

34543673 nvjfur 3jf75j47fh4G84ka9J884hs355jhd8

75868643 vbdhde 4hf7dn46chG4875ldgkk348fk345d9

I already have a script that does the job, but it performs very badly (the script below may not work as-is; it is written off the top of my head and I have no Linux machine where I write this, sorry):

#!/bin/bash

while read -r line
do
    # extract the part to hash with sed, then keep only the digest from md5sum's output
    MD5_HASH=$(sed -nr 's/^[[:digit:]]*\s[[:alpha:]]*\s([[:alnum:]]*)/\1/p' <<<"$line" | md5sum | cut -d' ' -f1)
    read -r DATA_PART1 DATA_PART2 DATA_PART3 <<<"$line"

    echo "$MD5_HASH $DATA_PART3" >> file_1.txt    ## append to file_1.txt in loop: THIS IS WHERE IT GETS HORRIBLY SLOW!

    echo "$DATA_PART1 $DATA_PART2 $MD5_HASH"
done < input_file.txt > file_2.txt

exit 0

I think that the append redirection ('>>') inside the loop is responsible for the slow performance, but I can't think of another way. It's in the loop because I have to calculate the MD5 hash per line.

(and oh, the sed command is necessary because in reality the part that goes into the MD5SUM can only be captured with regex and a quite complex pattern)

Does anyone have a suggestion?

Upvotes: 1

Answers (5)

glenn jackman

Reputation: 246847

Your bash script can be tidied up a bit. Note that the read command can read the 3 fields into separate variables:

#!/bin/bash
rm -f file_1.txt file_2.txt    
while read -r f1 f2 f3; do
    hash=$(md5sum <<< "$f3")
    hash=${hash%% *}    # md5sum prints "<digest>  -"; keep only the digest
    printf "%s %s\n" "$hash" "$f3" >> file_1.txt
    printf "%s %s %s\n" "$f1" "$f2" "$hash" >> file_2.txt
done < input_file.txt
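
One detail worth checking when comparing results between tools: a here-string (`<<<`) appends a newline, so the loop above hashes the field plus "\n", while a tool that hashes the bare string produces a different digest. A small Python sketch of the difference, using "abc" as a made-up sample value rather than data from the question:

```python
from hashlib import md5

field = "abc"  # hypothetical sample value, not taken from the question

# What `md5sum <<< "$f3"` or `echo "$f3" | md5sum` hashes: the field plus a newline.
with_newline = md5((field + "\n").encode()).hexdigest()

# What `printf %s "$f3" | md5sum` or `echo -n "$f3" | md5sum` hashes: the field only.
without_newline = md5(field.encode()).hexdigest()

print(with_newline)
print(without_newline)
```

The two digests differ, so whichever convention you pick, use the same one everywhere the hash is computed or verified.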

Upvotes: 2

Thor

Reputation: 47109

You may be able to increase the efficiency with pipes and parallel.

According to your pseudo-code, you want the md5 sum of the last element:

paste -d ' '     \
  input_file.txt \
  <(cut -d' ' -f3 input_file.txt | parallel -k echo '{}' \| md5sum | cut -d' ' -f1) |
  awk '{ print $4, $3 > "file_1.txt"; print $1, $2, $4 > "file_2.txt" }'

Explanation

The MD5 sums are calculated in parallel inside the process substitution (-k makes parallel keep its output in input order, so the hashes line up with the original lines), and the result is "pasted" onto the original file as a fourth column. Finally, awk writes the columns to the correct files.

Edit

I agree with redShadow that this will never be very efficient in shell, as you need to sub-shell a lot. Here's an alternative in perl:

split.pl

use Digest::MD5 qw(md5_hex);
use v5.10;

open O1, ">file_1.txt" or die $!; open O2, ">file_2.txt" or die $!;

$, = " ";

while(<>) { chomp; 
  @F = split / +/;
  $md5 = md5_hex $F[2];
  say O1 $md5, $F[2];
  say O2 @F[0,1], $md5;
}
close O1; close O2;

Run like this:

<input_file.txt perl split.pl

Output in both cases:

file_1.txt

765ac5d0002aed1141a6a4e7b90e4ac9 1verylongalpha1234numeric1
b31901def07d436aed2c8028b2efa4ec 2verylongalpha1234numeric2
0722a6e50f6f8726f9754e7f71f9ad2c 3verylongalpha1234numeric3

file_2.txt

12347654 abcdfg 765ac5d0002aed1141a6a4e7b90e4ac9
34543673 nvjfur b31901def07d436aed2c8028b2efa4ec
75868643 vbdhde 0722a6e50f6f8726f9754e7f71f9ad2c

Upvotes: 0

Mark Reed

Reputation: 95267

You can write both files at the same time from bash, like this:

# function to remove extraneous filename output from md5sum.  omit on
# OS X, which has 'md5' command that already works this way.
md5() { set -- $(md5sum "$@"); echo "$1"; }

exec 3>file_1.txt 4>file_2.txt
while read left middle right; do
  md5="$(echo -n "$right" | md5)"
  echo >&3 "$md5 $right"
  echo >&4 "$left $middle $md5"
done <input_file.txt
exec 3>&- 4>&-

That assumes the simple whitespace-separated fields of your example; you would of course still have to do whatever sed magic is required to get the actual target for the MD5 sum.

It won't be very efficient, though. For better performance, you should use something like Perl or Python, which can do both the field extraction you're using sed for and the MD5 calculation all within a single process that is also much faster than the shell at looping over lines of input. Perl example:

perl -MDigest::MD5=md5_hex -lane '
  BEGIN { open $f1, ">file_1.txt"; open $f2, ">file_2.txt" }
  $md5 = md5_hex $F[2];
  print $f1 "$md5 $F[2]";
  print $f2 "$F[0] $F[1] $md5";
' input_file.txt

Upvotes: 0

redShadow

Reputation: 6777

This is one case in which I'd use a fully-featured language, such as Python.

Although you might find a way to do this using only the standard GNU tools, you'd very likely end up with a solution that is:

- very complex, hard to read and maintain
- inefficient, as the tools don't provide a straightforward way to do this.

1. Creating the first file in Python

from hashlib import md5
with open('input_file.txt', 'r') as infile:
    for l in infile:
        if not l.strip(): continue
        parts = l.strip().split()
        # md5() needs bytes on Python 3, so encode the field first
        print(md5(parts[2].encode()).hexdigest(), parts[2])

2. Creating the second file in Python

from hashlib import md5
with open('input_file.txt', 'r') as infile:
    for l in infile:
        if not l.strip(): continue
        parts = l.strip().split()
        # md5() needs bytes on Python 3, so encode the field first
        print(parts[0], parts[1], md5(parts[2].encode()).hexdigest())

I'm not sure which fields you calculate the checksum on, but you can of course calculate it on whichever field(s) you want. You could also perform more complex regexp-based matching on the lines, and you can speed things up by writing the two files at once, which avoids calculating the MD5 twice.

3. Creating the two files at once

from hashlib import md5
with open('input_file.txt', 'r') as infile, open('file_1.txt', 'w') as out1, open('file_2.txt', 'w') as out2:
    for l in infile:
        if not l.strip(): continue
        parts = l.strip().split()
        # md5() needs bytes on Python 3, so encode the field first
        _checksum = md5(parts[2].encode()).hexdigest()
        out1.write("%s\n" % " ".join([ _checksum, parts[2] ]))
        out2.write("%s\n" % " ".join([ parts[0], parts[1], _checksum ]))

4. Same as #1 but reading from standard input

import sys
from hashlib import md5
for l in sys.stdin:
    if not l.strip(): continue
    parts = l.strip().split()
    # md5() needs bytes on Python 3, so encode the field first
    print(md5(parts[2].encode()).hexdigest(), parts[2])
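
5. Regexp-based extraction (sketch)

Since the question says the part to hash can only be captured with a fairly complex regex, here is a minimal sketch of how that could replace the plain split() while still writing both files in one pass. The pattern LINE_RE and the function name split_file are assumptions made up for illustration; the pattern simply mirrors the three-field sample data and would need to be swapped for the real expression.

```python
import re
from hashlib import md5

# Hypothetical pattern modelled on the sample data: digits, letters,
# then the alphanumeric part to hash. Replace with the real expression.
LINE_RE = re.compile(r"^(\d+)\s+([A-Za-z]+)\s+([A-Za-z0-9]+)\s*$")

def split_file(in_path, out1_path, out2_path, pattern=LINE_RE):
    """Write '<md5> <field3>' to out1 and '<field1> <field2> <md5>' to out2."""
    with open(in_path) as infile, \
         open(out1_path, "w") as out1, \
         open(out2_path, "w") as out2:
        for line in infile:
            match = pattern.match(line)
            if match is None:
                continue  # blank or non-matching lines are skipped
            first, second, target = match.groups()
            # Hash the captured text only (no trailing newline).
            digest = md5(target.encode("utf-8")).hexdigest()
            out1.write("%s %s\n" % (digest, target))
            out2.write("%s %s %s\n" % (first, second, digest))
```

It could be called as split_file("input_file.txt", "file_1.txt", "file_2.txt"). Note that skipping non-matching lines means the output files can end up shorter than the input; if the line counts must stay equal, write a placeholder line instead of using continue.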

Upvotes: 1

Catalin Popescu

Reputation: 31

I could not figure out which string you want to compute the MD5 for, so this one-liner (which uses the BSD/OS X md5 command) hashes the whole line and writes the processed 'input_file' to 'file1' and 'file2' in the layout you asked for:

awk '{ cmd = "md5 -q -s \"" $0 "\"";
     cmd | getline md5; close(cmd);   # close the pipe so awk does not run out of open files
     print md5" "$3 > "file1";
     print $1" "$2" "md5 > "file2" }' input_file

Hope it helps..

Upvotes: 1
