Koushik Chandra

Reputation: 1491

Unix Performance Improvement - maybe using AWK

I have two files: File1.txt (6 columns separated by pipes) and File2.txt (2 columns separated by pipes).

File1.txt

NEW|abcd|1234|10000000|Hello|New_value|
NEW|abcd|1234|20000000|Hello|New_value|
NEW|xyzq|5678|30000000|myname|New_Value|

File2.txt

10000000|10000001>10000002>10000003>10000004
19000000|10000000>10000001>10000002>10000003>10000004
17000000|10000099>10000000>10000001>10000002>10000003>10000004
20000000|10000001>10000002>10000003>10000004>30000000
29000000|20000000>10000001>10000002>10000003>10000004

The goal: for each row in File1.txt, take the 4th column and search for that value anywhere in File2.txt. For every row of File2.txt that contains it, output the File1.txt row with its 4th column replaced by the first column of that matching File2.txt row.

This can produce far more records in the target file than in File1.txt. The output should look like the following (the last column, 123, comes from a fixed variable):

NEW|abcd|1234|10000000|Hello|New_value|123    (row 1 of File1.txt, 4th column matched row 1 of File2.txt)
NEW|abcd|1234|19000000|Hello|New_value|123    (row 1 of File1.txt, 4th column matched row 2 of File2.txt)
NEW|abcd|1234|17000000|Hello|New_value|123    (row 1 of File1.txt, 4th column matched row 3 of File2.txt)
NEW|abcd|1234|20000000|Hello|New_value|123    (row 2 of File1.txt, 4th column matched row 4 of File2.txt)
NEW|abcd|1234|29000000|Hello|New_value|123    (row 2 of File1.txt, 4th column matched row 5 of File2.txt)
NEW|xyzq|5678|20000000|myname|New_Value|123   (row 3 of File1.txt, 4th column matched row 4 of File2.txt)

I wrote the solution below, and it produces the correct output, but it takes around 21 minutes when File1.txt and File2.txt each have approximately 150K rows. The final target file ends up with over 10 million rows.

VAL1=123

# One full scan of File2.txt per row of File1.txt
for ROW in `cat File1.txt`
do
  Fld1=`echo "$ROW" | cut -d'|' -f1-3`    # columns 1-3
  Fld2=`echo "$ROW" | cut -d'|' -f4`      # the lookup key
  Fld3=`echo "$ROW" | cut -d'|' -f5-6`    # columns 5-6

  # keep only the first column of every File2.txt row containing the key
  grep -i "$Fld2" File2.txt | cut -d'|' -f1 > File3.txt
  # rebuild each output row around the substituted 4th column
  sed "s/^/${Fld1}|/" File3.txt | sed "s/\$/|${Fld3}|${VAL1}/" >> Target.txt
done

My question is: can this solution be optimized? Can it be rewritten using AWK, or in some other way, to run faster?

Upvotes: 0

Views: 160

Answers (3)

jas

Reputation: 10865

I'm pretty sure this will be faster, since the implicit loop of a single awk process is generally (if not always) faster than invoking awk or sed over and over from a shell loop, but you'll have to try it and let us know:

EDIT: This version should take care of the problem with duplicates in the output

$ cat a.awk
# First file (File2.txt, where FS splits on both "|" and ">"): map every
# value on the line to the line's first column, comma-separated
NR == FNR {
    for (i=1; i<=NF; ++i) {
        if ($i in a)
            a[$i] = a[$i] "," $1
        else
            a[$i] = $1
    }
    next
}

# Second file (File1.txt): if the 4th column was seen anywhere in File2.txt,
# print one row per collected first column, skipping duplicates
$4 in a {
    split(a[$4], b, ",")
    for (i in b) {
        if (!(b[i] in seen)) {
            print $1, $2, $3, b[i], $5, $6, new_value
            seen[b[i]]      # the bare reference creates the entry
        }
    }
    delete seen             # reset the duplicate guard for the next row
}

The output contains the desired rows, although the order is different:

$ awk -v new_value=123 -v OFS="|" -f a.awk FS='[|>]' file2.txt FS='|' file1.txt 
NEW|abcd|1234|19000000|Hello|New_value|123
NEW|abcd|1234|17000000|Hello|New_value|123
NEW|abcd|1234|10000000|Hello|New_value|123
NEW|abcd|1234|29000000|Hello|New_value|123
NEW|abcd|1234|20000000|Hello|New_value|123
NEW|xyzq|5678|20000000|myname|New_Value|123

Upvotes: 1

oliv

Reputation: 13249

A slightly more optimized GNU awk (gawk) script:

awk -v val=123 'NR==FNR{a[$4]=$0; next}    # index file1 rows by their 4th column
     {
        for(i=1; i<=NF; i++){              # every field of file2 ("|" and ">" both split)
          if($i in a)                      # swap in the 4th field, then append the fixed value
            print gensub("[^|]+\\|", $1 "|", 4, a[$i]) val
        }
     }' FS='|' file1 FS='[|>]' file2

The first block fills the array a with the rows of file1, keyed by their 4th column.

The second block iterates over every field of file2 (both | and > act as field separators) and, whenever a field matches one of those keys, prints the stored file1 row.

The printed string is modified with gawk's gensub() function: its third argument restricts the substitution to the 4th occurrence of the pattern, i.e. the 4th |-terminated field, and the fixed value 123 is concatenated after the row's trailing |.
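
As a quick illustration of that occurrence argument (gawk only, since gensub is a GNU extension), here is a made-up example replacing only the 4th |-terminated field with a placeholder XXXX:

$ echo 'NEW|abcd|1234|10000000|Hello|New_value|' |
  awk '{ print gensub("[^|]+\\|", "XXXX|", 4) }'
NEW|abcd|1234|XXXX|Hello|New_value|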

Upvotes: 0

jmdeldin

Reputation: 5404

I would guess your performance slowdowns come from reading the files repeatedly: File2.txt for grep and File3.txt for each of the two seds, on every single iteration. If you can store the contents of File2 in memory (or even in a temporary SQLite DB), that should speed things up; you would then process File1 line by line and just do a simple lookup for the File2 keys, as in the rough sketch below.
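
A rough awk sketch of that in-memory lookup, assuming the filenames and the fixed value from the question (it is essentially the same approach the awk answers above take):

$ awk -v val=123 '
    NR == FNR {                        # first pass: File2.txt
        n = split($0, f, /[|>]/)       # f[1] is the first column
        for (i = 1; i <= n; i++) {     # map every value -> first column(s)
            if (f[i] in firsts)
                firsts[f[i]] = firsts[f[i]] "," f[1]
            else
                firsts[f[i]] = f[1]
        }
        next
    }
    $4 in firsts {                     # second pass: File1.txt, one lookup per row
        n = split(firsts[$4], out, ",")
        for (i = 1; i <= n; i++)
            print $1 "|" $2 "|" $3 "|" out[i] "|" $5 "|" $6 "|" val
    }
' File2.txt FS='|' File1.txt > Target.txt

Reading each file exactly once like this replaces roughly 150K grep/sed invocations (and as many full scans of File2.txt) with a single process.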

It would be helpful to run htop or some activity monitor when running your script to track RAM and CPU usage.

Upvotes: 0
