Philip Francis

Reputation: 3

awk to match string from file against another file and get previous and next 2 lines

I am trying to match strings from one file against another file and fetch each matched line along with the previous line and the next 2 lines.

I could do this with grep on a chunk of the file, but it throws "memory exhausted" on the original (200M lines of keys and a 2TB input source file).

grep --no-group-separator -A 2 -B 1 -f key source

sample key file

^CNACCCAAGGCTCATT  
^ANAGCGGCAACTCGCG  

I added the "^" to each line since the key is the first 16 characters of the line that follows the one starting with '@'.

Each pattern is a random 16-character string over the characters A, T, G, C and N. A pattern may match multiple times in the source file.

sample source file to search against

@A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC  
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC  
+  
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF  
@A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC  
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA  
+  
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF  
@A00354:427:HVYWLDSXY:1:1101:1271:1000 1:N:0:ATTACTTC  
CNATCCCGTCTCGAGCCCGCCCCAATAGCAACAACAACAACAACAACAACAACAACAGCAACAACACCAGCAACACCAGCAACAACAGCAACAACAACAACAGCAACAACAACAACAACAACAACAACAACAACAACAACAACAACAAGA  
+  
F#FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  
@A00354:427:HVYWLDSXY:1:1101:1325:1000 1:N:0:ATTACTTC  
TNCGGTTCATAGGAATGTAGTCTTTGTAATTATGCGCAATTTCCAAACACTTCAAGGTTTTTTTGCAAATAAAACATTCAGGCCTCGTGTGTGCCGCTGCATCTTAGATCCAACGGCTCCTAGTTGCTCATATTCNACCCAAGGCTCATTAGGTGCTCCCCGTAGC  
+  
:#FFF:F,FFFFFFFFFFFF,:FFF::F,FFF,F:FFFFFFF:FFFF:FF:F:FFF:F:F:FFFFFFFF,FF,F:FF:FF::F,FFF:FFFFFF,:F::FFFFFFF:FF:FFFFF,FFFFFF,FFF:FFFFFFFFF,FFFF:FFFFFFF:  

Even if I split the key file, it's painfully slow.

Can it be done more efficiently using a perl one-liner or awk?

The expected output would be

@A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC  
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC  
+  
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF  
@A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC  
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA  
+  
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF

I saw code like

awk 'NR==FNR{a[$1]; next} {for (i in a) if (index($0, i)) print $1}' key source

which checks if each entry in key is a substring of the source, but I couldn't work out how to make it check for a pattern (^CNACCCAAGGCTCATT) and fetch the previous and next lines.

Another approach I tried but couldn't work out was: zcat key | match each line against source file > output

*Maybe the slowness is because of my code; any alternative is much appreciated.

Upvotes: 0

Views: 801

Answers (1)

Ed Morton

Reputation: 203684

for (i in a) if (index($0, i)) would be immensely slow because you're looping over every key (200M of them, going by your question) for every line of your "search" file, so the total number of loop iterations is the number of keys times the number of lines in the 2TB file. It'd also produce incorrect output, as index($0, i) would find the target key anywhere on the search line rather than only at the start; it would have to be index($0, i) == 1 to only match at the start.
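
As a quick check of that index() behavior, here's a minimal sketch (the helper name where_is_key is made up for illustration, it isn't part of any script above) that prints where a key is found in a line and whether that is at the start:

```shell
# Hypothetical helper: report the position at which index() finds
# key $2 in line $1, and whether that position is the start of the line.
where_is_key() {
    printf '%s\n' "$1" |
    awk -v k="$2" '{ pos = index($0, k); print pos, (pos == 1 ? "at-start" : "not-at-start") }'
}

where_is_key 'CNACCCAAGGCTCATTCATTATATAG' 'CNACCCAAGGCTCATT'       # prints: 1 at-start
where_is_key 'TTGCTCATATTCNACCCAAGGCTCATTAGG' 'CNACCCAAGGCTCATT'   # prints: 12 not-at-start
```

The second input is a shortened piece of the last record in your sample, where the key sits mid-line, so a bare index($0, i) test would wrongly treat it as a match.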

This is how to do it in awk after removing all those ^s from the start of your "key" file lines. We're going to do an efficient hash lookup with strings, not a slow regexp comparison as would be required with grep, and we're going to do 1 hash lookup per line of "source" instead of one string comparison per key per line as in the awk script in your question:

$ cat tst.awk
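NR==FNR { tgts[$1]; next }                      # 1st file: store each key as an array index
c && !(--c) { print p3 ORS p2 ORS p1 ORS $0 }   # 2 lines after a match: print the saved 3 lines plus this one
{ key=substr($0,1,16); p3=p2; p2=p1; p1=$0 }    # remember the last 3 lines and this line's first 16 chars
key in tgts { c=2 }                             # hash lookup; on a hit, start a 2-line countdown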

$ awk -f tst.awk key source
@A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC
+
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF
@A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF
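
The run above assumes the keys no longer carry the leading ^. One way to strip it, as a minimal sketch (strip_caret and key.stripped are illustrative names only):

```shell
# Hypothetical helper: remove one leading "^" from each line of the key file $1.
strip_caret() {
    sed 's/^\^//' "$1"
}

# Possible usage, assuming the file names from the question:
#   strip_caret key > key.stripped
#   awk -f tst.awk key.stripped source
```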

See printing-with-sed-or-awk-a-line-following-a-matching-pattern/17914105#17914105 for more information on what c=2 and c && !(--c) are doing. In short, c=2 sets a count of lines still to be read, and c && !(--c) becomes true (and so executes the associated action of printing the saved lines) when that count reaches zero again.
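
To watch that countdown on its own, here's a small sketch on six numbered lines where line 2 plays the role of the matching line (countdown_demo is just an illustrative name):

```shell
# Hypothetical demo: line "2" acts as the match; the first rule fires
# exactly 2 lines later, when the counter c drops back to zero.
countdown_demo() {
    printf '%s\n' 1 2 3 4 5 6 |
    awk '
        c && !(--c) { print "fired on line " $0 }   # true only when the count reaches zero
        $0 == 2     { c = 2 }                       # "match": count 2 more lines
    '
}

countdown_demo   # prints: fired on line 4
```

On line 3 the counter goes from 2 to 1, so nothing happens; on line 4 it goes from 1 to 0 and the action runs once. After that c is 0, so the condition stays false until the next match.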

If that exceeds available memory, let us know as another approach can look something like the following pseudo-code (I am not suggesting you do this in shell!):

sort keys
sort source by the 2nd line of each 4-line record, keeping each record's lines together
read tgt from keys
while read record from source; do
    key = substr(2nd line of record,1,16)
    while key > tgt; do
        read next tgt from keys (stop when keys runs out)
    done
    if key == tgt; then
        print record
    fi
done

So the idea is you don't read the next value from "key" until the current value from "source" is bigger than the one you were using. That would reduce memory usage to close to zero, but it does require both input files to be sorted.
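
A rough, runnable sketch of that sorted-merge idea follows. It rests on several assumptions that aren't in your question: every record is exactly 4 lines, no line contains a tab, the key file has had its ^s removed and been sorted with LC_ALL=C sort -u, and the function names linearize_sorted and merge_join are invented for illustration. Each record is first flattened to one tab-separated line prefixed with its 16-character key so that sort can keep the record together:

```shell
# Hypothetical helper: flatten each 4-line record of file $1 to
# "key<TAB>header<TAB>seq<TAB>plus<TAB>qual" and sort by key.
linearize_sorted() {
    tab=$(printf '\t')
    paste - - - - < "$1" |
    awk -F'\t' '{ print substr($2, 1, 16) "\t" $0 }' |
    LC_ALL=C sort -t "$tab" -k1,1
}

# Hypothetical helper: $1 = sorted keys, $2 = output of linearize_sorted.
# Only one key and one record are held in memory at a time.
merge_join() {
    awk -F'\t' -v keys="$1" '
        BEGIN { more = ((getline tgt < keys) > 0) }
        {
            # advance through the keys until tgt is no longer smaller than this record key
            while (more && (tgt "") < ($1 ""))
                more = ((getline tgt < keys) > 0)
            if (!more) exit                      # keys exhausted: nothing else can match
            if (tgt == $1) print $2 ORS $3 ORS $4 ORS $5
        }
    ' "$2"
}
```

Usage would be along the lines of linearize_sorted source > recs.sorted followed by merge_join keys.sorted recs.sorted. Note that the output comes out in key order rather than the original file order, and that sorting a 2TB file needs plenty of temporary disk space even though memory use stays small.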

Upvotes: 1
