Reputation: 3
I am trying to match strings from one file against another file and fetch each matched line along with the previous line and the next 2 lines.
I could do this with grep for a chunk of the file, but it throws "memory exhausted" on the original (200M lines of keys and a 2TB input source file).
grep --no-group-separator -A 2 -B 1 -f key source
sample key file
^CNACCCAAGGCTCATT
^ANAGCGGCAACTCGCG
I added the "^" to each line since the key is the first 16 characters of the line after the one starting with '@'.
The keys are random 16-character strings over the characters ATGCN. A key may match multiple times in the source file.
sample source file to search against
@A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC
+
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF
@A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF
@A00354:427:HVYWLDSXY:1:1101:1271:1000 1:N:0:ATTACTTC
CNATCCCGTCTCGAGCCCGCCCCAATAGCAACAACAACAACAACAACAACAACAACAGCAACAACACCAGCAACACCAGCAACAACAGCAACAACAACAACAGCAACAACAACAACAACAACAACAACAACAACAACAACAACAACAAGA
+
F#FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00354:427:HVYWLDSXY:1:1101:1325:1000 1:N:0:ATTACTTC
TNCGGTTCATAGGAATGTAGTCTTTGTAATTATGCGCAATTTCCAAACACTTCAAGGTTTTTTTGCAAATAAAACATTCAGGCCTCGTGTGTGCCGCTGCATCTTAGATCCAACGGCTCCTAGTTGCTCATATTCNACCCAAGGCTCATTAGGTGCTCCCCGTAGC
+
:#FFF:F,FFFFFFFFFFFF,:FFF::F,FFF,F:FFFFFFF:FFFF:FF:F:FFF:F:F:FFFFFFFF,FF,F:FF:FF::F,FFF:FFFFFF,:F::FFFFFFF:FF:FFFFF,FFFFFF,FFF:FFFFFFFFF,FFFF:FFFFFFF:
Even if I split the key file, it's painstakingly slow.
Can it be done more efficiently using a perl one-liner or awk?
The expected output would be
@A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC
+
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF
@A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF
I saw code like
awk 'NR==FNR{a[$1]; next} {for (i in a) if (index($0, i)) print $1}' key source
which checks whether each entry in key is a substring of a source line, but I couldn't work out how to make it check for a pattern (^CNACCCAAGGCTCATT) and fetch the previous and next lines.
Another approach I tried but couldn't get working was: zcat key | match each line against source file > output
*Maybe the slowness is because of my code; any alternative is much appreciated.
Upvotes: 0
Views: 801
Reputation: 203684
for (i in a) if (index($0, i)) would be immensely slow because you're looping over every key for each line of your "source" file (200M keys, going by your question, times every line of a 2TB file!), and it would produce incorrect output because index($0, i) finds the target key anywhere on the line rather than only at the start; it would have to be index($0, i) == 1 to match only at the start.
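To illustrate the difference between matching anywhere and matching only at the start, here is a minimal sketch. The helper name starts_with is hypothetical, and the two input lines are shortened stand-ins for reads in the question (the second has the key in the middle, as the fourth sample record does). It tests a single key, so it only shows the anchoring and does nothing for the speed problem:

```shell
# starts_with KEY: print only the input lines that begin with KEY.
# (starts_with is a hypothetical helper name used for illustration.)
starts_with() {
    # index() returns the 1-based position of KEY in the line, or 0 if absent,
    # so "== 1" keeps only lines where KEY sits at the very start.
    awk -v key="$1" 'index($0, key) == 1'
}

printf '%s\n' \
    'CNACCCAAGGCTCATTCATT' \
    'TTGCTCATATTCNACCCAAGGCTCATTAGG' |
    starts_with CNACCCAAGGCTCATT
```

Only the first line is printed; a bare index($0, key) test would have accepted both.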
This is how to do it in awk after removing all those ^s from the start of your "key" file lines, since we're going to do an efficient hash lookup with strings rather than a slow regexp comparison as would be required with grep, and we're going to do 1 hash lookup per line of "source" instead of 200M string comparisons as in the awk script in your question:
$ cat tst.awk
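# First file (keys, leading ^ removed): store each key as an array index.
NR==FNR { tgts[$1]; next }
# Two lines after a matching line, print the 3 saved lines plus this one.
c && !(--c) { print p3 ORS p2 ORS p1 ORS $0 }
# Take the first 16 characters as the key and remember the last 3 lines.
{ key=substr($0,1,16); p3=p2; p2=p1; p1=$0 }
# A hash hit on this line starts the 2-line countdown.
key in tgts { c=2 }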
$ awk -f tst.awk key source
@A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC
+
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF
@A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF
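The run above assumes the leading ^ has already been removed from every line of the key file. A minimal sketch of that step, assuming one key per line (strip_caret is a hypothetical helper name):

```shell
# strip_caret [FILE...]: print the input with one leading "^" removed per line.
# Reads standard input when no FILE is given. Lines without a "^" pass through.
# (strip_caret is a hypothetical helper name used for illustration.)
strip_caret() {
    sed 's/^\^//' "$@"
}

printf '%s\n' '^CNACCCAAGGCTCATT' '^ANAGCGGCAACTCGCG' | strip_caret
```

For the real data that could be run as strip_caret key > key.plain, followed by awk -f tst.awk key.plain source, where key.plain is just an illustrative file name.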
See printing-with-sed-or-awk-a-line-following-a-matching-pattern/17914105#17914105 for more information on what c=2 and c && !(--c) are doing, but in short: c=2 sets a count for a number of lines, and c && !(--c) becomes true (and so executes the associated action of printing the saved lines) when that count reaches zero again.
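The countdown idiom can be tried in isolation. This is a minimal sketch with made-up input lines and a hypothetical helper name (second_after); it keeps the same rule order as tst.awk, with the countdown test before the rule that sets c:

```shell
# second_after PATTERN: print the line that comes 2 lines after each match.
# (second_after is a hypothetical helper name used for illustration.)
second_after() {
    awk -v pat="$1" '
        c && !(--c) { print }    # count was running and has just reached zero
        $0 ~ pat    { c = 2 }    # a match starts (or restarts) the countdown
    '
}

printf '%s\n' header MATCH plus quality next | second_after MATCH
```

This prints only "quality": the count is set to 2 on the MATCH line, drops to 1 on "plus", and reaches 0 on "quality", which is when the condition becomes true.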
If the awk script above exceeds available memory (it stores every key in an array), let us know, as another approach can look something like the following pseudo-code (I am not suggesting you do this in shell!):
sort keys
sort source by its sequence line (the 2nd of each 4-line record), keeping each record's lines together
while !done; do
    read tgt < keys
    while read source_line; do
        key = substr(source_line,1,16)
        if key == tgt; then
            print source_line+context
        else if key > tgt; then
            break
        fi
    done < source
done
So the idea is that you don't read the next value from "key" until the current value from "source" is bigger than the one you were using. That would reduce memory usage to close to zero, but it does require both input files to be sorted.
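One way the sorted-merge idea could be sketched with standard tools is shown below. This is an illustration under assumptions, not a tested solution for a 2TB file: the function name merge_lookup and its temp-file handling are hypothetical, every record is assumed to be exactly 4 lines with no tab characters, and sort will need temporary disk space on the order of the input size. Records come out in key order rather than in their original order.

```shell
# merge_lookup KEYS SOURCE: print each 4-line record of SOURCE whose sequence
# line starts with a key from KEYS, holding only one key in memory at a time.
# (merge_lookup is a hypothetical helper name used for illustration.)
merge_lookup() {
    keys_file=$1
    source_file=$2
    tab=$(printf '\t')
    sorted_keys=$(mktemp) || return 1

    # Drop any leading "^", then sort the keys bytewise and remove duplicates.
    sed 's/^\^//' "$keys_file" | LC_ALL=C sort -u > "$sorted_keys"

    # paste: flatten each 4-line record into one tab-separated line.
    # awk:   prepend the first 16 characters of the sequence as a sort key.
    # sort:  order the records bytewise by that key.
    # awk:   walk the sorted keys and sorted records in step.
    paste - - - - < "$source_file" |
    awk -F '\t' '{ print substr($2, 1, 16) "\t" $0 }' |
    LC_ALL=C sort -t "$tab" -k1,1 |
    LC_ALL=C awk -F '\t' -v keys="$sorted_keys" '
        BEGIN { more = ((getline tgt < keys) > 0) }
        {
            # Read the next key only while the current one is smaller than
            # this record key; appending "" forces string comparison.
            while (more && (tgt "") < ($1 ""))
                more = ((getline tgt < keys) > 0)
            if (!more) exit                       # no keys left to match
            if ((tgt "") == ($1 "")) print $2 ORS $3 ORS $4 ORS $5
        }'
    rm -f "$sorted_keys"
}
```

It would be called as merge_lookup key source. Both sorts run under LC_ALL=C so that the ordering used by sort agrees with the string comparison done in awk.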
Upvotes: 1