Reputation: 73
I want to search for the contents of one file in another file and print each matching line together with the line that follows it in the second file. Each entry of the first file can be found in the GN field of the lines starting with > in the second file. I want to print the matching line (starting with >) followed by the next line, which holds the amino acid sequence (a string of capital letters starting with "M").
File 1:
thrB
yaaX
thrC
dnaK
dnaJ
File 2:
>sp|B1XBC8|KHSE_ECODH Homoserine kinase OS=Escherichia coli (strain K12 / DH10B) OX=316385 GN=thrB PE=3 SV=1
MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEP
>sp|P0AD61|KPYK1_ECOLI Pyruvate kinase I OS=Escherichia coli (strain K12) OX=83333 GN=pykF PE=1 SV=1
MKKTKIVCTIGPKTESEEMLAKMLDAGMNVMRLNFSHGDYAEHGQRIQNLRNVMSKTGKT
>sp|P75616|YAAX_ECOLI Uncharacterized protein YaaX OS=Escherichia coli (strain K12) OX=83333 GN=yaaX PE=3 SV=1
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQH
and I am expecting this output:
>sp|B1XBC8|KHSE_ECODH Homoserine kinase OS=Escherichia coli (strain K12 / DH10B) OX=316385 GN=thrB PE=3 SV=1
MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEP
>sp|P75616|YAAX_ECOLI Uncharacterized protein YaaX OS=Escherichia coli (strain K12) OX=83333 GN=yaaX PE=3 SV=1
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQH
So far I have tried
grep -F -f file1 file2
which only prints the matching lines.
With awk I have only written
awk 'NR==FNR{a[$1]++;next}{}' file1 file2
I can print the matching line but I don't know how to print the line after that (starting with "M").
Can anyone help me in getting through this?
I would be really grateful for your help.
Also, what if my second file has multiple matches of the string in file 1 and I want to print all such occurrences?
Thanks in advance
Upvotes: 1
Views: 699
Reputation: 140
If the two files are in FASTA format, you should rather try blastn from the BLAST+ suite of tools, which is optimized for performance and gives you extra information on the pairwise alignment (identity rate, length of the overlap, number of mismatches and gaps):
blastn -query file1 -subject file2 -outfmt 6 > outfile.csv
And then parse the result if you want the output as a third fasta file.
Upvotes: 0
Reputation: 23697
If you have GNU grep:
grep --no-group-separator -A1 -Ff file1 file2
-A1 tells grep to print the matching line as well as the line after it. By default, grep then prints a -- separator line between non-adjacent groups of matches, so use --no-group-separator if you wish to avoid this line.
Upvotes: 2
Reputation: 133770
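Could you please try the following.
awk '
FNR==NR{                 # first file: store each gene name as an array key
  a[$0]
  next
}
match($0,/GN=[^ ]*/){    # second file: extract the value after "GN="
  str=substr($0,RSTART+3,RLENGTH-3)
}
(str in a) && /^>/{      # header whose gene is listed: remember it
  found=1
  val=$0
  next
}
found && /^M/{           # sequence line directly after a remembered header
  print val ORS $0
}
{
  val=found=""           # reset so only the line right after a header is printed
}
' Input_file1 Input_file2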
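If the real second file wraps a sequence over several lines, or lists the same gene under more than one header, a per-record flag may be simpler. The following is only a sketch under those assumptions: the file contents are shortened stand-ins for the data in the question, and the names genes, keep and result are illustrative.

```shell
#!/bin/sh
# Hypothetical sample files, shortened from the data in the question.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
thrB
yaaX
EOF
cat > "$tmp/file2" <<'EOF'
>sp|B1XBC8|KHSE_ECODH Homoserine kinase GN=thrB PE=3 SV=1
MVKVYAPASS
ANMSVGFDVL
>sp|P0AD61|KPYK1_ECOLI Pyruvate kinase I GN=pykF PE=1 SV=1
MKKTKIVCTI
>sp|P75616|YAAX_ECOLI Uncharacterized protein YaaX GN=yaaX PE=3 SV=1
MKKMQSIVLA
EOF

result=$(awk '
    # First file: remember every gene name as an array key.
    FNR == NR { genes[$1]; next }
    # Header line: decide whether the whole record is wanted.
    /^>/ {
        keep = 0
        if (match($0, /GN=[^ ]+/))
            keep = (substr($0, RSTART + 3, RLENGTH - 3) in genes)
    }
    # Print the header and every following line while the flag is set.
    keep
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the flag is recomputed at every header, each matching record is printed wherever it occurs, and headers without a GN= field are skipped rather than inheriting the previous gene name.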
Upvotes: 0