codeblaze

Reputation: 73

Search for patterns from one file in another file and write each match plus the following line to a third file

I want to search for the contents of one file in another file and print each matching line together with the line that follows it in the second file. Each entry of the first file appears in the GN field of the header lines (those starting with >) in the second file. For every match, I want to write the header line followed by the next line, which holds the amino acid sequence (a string of capital letters starting with "M").

File 1:

thrB
yaaX
thrC
dnaK
dnaJ

File 2:

>sp|B1XBC8|KHSE_ECODH Homoserine kinase OS=Escherichia coli (strain K12 / DH10B) OX=316385 GN=thrB PE=3 SV=1
MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEP
>sp|P0AD61|KPYK1_ECOLI Pyruvate kinase I OS=Escherichia coli (strain K12) OX=83333 GN=pykF PE=1 SV=1
MKKTKIVCTIGPKTESEEMLAKMLDAGMNVMRLNFSHGDYAEHGQRIQNLRNVMSKTGKT
>sp|P75616|YAAX_ECOLI Uncharacterized protein YaaX OS=Escherichia coli (strain K12) OX=83333 GN=yaaX PE=3 SV=1
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQH

I am expecting this output:

>sp|B1XBC8|KHSE_ECODH Homoserine kinase OS=Escherichia coli (strain K12 / DH10B) OX=316385 GN=thrB PE=3 SV=1
MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEP
>sp|P75616|YAAX_ECOLI Uncharacterized protein YaaX OS=Escherichia coli (strain K12) OX=83333 GN=yaaX PE=3 SV=1
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQH

So far I have tried grep -F -f file1 file2, which prints only the matching lines.

With awk I have only written awk 'NR==FNR{a[$1]++;next}{}' file1 file2. I can print the matching line, but I don't know how to print the line after it (starting with "M").

Can anyone help me in getting through this?

I would be really grateful for your help.

Also, what if my second file has multiple matches of the string in file 1 and I want to print all such occurrences?

Thanks in Advance

Upvotes: 1

Views: 699

Answers (3)

Franck Theeten

Reputation: 140

If the two files are in FASTA format, you should rather try blastn from the BLAST+ suite of tools, which is optimized for performance and gives you extra information on the pairwise alignment (identity rate, length of the overlap, number of mismatches and gaps):

blastn -query file1 -subject file2 -outfmt 6 > outfile.csv

Then parse the result if you want the output as a third FASTA file.

Upvotes: 0

Sundeep

Reputation: 23697

If you have GNU grep:

grep --no-group-separator -A1 -Ff file1 file2
  • -A1 tells grep to print one line after each matching line, in addition to the match itself
  • by default, separate groups of output are divided by a -- line; use --no-group-separator to suppress it
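
One caveat with -Ff file1: a fixed string such as thrB matches anywhere on the line, so it would also hit a header whose gene name merely starts with it (say, GN=thrB2) or a description that happens to contain the name. Here is a minimal sketch that anchors each name to the GN= field. It assumes GNU grep and that every header has a space after the GN= value, as in the question's sample. The scratch directory, the trimmed sample headers and the output name file3 are my own choices for illustration.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

# file1: gene names from the question.
printf '%s\n' thrB yaaX thrC dnaK dnaJ > file1

# file2: the question's records with the header descriptions trimmed for brevity.
cat > file2 <<'EOF'
>sp|B1XBC8|KHSE_ECODH Homoserine kinase GN=thrB PE=3 SV=1
MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEP
>sp|P0AD61|KPYK1_ECOLI Pyruvate kinase I GN=pykF PE=1 SV=1
MKKTKIVCTIGPKTESEEMLAKMLDAGMNVMRLNFSHGDYAEHGQRIQNLRNVMSKTGKT
>sp|P75616|YAAX_ECOLI Uncharacterized protein YaaX GN=yaaX PE=3 SV=1
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQH
EOF

# Turn each name into the fixed string "GN=<name> " (note the trailing space),
# then let grep read those patterns from standard input with -f -.
sed 's/^/GN=/; s/$/ /' file1 |
  grep --no-group-separator -A1 -Ff - file2 > file3

cat file3
```

The redirection into file3 covers the "third file" part of the question. Every header that matches is printed, so repeated occurrences of a gene name in file2 are all kept.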

Upvotes: 2

RavinderSingh13

Reputation: 133770

Could you please try the following.

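awk '
FNR==NR{                  # first file: store each gene name as an array key
  a[$0]
  next
}
match($0,/GN=[^ ]*/){     # second file: extract the value after GN=
  str=substr($0,RSTART+3,RLENGTH-3)
}
(str in a) && /^>/{       # header whose gene name is in the first file
  found=1
  val=$0
  next
}
found && /^M/{            # sequence line right after a matching header
  print val ORS $0
}
{
  val=found=""            # reset for the next record
}
'  Input_file1  Input_file2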
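
The script above prints a record only when the line right after the header starts with M, so it expects each sequence on a single line. If the sequences in your real file are wrapped over several lines, a variant along the following lines may be closer to what you need. It is only a sketch: it keeps every line of a record whose GN= value is listed in the first file and writes the result to a third file. The wrapped sample records, the scratch directory and the name file3 are assumptions made for the demonstration.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

printf '%s\n' thrB yaaX thrC dnaK dnaJ > file1

# Question's sequences, wrapped at 30 characters to mimic multi-line FASTA;
# header descriptions are trimmed for brevity.
cat > file2 <<'EOF'
>sp|B1XBC8|KHSE_ECODH Homoserine kinase GN=thrB PE=3 SV=1
MVKVYAPASSANMSVGFDVLGAAVTPVDGA
LLGDVVTVEAAETFSLNNLGRFADKLPSEP
>sp|P0AD61|KPYK1_ECOLI Pyruvate kinase I GN=pykF PE=1 SV=1
MKKTKIVCTIGPKTESEEMLAKMLDAGMNV
MRLNFSHGDYAEHGQRIQNLRNVMSKTGKT
>sp|P75616|YAAX_ECOLI Uncharacterized protein YaaX GN=yaaX PE=3 SV=1
MKKMQSIVLALSLVLVAPMAAQAAEITLVP
SVKLQIGDRDNRGYYWDGGHWRDHGWWKQH
EOF

awk '
FNR==NR {                     # first file: remember each non-empty gene name
  if ($0 != "") genes[$0]
  next
}
/^>/ {                        # header line: decide whether to keep this record
  keep = 0
  if (match($0, /GN=[^ ]*/))
    keep = (substr($0, RSTART+3, RLENGTH-3) in genes)
}
keep                          # print the header and all its sequence lines
' file1 file2 > file3

cat file3
```

Because the decision is made once per header and applies until the next header, this also handles a gene name that occurs in several records of the second file: each of those records is written out.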

Upvotes: 0
