Susheel Busi

Reputation: 163

Finding a pattern, then finding a second pattern in the lines before it

I have a file that looks like this:

LOCUS       contig_142             11028 bp    DNA              UNK 07-JUN-2020
DEFINITION  .
ACCESSION
VERSION
KEYWORDS    .
SOURCE      tx-145
  ORGANISM  tx-145
            Unclassified.
COMMENT     .
FEATURES             Location/Qualifiers
     CDS             38..1026
                     /locus_tag="tx-145_00001"
                     /transl_table=11
                     /translation="VRLPQKKQLIHTELLDGLSAKMDFSPYLAEEHNPVQSARPVPRKK
                     PYQGDVPLEALLEDIKARTKVPAYRLRVRRGKTPGLTDSKIGGLPYWDLSQPYPADEKG
                     QPMQLLAQINFGAEDMDKPFPKTGLLQFFIGLDEMFGCNFAYAPDQKNYRVVYHPEIDG
                     SVTPDKVSALGVPGLVNDYRTSPLEAELAIYAEREDSFANDRSFVFEDAFRAAVQAVMG
                     VDMGEKESYEFLDEDAYDELFESFQETDDGCMNGGHWMLGYPSFTQEDPRPEDSPFDTL
                     LLQIDSMRDEDGGNPILWGDCGVCNFFIARTDLEKLDFSQVLYNWDCC"
     CDS             1255..2219
                     /locus_tag="tx-145_00002"
                     /transl_table=11
                     /translation="MKQRIFITLLLLVLLLASCGQAAQPHAQSEPAATPSEVEKIAFTD
                     ALGQDFFIDPPQRAVVMIGSFADVWVLAGGEDVLAATANDAWESYALDLPEDTVNIGSP
                     MKPNVELVLGAQPDLIIASSLSPSNLELQETFQRAGIPAAYFDVSSFQDYLDLLELFTR
                     LTGRPENYETYGAAVKAQVDGAVDRRVEYSFAPTVLTIQVSGSSVKVKNSEDNVLGPML
                     KELGCENIADRDGSLLEDLSLEAILQADPDFIFAVYHGTDEAAARANLEESLLSNPAWA
                     SLSAVEGGRFHILERRMFSLKPNALWGDAYEQLADILCGE"

I would like to use grep/awk/sed to find the locus_tag tx-145_00002 and, if it is found, retrieve the contig ID (here contig_142) from the LOCUS line several lines above the match.

Note: I have tried grep -B NUMBER_of_lines, but the number of lines between the two matches is not fixed; it varies widely from sample to sample.

Appreciate your help with this. Thank you!

Sorry about editing this late, but if possible, my expected output should be this:

contig_142
tx-145_00002

Upvotes: 1

Views: 225

Answers (2)

RavinderSingh13

Reputation: 133458

Could you please try the following? Written and tested with the shown samples in GNU awk. Note that it prints the contig ID and the tag on one line, separated by a space.

awk -v valtofind="tx-145_00002" '
/^LOCUS/{
  id=$2
  next
}
/\/locus_tag/ && $0 ~ "\""valtofind"\"$" {
  print id,valtofind
  id=""
}
'  Input_file

Explanation: a detailed, line-by-line explanation of the above.

awk -v valtofind="tx-145_00002" '                 ##Start the awk program and set the variable valtofind to the tag to look for.
/^LOCUS/{                                         ##If the line starts with LOCUS, do the following.
  id=$2                                           ##Store the second field (the contig ID) in id.
  next                                            ##Skip the remaining rules for this line and read the next one.
}
/\/locus_tag/ && $0 ~ "\""valtofind"\"$" {        ##If the line contains /locus_tag and ends with the quoted value of valtofind, do the following.
  print id,valtofind                              ##Print id and valtofind, separated by a space.
  id=""                                           ##Reset id.
}
' Input_file                                      ##Name of the input file.
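
If the two-line output shown in the question is wanted, a variant along these lines might do. It is only a sketch: the function name find_contig and the cut-down temporary sample file are made up for illustration, and it assumes the tag always appears as /locus_tag="..." on a single line. It uses index() so the tag is compared as a literal string rather than as a regex, and it stops at the first hit.

```shell
# Hypothetical helper: print the contig ID, then the locus tag, on separate lines.
# $1 = locus tag to look for, $2 = input file
find_contig() {
  awk -v tag="$1" '
    $1 == "LOCUS" { id = $2; next }          # remember the most recent contig ID
    index($0, "/locus_tag=\"" tag "\"") {    # literal match on the quoted tag
      print id
      print tag
      exit                                   # stop after the first hit
    }
  ' "$2"
}

# Cut-down sample in the same layout as the file in the question (illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
LOCUS       contig_142             11028 bp    DNA              UNK 07-JUN-2020
FEATURES             Location/Qualifiers
     CDS             38..1026
                     /locus_tag="tx-145_00001"
     CDS             1255..2219
                     /locus_tag="tx-145_00002"
EOF

find_contig "tx-145_00002" "$sample"
```

On the cut-down sample this should print contig_142 followed by tx-145_00002; a tag that is not present prints nothing.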

Upvotes: 1

potong

Reputation: 58371

This might work for you (GNU sed):

sed -E '/^LOCUS/h;/locus_tag.*tx-145_00002/!d;x;s/^\S+\s+(\S+).*/\1/' file

On a match with LOCUS, copy the line to the hold space.

Delete every line that does not match both locus_tag and tx-145_00002. On a line that does match, swap in the held LOCUS line and reduce it to its second field, the contig ID.
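
The command above prints only the contig ID. To get the tag on a second line as well, something like the following may work (GNU sed again). It is a sketch under assumptions: the function name contig_and_tag and the temporary sample file are hypothetical, and the tag is hard-coded in the script. It stores just the contig ID in the hold space, appends the tag to it on a match, prints both and quits.

```shell
# Hypothetical helper: print the contig ID and the hard-coded tag on two lines.
# $1 = input file (GNU sed assumed for -E with \S and \s)
contig_and_tag() {
  sed -nE '
    # On a LOCUS line, keep only the second field and store it in the hold space.
    /^LOCUS/{s/^\S+\s+(\S+).*/\1/;h;}
    # On the wanted tag, keep only the quoted value, append it to the held ID,
    # swap the two-line result into the pattern space, print it and quit.
    /locus_tag="tx-145_00002"/{s/.*"(.*)"/\1/;H;x;p;q;}
  ' "$1"
}

# Cut-down sample in the same layout as the file in the question (illustrative only).
sed_sample=$(mktemp)
cat > "$sed_sample" <<'EOF'
LOCUS       contig_142             11028 bp    DNA              UNK 07-JUN-2020
FEATURES             Location/Qualifiers
     CDS             38..1026
                     /locus_tag="tx-145_00001"
     CDS             1255..2219
                     /locus_tag="tx-145_00002"
EOF

contig_and_tag "$sed_sample"
```

Because of the q, only the first occurrence of the tag is reported.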

Upvotes: 1
