Clairessa Brown

Reputation: 23

Combine certain lines that match regex in text file

I have several large files full of biological sequences that I would like to split into chunks containing a specific number of sequences. However, some file formatting needs to happen before that.

In each biological sequence, there is one line beginning with ">k..." that is the sequence header. The following one or more lines are the biological sequence. Every sequence has exactly one header, but some have two or more lines of sequence. I would like to join the sequence lines under each header so that every multi-line sequence becomes one long line of sequence.

>k141_0_1 # 86 # 388 # -1 # ID=1_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.703
PSSSAVGGTVTGDTQGCWRVDELRLRGGDDAEWARVIETHSAIIESVLRRRVGDASMRAE
VRDAVWARAFFEGLEPGEHAPVLPKELAEKPRLGGDRHRE*
>k141_964934_1 # 3 # 341 # -1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.699
AVSTCVYVYSLGYMDRDVEDRADPVRAPNVRRFFCLLDFFLFSMAILVLAGNIAVLLIGW
TCVGLSSFLLISYWTGKPGTLSAGLQALAANAIGDAALLVALVLVPAGCGDLL
>k141_1688630_1 # 1 # 150 # 1 # ID=3_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.707
ALLLVLVLRVVHAYHSERLSDVADEEAELNARLEREEAPQHAEEEAAAL*
>k141_1688630_2 # 147 # 416 # 1 # ID=3_2;partial=01;start_type=GTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.748
MTALSIALLAPWAAGIVLVALDGRRRLIGWLAIGALFANLAGLTILAVSVLSDDPEVATT
GNWPTGVGITLRADALGVLFALLSSPRAAR
>k141_361851_1 # 2 # 388 # 1 # ID=4_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.721
PATLTAAGGVFASGLSSRGRLVSGAAPKFYGNPLVAWTPAPGASAYEVQWSKTRYPFRPE
KDPQNGNAFGRLTLGTSAVLPLRPGVWYYRVRGFSFALPTGAQQLSWSDPARIVVAKPTF
RVVRRKHK*
>k141_241234_1 # 224 # 373 # -1 # ID=5_1;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.713
MVSLIGGLLTFTLGTGLVTWGAAVRGAMEHDGTLRGAGRLPQGASQEAS*
>k141_1206166_1 # 179 # 322 # -1 # ID=6_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
YQYAWTDLLGPTLVWDQVARGVLWSLAYSLVLYAAAWWHFLRKDVLS*
>k141_482468_1 # 123 # 314 # -1 # ID=7_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.615
AQFALRWILMNEAVSVVIPGARNPEQAIANTQASELPALSVNQMEAANAIFDRLIRPHVH
QRW*
>k141_1447399_1 # 3 # 317 # -1 # ID=8_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.711
RTRPGSSPRGWFGPHLEALWTYLHEHHHISYARLEAIGRDLWHLAVSQGALANALRRTAT
RLRPEAGAIREQVRASPTIGSDETSARVNGRTHWQWVFQTPTASY
>k141_1_1 # 2 # 364 # 1 # ID=9_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
SYMFVTGPDVVKTVTHETVTQEELGGAVTHTTRSGVADLAFENDVEALLQLRRFMDFLPS
SNREKPPVRPTWDSPDREEASLDTLIPANPNKPYDMKELILKVVDEGDFFEIQPTYARNI
I
>k141_964935_1 # 2 # 235 # 1 # ID=10_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.726
LADLSRDPDLRAELQRAVDRVNDGAAHHARIRRFSVVPRPFSADAGEITPTLKLKRRVIE
ERFADEIEALYAPLVRS*

I am currently using a while loop that iterates over all lines of the text file. I then use awk and pattern matching in an if-else statement to check whether there are two consecutive lines that have sequences in them. However, my pattern-matching isn't working and I'm not sure why.

I have tried grep, but grep processes the entire file at once. I have also tried sed and tr to remove the \n characters from specific lines, but neither worked.

I would appreciate any help, including a totally different way of approaching this problem.

file=protein_test.fa 
i=1 # for line counter
prot_reg="^[A-Z]{10,}" # regex for biological sequence
while read -r line; do 

    # Read in 2 lines at the same time
    awk1=$(awk -v i=$i 'NR==i' < $file)
    awk2=$(awk -v i=$i 'NR==i+1' < $file)

    if [[ ${awk1}=~$prot_reg && ${awk2}=~$prot_reg ]]
    then
        echo $awk1$awk2
    else
        echo $awk1
        echo $awk2
    fi

    let i=i+1 # til all lines read, add 1 to i

done < $file

Here's what I'd like:

>k141_0_1 # 86 # 388 # -1 # ID=1_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.703
PSSSAVGGTVTGDTQGCWRVDELRLRGGDDAEWARVIETHSAIIESVLRRRVGDASMRAEVRDAVWARAFFEGLEPGEHAPVLPKELAEKPRLGGDRHRE*
>k141_964934_1 # 3 # 341 # -1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.699
AVSTCVYVYSLGYMDRDVEDRADPVRAPNVRRFFCLLDFFLFSMAILVLAGNIAVLLIGWTCVGLSSFLLISYWTGKPGTLSAGLQALAANAIGDAALLVALVLVPAGCGDLL
>k141_1688630_1 # 1 # 150 # 1 # ID=3_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.707
ALLLVLVLRVVHAYHSERLSDVADEEAELNARLEREEAPQHAEEEAAAL*
>k141_1688630_2 # 147 # 416 # 1 # ID=3_2;partial=01;start_type=GTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.748
MTALSIALLAPWAAGIVLVALDGRRRLIGWLAIGALFANLAGLTILAVSVLSDDPEVATTGNWPTGVGITLRADALGVLFALLSSPRAAR
>k141_361851_1 # 2 # 388 # 1 # ID=4_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.721
PATLTAAGGVFASGLSSRGRLVSGAAPKFYGNPLVAWTPAPGASAYEVQWSKTRYPFRPEKDPQNGNAFGRLTLGTSAVLPLRPGVWYYRVRGFSFALPTGAQQLSWSDPARIVVAKPTFRVVRRKHK*
>k141_241234_1 # 224 # 373 # -1 # ID=5_1;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.713
MVSLIGGLLTFTLGTGLVTWGAAVRGAMEHDGTLRGAGRLPQGASQEAS*
>k141_1206166_1 # 179 # 322 # -1 # ID=6_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
YQYAWTDLLGPTLVWDQVARGVLWSLAYSLVLYAAAWWHFLRKDVLS*
>k141_482468_1 # 123 # 314 # -1 # ID=7_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.615
AQFALRWILMNEAVSVVIPGARNPEQAIANTQASELPALSVNQMEAANAIFDRLIRPHVHQRW*
>k141_1447399_1 # 3 # 317 # -1 # ID=8_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.711
RTRPGSSPRGWFGPHLEALWTYLHEHHHISYARLEAIGRDLWHLAVSQGALANALRRTATRLRPEAGAIREQVRASPTIGSDETSARVNGRTHWQWVFQTPTASY
>k141_1_1 # 2 # 364 # 1 # ID=9_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
SYMFVTGPDVVKTVTHETVTQEELGGAVTHTTRSGVADLAFENDVEALLQLRRFMDFLPSSNREKPPVRPTWDSPDREEASLDTLIPANPNKPYDMKELILKVVDEGDFFEIQPTYARNII
>k141_964935_1 # 2 # 235 # 1 # ID=10_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.726
LADLSRDPDLRAELQRAVDRVNDGAAHHARIRRFSVVPRPFSADAGEITPTLKLKRRVIEERFADEIEALYAPLVRS*

Upvotes: 2

Views: 116

Answers (1)

Ed Morton

Reputation: 203229

Is this what you're trying to do?

$ awk '{printf "%s", (/^>/ ? s $0 ORS : $0); s=ORS} END{print ""}' file
>k141_0_1 # 86 # 388 # -1 # ID=1_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.703
PSSSAVGGTVTGDTQGCWRVDELRLRGGDDAEWARVIETHSAIIESVLRRRVGDASMRAEVRDAVWARAFFEGLEPGEHAPVLPKELAEKPRLGGDRHRE*
>k141_964934_1 # 3 # 341 # -1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.699
AVSTCVYVYSLGYMDRDVEDRADPVRAPNVRRFFCLLDFFLFSMAILVLAGNIAVLLIGWTCVGLSSFLLISYWTGKPGTLSAGLQALAANAIGDAALLVALVLVPAGCGDLL
>k141_1688630_1 # 1 # 150 # 1 # ID=3_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.707
ALLLVLVLRVVHAYHSERLSDVADEEAELNARLEREEAPQHAEEEAAAL*
>k141_1688630_2 # 147 # 416 # 1 # ID=3_2;partial=01;start_type=GTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.748
MTALSIALLAPWAAGIVLVALDGRRRLIGWLAIGALFANLAGLTILAVSVLSDDPEVATTGNWPTGVGITLRADALGVLFALLSSPRAAR
>k141_361851_1 # 2 # 388 # 1 # ID=4_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.721
PATLTAAGGVFASGLSSRGRLVSGAAPKFYGNPLVAWTPAPGASAYEVQWSKTRYPFRPEKDPQNGNAFGRLTLGTSAVLPLRPGVWYYRVRGFSFALPTGAQQLSWSDPARIVVAKPTFRVVRRKHK*
>k141_241234_1 # 224 # 373 # -1 # ID=5_1;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.713
MVSLIGGLLTFTLGTGLVTWGAAVRGAMEHDGTLRGAGRLPQGASQEAS*
>k141_1206166_1 # 179 # 322 # -1 # ID=6_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
YQYAWTDLLGPTLVWDQVARGVLWSLAYSLVLYAAAWWHFLRKDVLS*
>k141_482468_1 # 123 # 314 # -1 # ID=7_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.615
AQFALRWILMNEAVSVVIPGARNPEQAIANTQASELPALSVNQMEAANAIFDRLIRPHVHQRW*
>k141_1447399_1 # 3 # 317 # -1 # ID=8_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.711
RTRPGSSPRGWFGPHLEALWTYLHEHHHISYARLEAIGRDLWHLAVSQGALANALRRTATRLRPEAGAIREQVRASPTIGSDETSARVNGRTHWQWVFQTPTASY
>k141_1_1 # 2 # 364 # 1 # ID=9_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
SYMFVTGPDVVKTVTHETVTQEELGGAVTHTTRSGVADLAFENDVEALLQLRRFMDFLPSSNREKPPVRPTWDSPDREEASLDTLIPANPNKPYDMKELILKVVDEGDFFEIQPTYARNII
>k141_964935_1 # 2 # 235 # 1 # ID=10_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.726
LADLSRDPDLRAELQRAVDRVNDGAAHHARIRRFSVVPRPFSADAGEITPTLKLKRRVIEERFADEIEALYAPLVRS*
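
For readers who find the one-liner dense, the same idea can be spelled out as a commented awk script. This is only a sketch of the approach above, not a drop-in replacement: the function name unwrap_fasta is hypothetical, and it assumes every header line starts with ">" and that no sequence line does.

```shell
# unwrap_fasta (hypothetical name): join the sequence lines under each
# ">" header into one line. Reads the named files, or stdin if none given.
unwrap_fasta() {
    awk '
        /^>/ {                          # header line
            if (NR > 1) printf "\n"     # terminate the previous sequence line
            print                       # emit the header on its own line
            next
        }
        { printf "%s", $0 }             # sequence line: append without a newline
        END { if (NR > 0) printf "\n" } # terminate the final sequence line
    ' "$@"
}
```

It could be called as `unwrap_fasta protein_test.fa > unwrapped.fa`, where unwrapped.fa is an illustrative output name. Because awk streams the input one line at a time, memory use should stay small even for large files.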

Please read the following for information on the main issues with the code you posted:

  1. Why is using a shell loop to process text considered bad practice?
  2. https://mywiki.wooledge.org/Quotes
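
Separately from the two points above, one likely reason the pattern matching in the posted loop appears not to work is the missing whitespace around =~. Inside [[ ... ]], bash treats ${awk1}=~$prot_reg as a single word and only tests whether it is non-empty, so the condition is always true. A minimal sketch (bash only; the sample line is taken from the question, and the variable names no_space and spaced are illustrative):

```shell
# Requires bash: [[ ... ]] and =~ are not POSIX sh features.
prot_reg='^[A-Z]{10,}'          # same regex as in the question
line='>k141_0_1 # 86 # 388'     # a header line, which should NOT match

# No spaces: one non-empty word, so the test succeeds regardless of content.
if [[ ${line}=~$prot_reg ]]; then no_space=match; else no_space=nomatch; fi

# With spaces, =~ is parsed as the regex-match operator.
if [[ $line =~ $prot_reg ]]; then spaced=match; else spaced=nomatch; fi

echo "$no_space $spaced"
```

Even with that fixed, the loop would still start two awk processes per input line and could only join pairs of lines, which is why a single awk pass is the better fit here.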

Upvotes: 2
