Reputation: 23
I have several large files full of biological sequences that I would like to split into chunks containing a specific number of sequences. However, some reformatting needs to happen before that.
Each record starts with one header line that begins with ">k...". The next one or more lines are the biological sequence. Every record has exactly one header, but some have two or more lines of sequence. I would like to join the sequence lines under each header so that every multi-line sequence becomes one long line.
>k141_0_1 # 86 # 388 # -1 # ID=1_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.703
PSSSAVGGTVTGDTQGCWRVDELRLRGGDDAEWARVIETHSAIIESVLRRRVGDASMRAE
VRDAVWARAFFEGLEPGEHAPVLPKELAEKPRLGGDRHRE*
>k141_964934_1 # 3 # 341 # -1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.699
AVSTCVYVYSLGYMDRDVEDRADPVRAPNVRRFFCLLDFFLFSMAILVLAGNIAVLLIGW
TCVGLSSFLLISYWTGKPGTLSAGLQALAANAIGDAALLVALVLVPAGCGDLL
>k141_1688630_1 # 1 # 150 # 1 # ID=3_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.707
ALLLVLVLRVVHAYHSERLSDVADEEAELNARLEREEAPQHAEEEAAAL*
>k141_1688630_2 # 147 # 416 # 1 # ID=3_2;partial=01;start_type=GTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.748
MTALSIALLAPWAAGIVLVALDGRRRLIGWLAIGALFANLAGLTILAVSVLSDDPEVATT
GNWPTGVGITLRADALGVLFALLSSPRAAR
>k141_361851_1 # 2 # 388 # 1 # ID=4_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.721
PATLTAAGGVFASGLSSRGRLVSGAAPKFYGNPLVAWTPAPGASAYEVQWSKTRYPFRPE
KDPQNGNAFGRLTLGTSAVLPLRPGVWYYRVRGFSFALPTGAQQLSWSDPARIVVAKPTF
RVVRRKHK*
>k141_241234_1 # 224 # 373 # -1 # ID=5_1;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.713
MVSLIGGLLTFTLGTGLVTWGAAVRGAMEHDGTLRGAGRLPQGASQEAS*
>k141_1206166_1 # 179 # 322 # -1 # ID=6_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
YQYAWTDLLGPTLVWDQVARGVLWSLAYSLVLYAAAWWHFLRKDVLS*
>k141_482468_1 # 123 # 314 # -1 # ID=7_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.615
AQFALRWILMNEAVSVVIPGARNPEQAIANTQASELPALSVNQMEAANAIFDRLIRPHVH
QRW*
>k141_1447399_1 # 3 # 317 # -1 # ID=8_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.711
RTRPGSSPRGWFGPHLEALWTYLHEHHHISYARLEAIGRDLWHLAVSQGALANALRRTAT
RLRPEAGAIREQVRASPTIGSDETSARVNGRTHWQWVFQTPTASY
>k141_1_1 # 2 # 364 # 1 # ID=9_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
SYMFVTGPDVVKTVTHETVTQEELGGAVTHTTRSGVADLAFENDVEALLQLRRFMDFLPS
SNREKPPVRPTWDSPDREEASLDTLIPANPNKPYDMKELILKVVDEGDFFEIQPTYARNI
I
>k141_964935_1 # 2 # 235 # 1 # ID=10_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.726
LADLSRDPDLRAELQRAVDRVNDGAAHHARIRRFSVVPRPFSADAGEITPTLKLKRRVIE
ERFADEIEALYAPLVRS*
I am currently using a while loop that iterates over every line of the text file. Inside the loop I use awk to fetch two consecutive lines, then pattern matching in an if-else statement to check whether both lines contain sequence. However, my pattern matching isn't working and I'm not sure why.
I have tried grep, but grep reads in the entire file at a time. I have also tried sed and tr to remove the \n characters from specific lines, but neither worked.
I would appreciate any help, including a totally different way of approaching this problem.
file=protein_test.fa
i=1 # for line counter
prot_reg="^[A-Z]{10,}" # regex for biological sequence
while read -r line; do
    # Read in 2 lines at the same time
    awk1=$(awk -v i=$i 'NR==i' < $file)
    awk2=$(awk -v i=$i 'NR==i+1' < $file)
    if [[ ${awk1}=~$prot_reg && ${awk2}=~$prot_reg ]]
    then
        echo $awk1$awk2
    else
        echo $awk1
        echo $awk2
    fi
    let i=i+1 # til all lines read, add 1 to i
done < $file
Here's what I'd like:
>k141_0_1 # 86 # 388 # -1 # ID=1_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.703
PSSSAVGGTVTGDTQGCWRVDELRLRGGDDAEWARVIETHSAIIESVLRRRVGDASMRAEVRDAVWARAFFEGLEPGEHAPVLPKELAEKPRLGGDRHRE*
>k141_964934_1 # 3 # 341 # -1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.699
AVSTCVYVYSLGYMDRDVEDRADPVRAPNVRRFFCLLDFFLFSMAILVLAGNIAVLLIGWTCVGLSSFLLISYWTGKPGTLSAGLQALAANAIGDAALLVALVLVPAGCGDLL
>k141_1688630_1 # 1 # 150 # 1 # ID=3_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.707
ALLLVLVLRVVHAYHSERLSDVADEEAELNARLEREEAPQHAEEEAAAL*
>k141_1688630_2 # 147 # 416 # 1 # ID=3_2;partial=01;start_type=GTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.748
MTALSIALLAPWAAGIVLVALDGRRRLIGWLAIGALFANLAGLTILAVSVLSDDPEVATTGNWPTGVGITLRADALGVLFALLSSPRAAR
>k141_361851_1 # 2 # 388 # 1 # ID=4_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.721
PATLTAAGGVFASGLSSRGRLVSGAAPKFYGNPLVAWTPAPGASAYEVQWSKTRYPFRPEKDPQNGNAFGRLTLGTSAVLPLRPGVWYYRVRGFSFALPTGAQQLSWSDPARIVVAKPTFRVVRRKHK*
>k141_241234_1 # 224 # 373 # -1 # ID=5_1;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.713
MVSLIGGLLTFTLGTGLVTWGAAVRGAMEHDGTLRGAGRLPQGASQEAS*
>k141_1206166_1 # 179 # 322 # -1 # ID=6_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
YQYAWTDLLGPTLVWDQVARGVLWSLAYSLVLYAAAWWHFLRKDVLS*
>k141_482468_1 # 123 # 314 # -1 # ID=7_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.615
AQFALRWILMNEAVSVVIPGARNPEQAIANTQASELPALSVNQMEAANAIFDRLIRPHVHQRW*
>k141_1447399_1 # 3 # 317 # -1 # ID=8_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.711
RTRPGSSPRGWFGPHLEALWTYLHEHHHISYARLEAIGRDLWHLAVSQGALANALRRTATRLRPEAGAIREQVRASPTIGSDETSARVNGRTHWQWVFQTPTASY
>k141_1_1 # 2 # 364 # 1 # ID=9_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
SYMFVTGPDVVKTVTHETVTQEELGGAVTHTTRSGVADLAFENDVEALLQLRRFMDFLPSSNREKPPVRPTWDSPDREEASLDTLIPANPNKPYDMKELILKVVDEGDFFEIQPTYARNII
>k141_964935_1 # 2 # 235 # 1 # ID=10_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.726
LADLSRDPDLRAELQRAVDRVNDGAAHHARIRRFSVVPRPFSADAGEITPTLKLKRRVIEERFADEIEALYAPLVRS*
Upvotes: 2
Views: 116
Reputation: 203229
Is this what you're trying to do?
$ awk '{printf "%s", (/^>/ ? s $0 ORS : $0); s=ORS} END{print ""}' file
>k141_0_1 # 86 # 388 # -1 # ID=1_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.703
PSSSAVGGTVTGDTQGCWRVDELRLRGGDDAEWARVIETHSAIIESVLRRRVGDASMRAEVRDAVWARAFFEGLEPGEHAPVLPKELAEKPRLGGDRHRE*
>k141_964934_1 # 3 # 341 # -1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.699
AVSTCVYVYSLGYMDRDVEDRADPVRAPNVRRFFCLLDFFLFSMAILVLAGNIAVLLIGWTCVGLSSFLLISYWTGKPGTLSAGLQALAANAIGDAALLVALVLVPAGCGDLL
>k141_1688630_1 # 1 # 150 # 1 # ID=3_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.707
ALLLVLVLRVVHAYHSERLSDVADEEAELNARLEREEAPQHAEEEAAAL*
>k141_1688630_2 # 147 # 416 # 1 # ID=3_2;partial=01;start_type=GTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.748
MTALSIALLAPWAAGIVLVALDGRRRLIGWLAIGALFANLAGLTILAVSVLSDDPEVATTGNWPTGVGITLRADALGVLFALLSSPRAAR
>k141_361851_1 # 2 # 388 # 1 # ID=4_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.721
PATLTAAGGVFASGLSSRGRLVSGAAPKFYGNPLVAWTPAPGASAYEVQWSKTRYPFRPEKDPQNGNAFGRLTLGTSAVLPLRPGVWYYRVRGFSFALPTGAQQLSWSDPARIVVAKPTFRVVRRKHK*
>k141_241234_1 # 224 # 373 # -1 # ID=5_1;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.713
MVSLIGGLLTFTLGTGLVTWGAAVRGAMEHDGTLRGAGRLPQGASQEAS*
>k141_1206166_1 # 179 # 322 # -1 # ID=6_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
YQYAWTDLLGPTLVWDQVARGVLWSLAYSLVLYAAAWWHFLRKDVLS*
>k141_482468_1 # 123 # 314 # -1 # ID=7_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.615
AQFALRWILMNEAVSVVIPGARNPEQAIANTQASELPALSVNQMEAANAIFDRLIRPHVHQRW*
>k141_1447399_1 # 3 # 317 # -1 # ID=8_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.711
RTRPGSSPRGWFGPHLEALWTYLHEHHHISYARLEAIGRDLWHLAVSQGALANALRRTATRLRPEAGAIREQVRASPTIGSDETSARVNGRTHWQWVFQTPTASY
>k141_1_1 # 2 # 364 # 1 # ID=9_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
SYMFVTGPDVVKTVTHETVTQEELGGAVTHTTRSGVADLAFENDVEALLQLRRFMDFLPSSNREKPPVRPTWDSPDREEASLDTLIPANPNKPYDMKELILKVVDEGDFFEIQPTYARNII
>k141_964935_1 # 2 # 235 # 1 # ID=10_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.726
LADLSRDPDLRAELQRAVDRVNDGAAHHARIRRFSVVPRPFSADAGEITPTLKLKRRVIEERFADEIEALYAPLVRS*
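For readers who find the one-liner dense, here is a spelled-out sketch of the same idea: print each header on its own line and append sequence lines without a newline until the next header. The function name `linearize_fasta` is made up for illustration, and the sketch assumes every record starts with a `>` header line. Unlike the one-liner, it prints nothing for empty input.

```shell
# Hypothetical helper: join multi-line sequences so each record is
# exactly two lines (header, sequence). Reads files given as
# arguments, or standard input when none are given.
linearize_fasta() {
    awk '
        /^>/ {
            if (NR > 1) printf "\n"   # finish the previous sequence line
            print                     # header goes on its own line
            next
        }
        { printf "%s", $0 }           # sequence line: append, no newline yet
        END { if (NR > 0) printf "\n" }   # finish the last sequence line
    ' "$@"
}
```

A possible use is `linearize_fasta protein_test.fa > linear.fa` (the output name is arbitrary). Once every record is exactly two lines, N sequences correspond to 2N lines, so something like `split -l 2000 linear.fa chunk_` should give 1000 sequences per output file.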
Please read the following for information on the main issues with the code you posted:
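Independently of any linked material, one such issue can be shown in isolation. In `[[ ${awk1}=~$prot_reg ]]` there are no spaces around `=~`, so bash sees a single non-empty word rather than a regex match, and that test is always true. A minimal demonstration, assuming bash (the shortened header string and the `no_space`/`with_space` variable names are made up for illustration):

```shell
#!/usr/bin/env bash
prot_reg='^[A-Z]{10,}'        # same regex as in the question
line='>k141_0_1 # 86 # 388'   # a header line, which should NOT match

# No spaces around =~ : the whole thing is one non-empty word, so the
# test succeeds regardless of what the line contains.
if [[ ${line}=~$prot_reg ]]; then no_space=match; else no_space=nomatch; fi

# Spaces around =~ : bash parses it as the regex-match operator.
if [[ $line =~ $prot_reg ]]; then with_space=match; else with_space=nomatch; fi

echo "no spaces:   $no_space"
echo "with spaces: $with_space"
```

Note also that the posted loop runs awk twice per input line, and each run re-reads the file from the start, which is likely to be very slow on large files even once the test is fixed.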
Upvotes: 2