Reputation: 31
I have a fasta file that looks like this:
>abc
AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTATGAATCCAGTA
TGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGCCGCCTCGTGTTTCACGGCCT
CAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTCACTTCTACTCTTGGCAGCAGTGGCCAGCTC
ATATGCCGCTGCACAAAGGAAACTGCTGACACCGGTGACAGTGCTTACTGCGGTTGTCACTTGTGAGTAC
However, I need the file to have 60 characters per line. It should look like this:
>abc
AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTAT
GAATCCAGTATGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGC
CGCCTCGTGTTTCACGGCCTCAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTC
ACTTCTACTCTTGGCAGCAGTGGCCAGCTCATATGCCGCTGCACAAAGGAAACTGCTGAC
I tried to use fold -w 60 myfile.fasta > out.fa to change my file but the output is not what I expected. The output file looks like this:
>abc
AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTAT
GAATCCAGTA
TGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGCCGCCTCGTGT
TTCACGGCCT
CAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTCACTTCTACTCTTGGCAGCAG
TGGCCAGCTC
ATATGCCGCTGCACAAAGGAAACTGCTGACACCGGTGACAGTGCTTACTGCGGTTGTCAC
TTGTGAGTAC
ACACGCACCATTTACAATGCATGATGTTCGTGAGATTGATCTGTCTCTAACAGTTCACTT
Is there another way I can manipulate my fasta file to get it to the format I need?
Upvotes: 2
Views: 1282
Reputation: 770
I have listed two options below. Hope it helps.
seqkit seq input.fasta -w 60
Instructions for seqkit are here.
And it's very easy to install using conda:
https://anaconda.org/bioconda/seqkit
A second option is to use Python's textwrap module. You could first parse the FASTA file with Biopython, then use textwrap on the sequence string.
from Bio import SeqIO
import textwrap
# Re-wrap each record's sequence at 60 characters per line
for seq_record in SeqIO.parse("input.fasta", "fasta"):
    dna = str(seq_record.seq)
    fasta_record = textwrap.fill(dna, width=60)
    print(f">{seq_record.id}")  # no space between ">" and the ID
    print(fasta_record)
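For context, fold wraps each input line on its own, which is why 70-character lines come out as 60 + 10; the sequence lines have to be joined before wrapping. If Biopython is not available, the same textwrap idea can be sketched with the standard library alone. This is a minimal sketch, not a full FASTA parser: the function name rewrap_fasta is made up for illustration, and it assumes headers start with ">" and everything else is sequence.

```python
import textwrap


def rewrap_fasta(lines, width=60):
    """Yield FASTA lines with every sequence re-wrapped to `width` columns.

    `rewrap_fasta` is a hypothetical helper name, not part of any library.
    """
    seq_parts = []  # sequence lines collected for the current record

    def flush():
        # Join the collected pieces first, then wrap the whole sequence once.
        seq = "".join(seq_parts)
        seq_parts.clear()
        # break_on_hyphens=False keeps gap characters ("-") from causing
        # early line breaks in aligned sequences.
        return textwrap.wrap(seq, width=width, break_on_hyphens=False) if seq else []

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            yield from flush()  # finish the previous record
            yield line          # header lines pass through unchanged
        elif line:
            seq_parts.append(line)
    yield from flush()          # finish the last record
```

It could be used along the lines of `for line in rewrap_fasta(open("input.fasta")): print(line)`. For very long sequences plain slicing is likely faster than textwrap, so treat this as an illustration of the join-then-wrap step rather than a tuned tool.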
Upvotes: 1
Reputation: 12425
Do not reinvent the wheel. Use common bioinformatics tools, preferably open source tools. For example, you can use the seqtk tool like so:
seqtk seq -l N infile > outfile
EXAMPLES:
$ printf '>seq1\nACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG\nACTG\n' | seqtk seq -l 60
>seq1
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
ACTGACTG
$ printf '>seq1\nACTG\nACTG\n' | seqtk seq -l 60
>seq1
ACTGACTG
To install these tools, use conda, specifically miniconda, for example:
conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate
REFERENCES:
seqtk: https://github.com/lh3/seqtk
conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
Upvotes: 2
Reputation: 242123
Perl to the rescue!
perl -lne 'sub out { print substr $buff, 0, 60, "" while length $buff }
           if (/^>/) { out(); print }
           else { $buff .= $_ }
           END { out() }
          ' file.fasta
-n reads the file line by line and runs the code for each line;
-l removes newlines from input and adds them to print;
$buff accumulates the sequence lines of the current record until the next header or the end of the file.
Upvotes: 0
Reputation: 754
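With Python (this handles a file containing a single record):
with open("yourfile", "r") as f:
    # Split the header line from the rest of the file
    header, seq = f.read().split("\n", 1)
# Join the sequence lines into one string
seq = seq.replace("\n", "")
# Re-wrap at 60 characters, keeping the final partial line
text = header + "\n" + "\n".join(seq[i:i + 60] for i in range(0, len(seq), 60))
print(text)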
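One detail worth checking in any slicing approach is the last, shorter line. Stepping through the string with range(0, len(seq), 60) keeps it, whereas looping len(seq) // 60 times stops before it whenever the length is not a multiple of 60. A small sketch of the slicing step on its own (the helper name chunks is made up for illustration):

```python
def chunks(seq, width=60):
    """Split seq into consecutive pieces of at most `width` characters."""
    # Step through the string by `width`; the last slice may be shorter.
    return [seq[i:i + width] for i in range(0, len(seq), width)]


seq = "ACGT" * 35                      # 140 characters, not a multiple of 60
pieces = chunks(seq)
print([len(p) for p in pieces])        # [60, 60, 20]
print(len(seq) // 60)                  # 2, so a loop of that length misses the tail
```

Joining the pieces back together reproduces the original sequence, which is a quick sanity check that nothing was dropped.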
Upvotes: 0