pythonbeginner

Reputation: 31

How to specify the number of characters per line in a FASTA file

I have a fasta file that looks like this:

>abc
AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTATGAATCCAGTA
TGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGCCGCCTCGTGTTTCACGGCCT
CAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTCACTTCTACTCTTGGCAGCAGTGGCCAGCTC
ATATGCCGCTGCACAAAGGAAACTGCTGACACCGGTGACAGTGCTTACTGCGGTTGTCACTTGTGAGTAC

However, I need the file to have 60 characters per line. It should look like this:

>abc
AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTAT
GAATCCAGTATGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGC
CGCCTCGTGTTTCACGGCCTCAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTC
ACTTCTACTCTTGGCAGCAGTGGCCAGCTCATATGCCGCTGCACAAAGGAAACTGCTGAC

I tried fold -w 60 myfile.fasta > out.fa, but the output is not what I expected: each 70-character input line is wrapped on its own, leaving a short 10-character line after every full one. The output file looks like this:

>abc
AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTAT
GAATCCAGTA
TGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGCCGCCTCGTGT
TTCACGGCCT
CAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTCACTTCTACTCTTGGCAGCAG
TGGCCAGCTC
ATATGCCGCTGCACAAAGGAAACTGCTGACACCGGTGACAGTGCTTACTGCGGTTGTCAC
TTGTGAGTAC
ACACGCACCATTTACAATGCATGATGTTCGTGAGATTGATCTGTCTCTAACAGTTCACTT

Is there another way I can manipulate my fasta file to get it to the format I need?

Upvotes: 2

Views: 1282

Answers (4)

Supertech

Reputation: 770

I have listed two options below. Hope it helps.

seqkit seq input.fasta -w 60

Instructions for seqkit are here.

And it's very easy to install using conda:
https://anaconda.org/bioconda/seqkit

A second option is to use Python's textwrap module. You could first parse the FASTA file with Biopython, then use textwrap on the sequence string.

from Bio import SeqIO
import textwrap

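for seq_record in SeqIO.parse("input.fasta", "fasta"):
    dna = str(seq_record.seq)
    # Re-wrap the whole sequence at 60 characters per line
    fasta_record = textwrap.fill(dna, width=60)
    # Concatenate so that no space appears between ">" and the ID
    print(">" + seq_record.id)
    print(fasta_record)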
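
If Biopython is not installed, the wrapping step itself needs only the standard library. The following is a minimal sketch rather than tested production code: wrap_sequence is a hypothetical helper name, and it assumes the sequence lines of one record have already been separated from the header line.

```python
import textwrap

def wrap_sequence(seq_lines, width=60):
    """Join the sequence lines of one record and re-split them at `width` characters."""
    joined = "".join(line.strip() for line in seq_lines)
    # A sequence contains no spaces, so textwrap treats it as one long word
    # and breaks it every `width` characters (break_long_words defaults to True).
    return textwrap.wrap(joined, width=width)

if __name__ == "__main__":
    # Made-up input: an 80-character line followed by a 20-character line
    for line in wrap_sequence(["ACGT" * 20, "ACGT" * 5], width=60):
        print(line)
```

Joining the lines before wrapping is what avoids the short leftover lines that fold produces when it wraps each input line separately.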

Upvotes: 1

Timur Shtatland

Reputation: 12425

Do not reinvent the wheel. Use common bioinformatics tools, preferably open source ones. For example, you can use the seqtk tool like so, where N is the desired number of characters per line:

seqtk seq -l N infile > outfile

EXAMPLES:

$ printf ">seq1\nACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG\nACTG\n" | seqtk seq -l 60
>seq1
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
ACTGACTG

$ printf ">seq1\nACTG\nACTG\n" | seqtk seq -l 60
>seq1
ACTGACTG

To install these tools, use conda, specifically miniconda, for example:

conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate

REFERENCES:

seqtk: https://github.com/lh3/seqtk
conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

Upvotes: 2

choroba

Reputation: 242123

Perl to the rescue!

perl -lne 'sub out { print substr $buff, 0, 60, "" while length $buff }
           if (/^>/) { out(); print }
           else { $buff .= $_ }
           END { out() }
    ' file.fasta
  • -n reads the file line by line and runs the code for each line;
  • -l removes newlines from input and adds them to print;
  • We store the non-header lines in $buff;
  • When a header line (or end of file) arrives, we print the buffer 60 characters at a time.
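
For readers who prefer Python, the buffering steps listed above can be sketched roughly as follows. This is an illustrative translation under assumptions, not the Perl one-liner itself: reflow_fasta is a hypothetical name, and the input is assumed to be an iterable of text lines such as an open file.

```python
def reflow_fasta(lines, width=60):
    """Yield FASTA lines with every sequence re-wrapped to `width` characters."""
    buff = []  # sequence fragments of the record currently being read

    def flush():
        # Join the buffered fragments and emit them `width` characters at a time
        seq = "".join(buff)
        buff.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield from flush()  # emit the previous record's sequence first
            yield line          # then the new header, unchanged
        else:
            buff.append(line.strip())
    yield from flush()          # emit the sequence of the last record

if __name__ == "__main__":
    # Made-up demo input with two records
    demo = [">a\n", "A" * 70 + "\n", "C" * 70 + "\n", ">b\n", "GG\n"]
    for out in reflow_fasta(demo):
        print(out)
```

Because it yields lines as it goes, the same function can be fed an open file handle and will hold only one record's sequence in memory at a time.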

Upvotes: 0

Axeltherabbit

Reputation: 754

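With Python (this assumes the file contains a single record):

with open("yourfile", "r") as f:
    # Split the header line from the rest of the file
    header, seq = f.read().split("\n", 1)
    # Remove the existing line breaks from the sequence
    seq = seq.replace("\n", "")
    # Step through the sequence 60 characters at a time; using a step of 60
    # in range() keeps the final, shorter line instead of dropping it
    text = header + "\n" + "\n".join(seq[i:i + 60] for i in range(0, len(seq), 60))
    print(text)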

Upvotes: 0
