Tyler Wenzel

Reputation: 11

Replacing only some line endings with a tab delimiter

I currently have a FASTA file with several DNA sequences in it.

The lines alternate between a descriptor matching “>\w{4}\d{6}” and a DNA sequence: a line of 300+ seemingly random uppercase letters.

I am trying to make each sequence tab delimited, so that each descriptor and sequence is on a single line, separated by a tab. The following is what I have tried:

from __future__ import print_function
import re
import sys

Fasta_seq = open(sys.argv[1])
for a_line in Fasta_seq:
  if re.search('^>.+', a_line):
     re.sub('.+\n', '.+\t', a_line)
     print(a_line, end='')
  else:
    re.sub('.+', '.+', a_line)
    print(a_line, end='\n')

However, this code does not seem to delete the line ending at the end of my descriptor. It simply returns to me the exact same output.

Does anyone have an idea of what I am overlooking?

Upvotes: 1

Views: 41

Answers (1)

ODiogoSilva

Reputation: 2414

I'm not sure whether you are dealing with non-interleaved FASTA (one line per sequence) or interleaved FASTA (sequences wrapped over several lines), but this task can easily be done without regular expressions (also, use 4-space indents). Try the following:

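import sys

seq = ""
header = None

# "with" closes both files even if an error occurs
with open(sys.argv[1]) as Fasta_seq, open("outfile.txt", "w") as output_file:

    for a_line in Fasta_seq:

        if a_line.startswith(">"):

            # Do this only when a sequence has been populated
            if seq:
                output_file.write("{}\t{}\n".format(header, seq))

            header = a_line.strip()
            seq = ""

        else:
            seq += a_line.strip()

    # The loop only writes a record when the next header appears,
    # so the last record has to be written here
    if seq:
        output_file.write("{}\t{}\n".format(header, seq))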

This should work for both non-interleaved and interleaved FASTA input, since sequence lines are accumulated until the next header is reached.
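As for what the original attempt overlooks: re.sub returns a new string and leaves a_line untouched, so the result has to be assigned to something. In addition, '.+' in the replacement is inserted literally, not as "whatever matched"; a capture group is needed for that. If you do want to stay with regular expressions, a minimal sketch could look like this, assuming every sequence sits on a single line (join_fasta_lines is just a name made up for illustration):

```python
import re

def join_fasta_lines(lines):
    """Join each header line to the sequence line that follows it with a tab."""
    out = []
    for a_line in lines:
        if a_line.startswith(">"):
            # re.sub returns a new string, so the result must be kept.
            # \1 carries the matched header text into the replacement.
            a_line = re.sub(r"^(>.*)\n$", r"\1\t", a_line)
        out.append(a_line)
    return "".join(out)

# Made-up sample records, only to show the call
sample = [">ABCD123456\n", "ACGTACGT\n", ">EFGH654321\n", "TTGGCCAA\n"]
print(join_fasta_lines(sample), end="")
```

Note that this only swaps the header's newline for a tab, so it will not merge sequences that are wrapped over several lines; the loop above handles that case.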

Upvotes: 1
