d_kennetz

Reputation: 5359

Unable to remove line breaks in a text file in python

At the risk of losing reputation, I did not know what else to do. My file is not showing any hidden characters, and I have tried every .replace and .strip I can think of. The file is UTF-8 encoded and I am using Python 3.6.1. I have a file with the format:

 >header1
 AAAAAAAA
 TTTTTTTT
 CCCCCCCC
 GGGGGGGG

 >header2
 CCCCCC
 TTTTTT
 GGGGGG
 AAAAAA

I am trying to remove the line breaks at the end of each line so that the sequence under each header becomes one continuous string. (The real file is thousands of lines long.) My code is redundant in the sense that I typed in everything I could think of to remove line breaks:

 fref = open(ref)
 for line in fref:
     sequence = 0
     header = 0
     if line.startswith('>'):
          header = ''.join(line.splitlines())
          print(header)
     else:
          sequence = line.strip("\n").strip("\r")
          sequence = line.replace('\n', ' ').replace('\r', '').replace(' ', '').replace('\t', '')
          print(len(sequence))

output is:

 >header1
 8
 8
 8
 8
 >header2
 6
 6
 6
 6

But if I manually go in and delete the line endings, the sequence is read as one continuous string.

Expected output:

 >header1
 32
 >header2
 24

Thanks in advance for any help, Dennis

Upvotes: 0

Views: 1053

Answers (2)

cr3

Reputation: 461

There are several approaches to parsing this kind of input. In all cases, I would recommend isolating the open and print side-effects outside of a function that you can unit test to convince yourself of the proper behavior.

You could iterate over each line and handle the case of empty lines and end-of-file separately. Here, I use yield statements to return the values:

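def parse(infile):
    header = None
    total = 0
    for line in infile:
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new record.
            header = line
            total = 0
            yield header
        elif line:
            total += len(line)
        elif header is not None:
            # An empty line ends the current record.
            yield total
            header = None
    # The last record ends at end-of-file rather than at an empty line.
    if header is not None:
        yield total

def test_parse():
    with open("input.txt") as infile:
        assert list(parse(infile)) == [
            ">header1",
            32,
            ">header2",
            24,
        ]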

Or, you could handle both empty lines and end-of-file at the same time. Here, I use an output array to which I append headers and totals:

def parse(infile):
    output = []
    header = None
    total = 0
    while True:
        line = infile.readline()  # returns "" at end-of-file
        if line.startswith(">"):
            total = 0
            header = line.strip()
        elif line.strip():
            total += len(line.strip())
        else:
            # Empty line or end-of-file: store the current record, if any.
            if header is not None:
                output.append(header)
                output.append(total)
                header = None
            if not line:
                break

    return output

def test_parse():
    with open("input.txt") as infile:
        assert parse(infile) == [
            ">header1",
            32,
            ">header2",
            24,
        ]

Or, you could also split the whole input file into empty-line-separated blocks and parse them independently. Here, I use an output stream to which I write the output; in production, you could pass the sys.stdout stream for example:

import re
from io import StringIO

def parse(infile, outfile):
    content = infile.read()
    # Records are separated by an empty line.
    for block in re.split(r"\r?\n\r?\n", content):
        if not block.strip():
            continue  # skip empty blocks, e.g. after a trailing empty line
        header, *lines = block.split()
        total = sum(len(line) for line in lines)
        outfile.write("{header}\n{total}\n".format(
            header=header,
            total=total,
        ))

def test_parse():
    with open("input.txt") as infile:
        outfile = StringIO()
        parse(infile, outfile)
        outfile.seek(0)
        assert outfile.readlines() == [
            ">header1\n",
            "32\n",
            ">header2\n",
            "24\n",
        ]

Note that my tests use open("input.txt") for brevity but I would actually recommend passing a StringIO(...) instance instead to see the input being tested more easily, to avoid hitting the filesystem and to make the tests faster.
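
As a rough illustration of that last point, here is a minimal sketch of a test that feeds the input through StringIO instead of a file. The names parse_totals and SAMPLE are made up for this example, and parse_totals is only a compact stand-in for the parse functions above: it assumes a record ends at the next header or at end-of-file, and ignores empty lines.

```python
from io import StringIO

# Hypothetical sample input, mirroring the format from the question.
SAMPLE = (
    ">header1\n"
    "AAAAAAAA\n"
    "TTTTTTTT\n"
    "CCCCCCCC\n"
    "GGGGGGGG\n"
    "\n"
    ">header2\n"
    "CCCCCC\n"
    "TTTTTT\n"
    "GGGGGG\n"
    "AAAAAA\n"
)

def parse_totals(infile):
    """Yield each header followed by the total length of its sequence lines."""
    header = None
    total = 0
    for line in infile:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header
                yield total
            header, total = line, 0
        elif line:
            total += len(line)
    if header is not None:
        # The last record is closed by end-of-file.
        yield header
        yield total

def test_parse_totals():
    # The input is visible right next to the assertion, and no file is read.
    assert list(parse_totals(StringIO(SAMPLE))) == [
        ">header1",
        32,
        ">header2",
        24,
    ]
```

Because StringIO behaves like an open text file, the same function can be called with a real file object in production without any changes.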

Upvotes: 2

PeterE

Reputation: 5855

From my understanding of your question, you would like something like the following. Note how the sequence is built up over multiple iterations of the loop, since you want to combine multiple lines.

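with open(ref) as f:
    sequence = ""  # sequence collected so far
    header = None
    for line in f:
        if line.startswith('>'):
            if header:
                print(header)         # print last header
                print(len(sequence))  # print length of last sequence
            sequence = ""             # reset sequence
            header = line.rstrip()    # store header without its line break
        else:
            sequence += line.strip()  # append line to sequence
    if header:
        # the last record is not followed by another header, so print it here
        print(header)
        print(len(sequence))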
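
If you need the joined sequences themselves rather than just their lengths, a variation on the same idea is to collect the pieces per header and join them at the end. This is only a sketch: read_sequences and example are names I made up for illustration, and it assumes headers are unique and that empty lines can be ignored.

```python
def read_sequences(lines):
    """Map each header to its sequence lines joined into one string."""
    pieces = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines between records
        if line.startswith(">"):
            header = line
            pieces[header] = []
        elif header is not None:
            pieces[header].append(line)
    # Join once per record instead of growing a string line by line.
    return {name: "".join(parts) for name, parts in pieces.items()}

# Hypothetical input; any iterable of lines works, including an open file.
example = [
    ">header1\n", "AAAAAAAA\n", "TTTTTTTT\n", "CCCCCCCC\n", "GGGGGGGG\n",
    "\n",
    ">header2\n", "CCCCCC\n", "TTTTTT\n", "GGGGGG\n", "AAAAAA\n",
]

for name, sequence in read_sequences(example).items():
    print(name)
    print(len(sequence))
# prints:
# >header1
# 32
# >header2
# 24
```

The output order relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 and is how CPython 3.6 behaves in practice.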

Upvotes: 1
