Unable to remove line breaks in a text file in python

Question

At the risk of losing reputation, I am asking because I do not know what else to try. My file is not showing any hidden characters, and I have tried every .replace and .strip I can think of. The file is UTF-8 encoded and I am using Python 3.6.1. It has the following format:

 >header1
 AAAAAAAA
 TTTTTTTT
 CCCCCCCC
 GGGGGGGG

 >header2
 CCCCCC
 TTTTTT
 GGGGGG
 AAAAAA

I am trying to remove the line breaks at the end of each line so that the sequence under each header becomes one continuous string. (The real file is thousands of lines long.) My code is redundant in the sense that I typed in everything I could think of to remove line breaks:

 fref = open(ref)
 for line in fref:
     sequence = 0
     header = 0
     if line.startswith('>'):
          header = ''.join(line.splitlines())
          print(header)
     else:
          sequence = line.strip("\n").strip("\r")
          sequence = line.replace('\n', ' ').replace('\r', '').replace(' ', '').replace('\t', '')
          print(len(sequence))

output is:

 >header1
 8
 8
 8
 8
 >header2
 6
 6
 6
 6

But if I manually go in and delete the line endings to join the lines, the sequence is read as one continuous string.

Expected output:

 >header1
 32
 >header2
 24

Thanks in advance for any help, Dennis

cr3 · Accepted Answer

There are several approaches to parsing this kind of input. In all cases, I would recommend isolating the open and print side-effects outside of a function that you can unit test to convince yourself of the proper behavior.
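As a minimal sketch of that separation (the names `sequence_lengths` and `main` are hypothetical, not taken from the question), the parsing logic can accept any iterable of lines and return plain data, while a thin wrapper does the opening and printing. It also shows why the question's loop prints 8 four times: the length has to be accumulated across the lines of a record rather than computed for each line on its own.

```python
def sequence_lengths(lines):
    """Return [(header, total_length), ...] for header/sequence lines."""
    records = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank separator lines
        if text.startswith(">"):
            # A header opens a new record with a running length of 0.
            records.append([text, 0])
        elif records:
            # Sequence line: add its length to the most recent record.
            records[-1][1] += len(text)
    return [tuple(record) for record in records]


def main(path):
    # Side effects (file access, printing) live only in this wrapper.
    with open(path) as infile:
        for header, total in sequence_lengths(infile):
            print(header)
            print(total)
```

Because `sequence_lengths` only needs an iterable of strings, it can be tested with a plain list of lines and no file at all.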

You could iterate over each line and handle the case of empty lines and end-of-file separately. Here, I use yield statements to return the values:

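def parse(infile):
    # Defaults so that an empty file or a missing header does not raise.
    total = 0
    line = ""
    for line in infile:
        if line.startswith(">"):
            # A header starts a new record: reset the running length.
            total = 0
            yield line.strip()
        elif not line.strip():
            # A blank line ends the current record.
            yield total
        else:
            total += len(line.strip())
    # The last record has no trailing blank line, so emit it at end-of-file.
    if line.strip():
        yield total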

def test_parse():
    with open("input.txt") as infile:
        assert list(parse(infile)) == [
            ">header1",
            32,
            ">header2",
            24,
        ]

Or, you could handle both empty lines and end-of-file in the same branch. Here, I use an output list to which I append headers and totals:

def parse(infile):
    output = []
    while True:
        line = infile.readline()
        if line.startswith(">"):
            # New record: remember the header and reset the running length.
            total = 0
            header = line.strip()
        elif line.strip():
            total += len(line.strip())
        else:
            # Blank line, or "" from readline() at end-of-file: close the record.
            output.append(header)
            output.append(total)
            if not line:
                break

    return output

def test_parse():
    with open("input.txt") as infile:
        assert parse(infile) == [
            ">header1",
            32,
            ">header2",
            24,
        ]

Or, you could also split the whole input file into empty-line-separated blocks and parse them independently. Here, I use an output stream to which I write the output; in production, you could pass the sys.stdout stream for example:

import re
def parse(infile, outfile):
    content = infile.read()
    for block in re.split(r"\r?\n\r?\n", content):
        header, *lines = re.split(r"\s+", block)
        total = sum(len(line) for line in lines)
        outfile.write("{header}\n{total}\n".format(
            header=header,
            total=total,
        ))

from io import StringIO
def test_parse():
    with open("input.txt") as infile:
        outfile = StringIO()
        parse(infile, outfile)
        outfile.seek(0)
        assert outfile.readlines() == [
            ">header1\n",
            "32\n",
            ">header2\n",
            "24\n",
        ]

Note that my tests use open("input.txt") for brevity but I would actually recommend passing a StringIO(...) instance instead to see the input being tested more easily, to avoid hitting the filesystem and to make the tests faster.
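A minimal sketch of such a test, assuming the generator-based parse from the first approach (repeated here so the block runs on its own). The name `SAMPLE` is hypothetical, and its contents are the example input from the question:

```python
from io import StringIO

# The example input from the question, kept next to the assertion.
SAMPLE = """\
>header1
AAAAAAAA
TTTTTTTT
CCCCCCCC
GGGGGGGG

>header2
CCCCCC
TTTTTT
GGGGGG
AAAAAA
"""


def parse(infile):
    # Yield each header, then the total sequence length of its record.
    total = 0
    line = ""
    for line in infile:
        if line.startswith(">"):
            total = 0
            yield line.strip()
        elif not line.strip():
            yield total
        else:
            total += len(line.strip())
    if line.strip():
        yield total


def test_parse():
    # StringIO can be iterated line by line, just like an open file object.
    infile = StringIO(SAMPLE)
    assert list(parse(infile)) == [">header1", 32, ">header2", 24]
```

With the input inline, a failing assertion can be compared against the data directly, and the test no longer depends on a file existing on disk.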
