J0hn

Reputation: 570

How can I format a txt file in Python so that extra blank lines between paragraphs are removed, as well as extra spaces?

I'm trying to format a file similar to this: (random.txt)

        Hi,    im trying   to format  a new txt document so
that extra     spaces between    words   and paragraphs   are only 1.



   This should make     this txt document look like:

This is how it should look below: (randomoutput.txt)

Hi, I'm trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.

This should make this txt document look like:

So far the code I've managed to make has only removed the spaces, but I'm having trouble making it recognize where a new paragraph starts so that it doesn't remove the blank lines between paragraphs. This is what I have so far.

def removespaces(input, output):
    ivar = open(input, 'r')
    ovar = open(output, 'w')
    n = ivar.read()
    ovar.write(' '.join(n.split()))
    ivar.close()
    ovar.close()

Edit:

I've also found a way to create spaces between paragraphs, but right now it just takes every line break and creates a space between the old line and new line using:

m = ivar.readlines()
m[:] = [i for i in m if i != '\n']
ovar.write('\n'.join(m))

Upvotes: 4

Views: 2621

Answers (6)

m.wasowski

Reputation: 6386

Basically, you want to keep the lines that are non-empty (for an empty line, line.strip() returns an empty string, which is False in a boolean context). You can do this with a list or generator comprehension over the result of str.splitlines(), with an if clause to filter out empty lines.

Then, for each line, you want to ensure that all words are separated by a single space. For this you can use ' '.join() on the result of str.split().

So this should do the job for you:

compressed = '\n'.join(
    ' '.join(line.split()) for line in txt.splitlines() 
        if line.strip() 
    )

Or you can use filter and map with a helper function, which may be more readable:

def squash_line(line):
    return ' '.join(line.split())

non_empty_lines = filter(str.strip, txt.splitlines())
compressed = '\n'.join(map(squash_line, non_empty_lines))
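
As a quick check, here is a minimal sketch that wraps the comprehension above in a function and runs it on a sample modelled on the question's input. The function name `compress` and the sample string are assumptions made for illustration. Note that this approach drops blank lines entirely, so separate paragraphs end up on adjacent lines.

```python
def compress(txt):
    """Collapse runs of whitespace in each line and drop blank lines."""
    return '\n'.join(
        ' '.join(line.split())
        for line in txt.splitlines()
        if line.strip()  # keep only lines with some non-whitespace text
    )


# Hypothetical sample, loosely based on the question's random.txt
sample = "        Hi,    im trying   to format\nthat extra     spaces\n\n\n\n   This should make\n"
print(compress(sample))
```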

Upvotes: 0

pradyunsg

Reputation: 19466

First, let's see what exactly the problem is: you cannot have more than 1 consecutive space or more than 2 consecutive newlines.

You know how to handle runs of spaces. That approach won't work for newlines, as there are 3 possible situations:

  - 1 newline
  - 2 newlines
  - more than 2 newlines

Great, so how do you solve this? There are many solutions. I'll list 3 of them.

  1. Regex based. This problem is very easy to solve iff[1] you know how to use regex. So, here's the code:

    import re
    s = re.sub(r'\n{2,}', r'\n\n', in_file.read())

    If you have memory constraints, this is not the best way, as we read the entire file into memory.

  2. While loop based. This code is really self-explanatory, but I wrote this line anyway...

    s = in_file.read()
    while "\n\n\n" in s:
        s = s.replace("\n\n\n", "\n\n")

    Again, if you have memory constraints, note that we still read the entire file into memory.

  3. State based. Another way to approach this problem is line-by-line. By keeping track whether the last line we encountered was blank, we can decide what to do.

    was_last_line_blank = False
    for line in in_file:
        # Drop the trailing newline; use line.strip() instead if you
        # consider lines with only spaces blank
        line = line.rstrip("\n")

        if not line:
            was_last_line_blank = True
            continue
        if was_last_line_blank:
            # Add a single blank line between paragraphs
            out_file.write("\n")
        # Write contents of `line` in file, followed by its newline
        out_file.write(line + "\n")

        was_last_line_blank = False
    

Now, 2 of them need you to load the entire file into memory, and the other one is somewhat more complicated. My point is: all of these work, but since there is a small difference in how they work, what they need from the system varies.
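
To compare the three approaches side by side, here is a minimal sketch that wraps each one in a function and checks that they agree on one sample. The function names and the sample text are hypothetical. The line-based variant strips the trailing newline from each line before testing for blankness, which is an assumption about how that loop is meant to behave.

```python
import io
import re


def squeeze_regex(text):
    # Replace every run of 2 or more newlines with exactly 2
    return re.sub(r'\n{2,}', '\n\n', text)


def squeeze_loop(text):
    # Keep shrinking triple newlines until none remain
    while "\n\n\n" in text:
        text = text.replace("\n\n\n", "\n\n")
    return text


def squeeze_stream(in_file, out_file):
    # Line by line: remember whether blank lines were just skipped
    was_last_line_blank = False
    for line in in_file:
        line = line.rstrip("\n")
        if not line:
            was_last_line_blank = True
            continue
        if was_last_line_blank:
            out_file.write("\n")  # one blank line between paragraphs
        out_file.write(line + "\n")
        was_last_line_blank = False


sample = "first\nstill first\n\n\n\nsecond\n"
out = io.StringIO()
squeeze_stream(io.StringIO(sample), out)
print(squeeze_regex(sample) == squeeze_loop(sample) == out.getvalue())
```

The three can differ on edge cases such as blank lines at the very start of the file, so the agreement shown here is for this sample only.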

[1] The "iff" is intentional.

Upvotes: 0

5gon12eder

Reputation: 25459

You should process the input line by line. Not only will this make your program simpler, it will also be easier on the system's memory.

The logic for normalizing horizontal white space in a line stays the same (split words and join with a single space).

What you'll need to do for the paragraphs is test whether line.strip() is empty (just use it as a boolean expression) and keep a flag recording whether the previous line was empty too. You simply throw away the empty lines, but if you encounter a non-empty line and the flag is set, print a single empty line before it.

with open('input.txt', 'r') as istr:
    new_par = False
    for line in istr:
        line = line.strip()
        if not line:  # blank
            new_par = True
            continue
        if new_par:
            print()  # print a single blank line
        print(' '.join(line.split()))
        new_par = False

If you want to suppress blank lines at the top of the file, you'll need an extra flag that you set only after encountering the first non-blank line.
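
A minimal sketch of that extra flag, written as a generator so it can be tested without a file. The names `normalize` and `seen_text` and the sample list are assumptions made for illustration.

```python
def normalize(lines):
    """Yield cleaned lines: single spaces, one blank line between paragraphs."""
    new_par = False
    seen_text = False  # extra flag: set after the first non-blank line
    for line in lines:
        line = line.strip()
        if not line:  # blank
            new_par = True
            continue
        if new_par and seen_text:
            yield ''  # a single blank line between paragraphs
        yield ' '.join(line.split())
        new_par = False
        seen_text = True


# Hypothetical input: a leading blank line, then two paragraphs
sample = ["\n", "   Hi,    there\n", "same   paragraph\n", "\n", "  \n", "next   one\n"]
print('\n'.join(normalize(sample)))
```

Passing an open file object instead of the list works the same way, since both yield one line per iteration.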

If you want to get fancier, have a look at the textwrap module, but be aware that it has (or at least used to have, as far as I can tell) some bad worst-case performance issues.

Upvotes: 2

abarnert

Reputation: 366003

The trick here is that you want to turn any sequence of 2 or more \n into exactly 2 \n characters. This is hard to write with just split and join, but it's dead simple to write with re.sub:

n = re.sub(r'\n\n+', r'\n\n', n)

If you want lines with nothing but spaces to be treated as blank lines, do this after stripping spaces; if you want them to be treated as non-blank, do it before.

You probably also want to change your space-stripping code to use split(' ') rather than just split(), so it doesn't screw up newlines. Note that split(' ') produces empty strings wherever spaces are consecutive, so filter those out before joining. (You could also use re.sub for that as well, but it isn't really necessary, because turning 1 or more spaces into exactly 1 isn't hard to write with split and join.)
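
Here is a minimal sketch of the all-regex variant mentioned in passing, assuming that lines with nothing but spaces should count as blank, so spaces are handled before newlines. The function name `normalize_text` and the sample are hypothetical.

```python
import re


def normalize_text(n):
    # Collapse runs of spaces/tabs first, so space-only lines shrink to one space
    n = re.sub(r'[ \t]+', ' ', n)
    # Remove the single space left at the start or end of any line
    n = re.sub(r'^ | $', '', n, flags=re.MULTILINE)
    # Turn any run of 2 or more newlines into exactly 2
    return re.sub(r'\n\n+', '\n\n', n)


sample = "   Hi,    there\nsame   paragraph\n\n  \n\nnext   one\n"
print(normalize_text(sample))
```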


Alternatively, you could just go line by line, and keep track of the last line (either with an explicit variable inside the loop, or by writing a simple adjacent_pairs iterator, like i1, i2 = tee(ivar); next(i2); return zip_longest(i1, i2, fillvalue='')) and if the current line and the previous line are both blank, don't write the current line.
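
The iterator idea can be sketched as follows. This is a variation, not exactly what is described above: it pairs each line with the line after it rather than the one before, which gives the same result for runs of blank lines. The helper `drop_repeated_blanks` and the sample are assumptions made for illustration.

```python
from itertools import tee, zip_longest


def adjacent_pairs(iterable):
    """Return (item, following_item) pairs; the last item is paired with ''."""
    i1, i2 = tee(iterable)
    next(i2, None)  # advance the second iterator by one
    return zip_longest(i1, i2, fillvalue='')


def drop_repeated_blanks(lines):
    """Skip a blank line whenever the line after it is blank as well."""
    return [
        line for line, following in adjacent_pairs(lines)
        if line.strip() or following.strip()
    ]


sample = ["a\n", "\n", "\n", "\n", "b\n", "\n"]
print(''.join(drop_repeated_blanks(sample)))
```

Because the fill value is an empty string, a blank line at the very end of the input is treated as followed by a blank and is dropped.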

Upvotes: 1

jftuga

Reputation: 1963

To fix the paragraph issue:

import re

with open("data.txt") as f:
    data = f.read()

# Collapse runs of two or more newlines into a single blank line
result = re.sub(r"\n{2,}", "\n\n", data)
print(result)

Upvotes: -1

sweber

Reputation: 2996

split without an argument will cut your string at each occurrence of whitespace (space, tab, newline, ...). Write n.split(" ") and it will only split at spaces. Instead of writing the output to a file, put it into a new variable, and repeat the step again, this time with

m.split("\n")
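
One way to read this two-step idea is sketched below. The function name `two_pass_clean`, the filtering of empty strings, and the per-line strip() are assumptions added to make the steps produce the output the question asks for.

```python
def two_pass_clean(n):
    # Pass 1: split only at spaces, drop the empty strings produced by
    # consecutive spaces, then rejoin with single spaces
    m = ' '.join(word for word in n.split(' ') if word)
    # Pass 2: split only at newlines and tidy each line
    lines = [line.strip() for line in m.split('\n')]
    # Keep a blank line only if the line before it was not blank
    kept = [line for i, line in enumerate(lines)
            if line or (i > 0 and lines[i - 1])]
    return '\n'.join(kept)


sample = "Hi,    there\nsame   line\n\n\n   next   one"
print(two_pass_clean(sample))
```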

Upvotes: 0
