Reputation: 570
I'm trying to format a file similar to this: (random.txt)
Hi, im trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
This is how it should look below: (randomoutput.txt)
Hi, I'm trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
So far the code I've managed to make has only removed the spaces, but I'm having trouble making it recognize where a new paragraph starts so that it doesn't remove the blank lines between paragraphs. This is what I have so far.
def removespaces(input, output):
    ivar = open(input, 'r')
    ovar = open(output, 'w')
    n = ivar.read()
    ovar.write(' '.join(n.split()))
    ivar.close()
    ovar.close()
Edit:
I've also found a way to create spaces between paragraphs, but right now it just takes every line break and creates a space between the old line and new line using:
m = ivar.readlines()
m[:] = [i for i in m if i != '\n']
ovar.write('\n'.join(m))
Upvotes: 4
Views: 2621
Reputation: 6386
Basically, you want to keep only the lines that are non-empty (for a blank line, line.strip() returns an empty string, which is False in a boolean context). You can do this with a list or generator comprehension over the result of str.splitlines(), using an if clause to filter out the empty lines.
Then, for each line, you want to ensure that all words are separated by a single space. For this you can use ' '.join() on the result of str.split().
So this should do the job for you:
compressed = '\n'.join(
    ' '.join(line.split()) for line in txt.splitlines()
    if line.strip()
)
Or you can use filter and map with a helper function, which may be more readable:
def squash_line(line):
    return ' '.join(line.split())
non_empty_lines = filter(str.strip, txt.splitlines())
compressed = '\n'.join(map(squash_line, non_empty_lines))
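As a quick check of the comprehension above, here is a minimal sketch that wraps it in a function and runs it on a made-up string. The name compress and the sample text are assumptions for illustration, not part of the original answer.

```python
def compress(txt):
    """Drop blank lines and squeeze runs of whitespace inside each line."""
    return '\n'.join(
        ' '.join(line.split())      # single spaces between words
        for line in txt.splitlines()
        if line.strip()             # skip lines that are empty or only whitespace
    )

# Hypothetical sample: extra spaces plus several blank lines.
sample = "Hi,   there\n\n\n  second    line  \n"
print(repr(compress(sample)))
```

Note that this removes every blank line, so if a single blank line between paragraphs should survive, the blank lines need separate handling.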
Upvotes: 0
Reputation: 19466
First, let's see what exactly the problem is: you cannot have more than 1 consecutive space or more than 2 consecutive newlines.
You know how to handle runs of spaces. That approach won't work on newlines, as there are 3 possible situations: 1 newline, 2 newlines, or more than 2 newlines.
Great, so how do you solve this? There are many solutions. I'll list 3 of them.
Regex based. This problem is very easy to solve iff [1] you know how to use regex. So, here's the code:
import re
s = re.sub(r'\n{2,}', r'\n\n', in_file.read())
If you have memory constraints, this is not the best way, as we read the entire file into memory.
While loop based. This code is really self-explanatory, but I wrote this line anyway...
s = in_file.read()
while "\n\n\n" in s:
    s = s.replace("\n\n\n", "\n\n")
Again, if you have memory constraints, note that this still reads the entire file into memory.
State based. Another way to approach this problem is line by line. By keeping track of whether the last line we encountered was blank, we can decide what to do.
was_last_line_blank = False
for line in in_file:
    # Lines read from a file keep their trailing "\n", so test the stripped text.
    # This also treats lines containing only spaces as blank.
    if not line.strip():
        was_last_line_blank = True
        continue
    if was_last_line_blank:
        # Emit a single blank line in place of the run of blank lines
        out_file.write("\n")
    # Write contents of `line` (including its own newline) in file
    out_file.write(line)
    was_last_line_blank = False
Now, 2 of them need you to load the entire file into memory, and the other one is somewhat more complicated. My point is: all of these work, but since there is a small difference in how they work, what they need from the system varies.
[1] The "iff" is intentional.
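To make the comparison concrete, here is a small sketch that puts the two whole-file approaches side by side and checks that they agree on a made-up string. The function names collapse_regex and collapse_loop and the sample are assumptions for illustration.

```python
import re

def collapse_regex(s):
    """Replace every run of 2 or more newlines with exactly 2."""
    return re.sub(r'\n{2,}', '\n\n', s)

def collapse_loop(s):
    """Shrink runs of newlines one step at a time until none exceeds 2."""
    while "\n\n\n" in s:
        s = s.replace("\n\n\n", "\n\n")
    return s

# Hypothetical sample: one single newline, one run of four, one run of two.
sample = "a\nb\n\n\n\nc\n\nd"
print(repr(collapse_regex(sample)))
print(collapse_regex(sample) == collapse_loop(sample))  # prints True
```

Both versions hold the whole text in memory; the loop may need several passes over long runs of newlines, while the regex handles each run in one pass.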
Upvotes: 0
Reputation: 25459
You should process the input line by line. Not only will this make your program simpler, it will also be easier on the system's memory.
The logic for normalizing horizontal white space in a line stays the same (split words and join with a single space).
What you'll need to do for the paragraphs is test whether line.strip() is empty (just use it as a boolean expression) and keep a flag recording whether the previous line was empty too. You simply throw away the empty lines, but if you encounter a non-empty line and the flag is set, print a single empty line before it.
with open('input.txt', 'r') as istr:
    new_par = False
    for line in istr:
        line = line.strip()
        if not line:  # blank
            new_par = True
            continue
        if new_par:
            print()  # print a single blank line
        print(' '.join(line.split()))
        new_par = False
If you want to suppress blank lines at the top of the file, you'll need an extra flag that you set only after encountering the first non-blank line.
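One way the extra flag could look is sketched below. It is written as a generator over any iterable of lines so it can be tried without a file; the names normalize_lines and seen_text and the sample data are assumptions for illustration.

```python
def normalize_lines(lines):
    """Yield lines with single spaces, one blank line between paragraphs,
    and no blank lines before the first paragraph."""
    new_par = False     # a blank line was seen since the last text line
    seen_text = False   # the extra flag: set after the first non-blank line
    for line in lines:
        words = line.split()
        if not words:               # blank (or whitespace-only) line
            new_par = True
            continue
        if new_par and seen_text:   # paragraph break, but not at the top
            yield ''
        yield ' '.join(words)
        new_par = False
        seen_text = True

# Hypothetical input with leading, repeated, and trailing blank lines.
sample = ["\n", "  \n", "Hi,   there\n", "\n", "\n", "next   paragraph\n", "\n"]
print('\n'.join(normalize_lines(sample)))
```

Because a blank line is only emitted when a later non-blank line arrives, trailing blank lines are dropped as a side effect.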
If you want to get fancier, have a look at the textwrap module, but be aware that it has (or, at least, used to have, from what I can say) some bad worst-case performance issues.
Upvotes: 2
Reputation: 366003
The trick here is that you want to turn any sequence of 2 or more \n characters into exactly 2 \n characters. This is hard to write with just split and join, but it's dead simple to write with re.sub:
n = re.sub(r'\n\n+', r'\n\n', n)
If you want lines with nothing but spaces to be treated as blank lines, do this after stripping spaces; if you want them to be treated as non-blank, do it before.
You probably also want to change your space-stripping code to use split(' ') rather than just split(), so it doesn't screw up newlines. Note that split(' ') returns empty strings for consecutive spaces, so drop those before joining. (You could also use re.sub for that as well, but it isn't really necessary, because turning 1 or more spaces into exactly 1 isn't hard to write with split and join.)
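Putting the two steps together might look like the sketch below. It squeezes spaces first, so lines containing only spaces count as blank, and then collapses the newlines. The function name normalize and the sample string are assumptions for illustration, and tabs are left untouched.

```python
import re

def normalize(text):
    """Squeeze spaces within each line, then collapse runs of blank lines."""
    # split(' ') yields '' for each extra space, so filter those out before
    # joining; this also removes leading and trailing spaces on a line.
    lines = [' '.join(word for word in line.split(' ') if word)
             for line in text.split('\n')]
    text = '\n'.join(lines)
    # Any run of 2 or more newlines becomes exactly 2 (one blank line).
    return re.sub(r'\n\n+', r'\n\n', text)

# Hypothetical sample with a spaces-only line inside the paragraph gap.
sample = "Hi,   there\n   \n\n\nnext   one"
print(repr(normalize(sample)))
```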
Alternatively, you could just go line by line, and keep track of the last line (either with an explicit variable inside the loop, or by writing a simple adjacent_pairs iterator, like i1, i2 = tee(ivar); next(i2); return zip_longest(i1, i2, fillvalue='')) and if the current line and the previous line are both blank, don't write the current line.
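The pairwise idea could be fleshed out roughly as follows. This sketch pairs each line with the one after it rather than the one before, which gives the same effect of keeping one blank line per run; the function drop_repeated_blank_lines and the sample list are assumptions for illustration.

```python
from itertools import tee, zip_longest

def adjacent_pairs(iterable):
    """Return (current, next) pairs; the last item is paired with ''."""
    i1, i2 = tee(iterable)
    next(i2, None)  # advance the second copy by one; tolerates empty input
    return zip_longest(i1, i2, fillvalue='')

def drop_repeated_blank_lines(lines):
    """Yield lines, skipping a blank line when the following one is blank too."""
    for current, following in adjacent_pairs(lines):
        if not current.strip() and not following.strip():
            continue
        yield current

# Hypothetical input: a run of three blank lines and a trailing blank line.
sample = ["a\n", "\n", "\n", "\n", "b\n", "\n"]
print(list(drop_repeated_blank_lines(sample)))
```

Since the fill value '' counts as blank, a blank line at the very end of the input is dropped as well.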
Upvotes: 1
Reputation: 1963
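To fix the paragraph issue:
import re

with open("data.txt") as f:
    data = f.read()
# Collapse runs of 2 or more newlines into one blank line; single line breaks are kept.
result = re.sub(r"\n{2,}", "\n\n", data)
print(result)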
Upvotes: -1
Reputation: 2996
split without an argument will cut your string at each occurrence of whitespace (space, tab, newline, ...). Write n.split(" ") and it will only split at spaces. Instead of writing the output to a file, put it into a new variable, and repeat the step again, this time with m.split("\n").
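A rough sketch of this two-step idea is shown below. The variable names n and m follow the answer; the function name two_step_normalize, the filtering of empty pieces, and the sample string are assumptions added for illustration.

```python
def two_step_normalize(n):
    """Squeeze spaces with split(" "), then squeeze blank lines with split("\\n")."""
    # Step 1: split only at spaces; empty pieces come from repeated spaces.
    m = " ".join(piece for piece in n.split(" ") if piece)
    # Step 2: split at newlines and keep at most one empty line in a row.
    out = []
    for line in m.split("\n"):
        line = line.strip(" ")          # a space left next to a newline
        if line or (out and out[-1]):   # text, or first blank after text
            out.append(line)
    return "\n".join(out)

# Hypothetical sample with extra spaces and extra blank lines.
sample = "Hi,   there\n\n\n  next   one\n"
print(repr(two_step_normalize(sample)))
```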
Upvotes: 0