Reputation: 101
I have a file like the following:
SCN DD1251
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1271 C DD1271 R
DD1351 D DD1351 B
E
SCN DD1271
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1301 T DD1301 A
DD1251 R DD1251 C
SCN DD1301
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1271 A DD1271 T
B
C
D
SCN DD1351
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
A DD1251 D
DD1251 B
C
SCN DD1451
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
A
B
C
SCN DD1601
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
A
B
C
D
SCN GA0101
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
B GC4251 D
GC420A C GA127A S
GA127A T
SCN GA0151
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
C GA0401 R G
GA0201 D GC0051 E H
GA0401 B GA0201 W
GC0051 A
Where the gap between each record has a newline character followed by 81 spaces.
I have created the following regular expression using regex101.com, which seems to match the gaps between records:
\s{81}\n
Combined with the short loop below to open the file and then write each section to a new file:
delimiter_pattern = re.compile(r"\s{81}\n")
with open("Junctions.txt", "r") as f:
    i = 1
    for line in f:
        if delimiter_pattern.match(line) == False:
            output = open('%d.txt' % i,'w')
            output.write(line)
        else:
            i+=1
However, instead of outputting, say 2.txt as expected below:
SCN DD1271
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1301 T DD1301 A
DD1251 R DD1251 C
It instead seems to return nothing at all. I have tried modifying the code like so:
with open("Clean-Junction-Links1.txt", "r") as f:
    i = 1
    output = open('%d.txt' % i,'w')
    for line in f:
        if delimiter_pattern.match(line) == False:
            output.write(line)
        else:
            i+=1
But this instead returns several hundred blank text files.
What is the issue with my code, and how could I modify it to make it work? Failing that, is there a simpler way to split the file on the blank lines without using regex?
Upvotes: 0
Views: 1042
Reputation: 123541
You don't need a regex to do this, because you can easily detect the gap between blocks with the string strip() method.
input_file = 'Clean-Junction-Links1.txt'
with open(input_file, 'r') as file:
    i = 0
    output = None
    for line in file:
        if not line.strip():  # Blank line?
            if output:
                output.close()
            output = None
        else:
            if output is None:
                i += 1
                print(f'Creating file "{i}.txt"')
                output = open(f'{i}.txt','w')
            output.write(line)
    if output:
        output.close()
print('-fini-')
Another, cleaner and more modular way to implement it is to divide the processing into two independent tasks that logically have very little to do with each other: collecting the lines of each record, and writing each record to its own file.
The first can be implemented as a generator function that iteratively collects and yields the groups of lines comprising a record; it's the one named extract_records() below. The second is then just the short loop that follows it.
input_file = 'Clean-Junction-Links1.txt'
def extract_records(filename):
    with open(filename, 'r') as file:
        lines = []
        for line in file:
            if line.strip():  # Not blank?
                lines.append(line)
            else:
                yield lines
                lines = []
        if lines:
            yield lines
for i, record in enumerate(extract_records(input_file), start=1):
    print(f'Creating file {i}.txt')
    with open(f'{i}.txt', 'w') as output:
        output.write(''.join(record))
print('-fini-')
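As a possible variation on the grouping step, the same idea can be sketched with itertools.groupby, which splits a stream of lines into runs of blank and non-blank lines. The name iter_records and the in-memory sample are made up for illustration; it takes any iterable of lines rather than a filename, and it skips runs of several blank lines without yielding empty records.

```python
import io
from itertools import groupby


def iter_records(lines):
    """Yield each run of consecutive non-blank lines as one string."""
    # groupby() splits the stream into runs that share the same key;
    # the key is True for blank (whitespace-only) lines.
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield ''.join(group)


# Hypothetical sample: two records separated by a gap line of 80 spaces.
sample = io.StringIO(
    'SCN DD1251\nDD1271 C DD1271 R\n'
    + ' ' * 80 + '\n'
    + 'SCN DD1271\nDD1301 T DD1301 A\n'
)
records = list(iter_records(sample))
```

An open file object can be passed in place of the StringIO, since both yield lines with their trailing newline.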
Upvotes: 2
Reputation: 2145
\s matches newlines as well as spaces, so 80 spaces plus one newline add up to the {81}. You can't see a second newline when iterating line by line with for line in f, unless you add extra logic to account for that. Also, match() returns None, not False, when there is no match.
#! /usr/bin/env python3
import re
delimiter_pattern = re.compile(r'\s{81}')
with open('Junctions.txt', 'r') as f:
    i = 1
    for line in f:
        if delimiter_pattern.match(line) is None:
            # Append, so earlier lines of the same record are kept.
            with open(f'{i}.txt', 'a') as output:
                output.write(line)
        else:
            i += 1
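The points above can be checked in isolation with a small sketch. It assumes a gap line of exactly 80 spaces, and uses an in-memory io.StringIO as a stand-in for the real file.

```python
import io
import re

delimiter_pattern = re.compile(r'\s{81}')

# Hypothetical in-memory stand-in for the real file.
sample = io.StringIO('SCN DD1251\n' + ' ' * 80 + '\n' + 'SCN DD1271\n')
lines = list(sample)

# Iterating a file keeps the trailing newline on each line.
keeps_newline = lines[1].endswith('\n')

# 80 spaces plus that newline satisfy \s{81}.
gap_result = delimiter_pattern.match(lines[1])

# A data line does not match, and the result is None rather than False.
data_result = delimiter_pattern.match(lines[0])
never_true = (data_result == False)
```

Because None == False evaluates to False, a test written as match(...) == False never succeeds, which is why the original loop never wrote anything.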
Upvotes: 1
Reputation: 202
A few things.
Only a single text file is produced because you open one file before the loop begins instead of opening a new file inside the loop.
Based on your desired output, you do not want to match the regular expression against each line; rather, you want to keep reading the file until you have a complete record.
I have put together a working solution:
with open("Junctions.txt", "r") as f:
    #read file and split on 80 spaces followed by new line
    file = f.read()
    sep = " " * 80 + "\n"
    chunks = file.split(sep)
    #for each chunk of the file write to a txt file
    i = 0
    for chunk in chunks:
        with open('%d.txt' % i, 'w') as outFile:
            outFile.write(chunk)
        i += 1
This reads the whole file and builds a list of all the groups you want by splitting on the one separator (80 spaces followed by a newline).
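If the exact number of spaces in the gap lines is uncertain (the question mentions 81, while this answer assumes 80), a regular-expression split can tolerate either. The following is only a sketch: the names GAP and split_records are made up for illustration, and it assumes a gap line contains nothing but spaces or tabs.

```python
import re

# One or more consecutive lines holding only spaces or tabs.
GAP = re.compile(r'(?:^[ \t]*\n)+', re.MULTILINE)


def split_records(text):
    """Return the non-empty chunks of text found between gap lines."""
    return [chunk for chunk in GAP.split(text) if chunk.strip()]


# Hypothetical sample with gap lines of two different widths.
sample = (
    'SCN DD1251\nDD1271 C DD1271 R\n'
    + ' ' * 80 + '\n'
    + 'SCN DD1271\nDD1301 T DD1301 A\n'
    + ' ' * 81 + '\n'
)
chunks = split_records(sample)
```

Filtering out whitespace-only chunks also avoids writing an empty file when the input ends with a gap line.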
Upvotes: 1
Reputation: 6063
You are getting blank output because delimiter_pattern.match(line) returns either a match object or None, never False, so the == False test never succeeds and output.write(line) is never reached. You need to instead write each line as it is read, and then switch to a new file when your pattern matches.
Also, for line in f yields each line with its trailing \n still attached, and \s matches that newline too, so \s{81} without the extra \n is enough to recognize a gap line of 80 spaces.
import re
delimiter_pattern = re.compile(r"\s{81}")
with open("Junctions.txt", "r") as f:
    fileNum = 1
    output = open(f'{fileNum}.txt','w') # f-strings require Python 3.6 but are cleaner
    for line in f:
        if not delimiter_pattern.match(line):
            output.write(line)
        else:
            output.close()
            fileNum += 1
            output = open(f'{fileNum}.txt','w')
    # Close last file
    if not output.closed:
        output.close()
Upvotes: 1