Reputation: 512
I'm cleaning newspaper articles stored in separate text files.
In one of the cleaning stages, I want to remove all the text in a file that comes after the delimiter 'LOAD-DATE:'. I have a small piece of code that does the job when applied to a single string. See below.
line = 'A little bit of text. LOAD-DATE: And some redundant text'
import re
m = re.match('(.*LOAD-DATE:)', line)
if m:
    line = m.group(1)
    line = re.sub('LOAD-DATE:', '', line)
print(line)
A little bit of text.
However, when I translate the code into a loop to clean a whole batch of separate text files (a construction that works fine in other stages of the script), it produces gigantic, identical text files that don't look at all like the original newspaper articles. See the loop:
files = glob.glob("*.txt")
for f in files:
    with open(f, "r") as fin:
        try:
            import re
            m = re.match('(.*LOAD-DATE:)', fin)
            if m:
                data = m.group(1)
                data = re.sub('LOAD-DATE:', '', data)
        except:
            pass
    with open(f, 'w') as fout:
        fout.writelines(data)
Something clearly goes wrong in the loop, but I have no idea what.
Upvotes: 0
Views: 488
Reputation: 2211
I made 10 txt files all containing the string:
A little bit of text. LOAD-DATE: And some redundant text
I changed the m variable as patrick suggested, so that the file's contents (a string, not the file object) are actually read and passed to re.match:
m = re.match('(.*LOAD-DATE:)', fin.read())
But I also found that I had to move the writelines call inside the if statement:
if m:
    data = m.group(1)
    data = re.sub('LOAD-DATE:', '', data)
    with open(f, 'w') as fout:
        fout.writelines(data)
It changed them all without a problem, and very quickly.
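Putting the two changes together, the whole loop comes out roughly like this (just a sketch based on the question's code, with the bare try/except dropped so errors stay visible, and assuming, as in my test files, that 'LOAD-DATE:' sits on the first line, since . in the pattern doesn't cross newlines by default):

import glob
import re

files = glob.glob("*.txt")
for f in files:
    # read the whole file into one string; re.match needs text, not the file object
    with open(f, "r") as fin:
        m = re.match('(.*LOAD-DATE:)', fin.read())
    # only rewrite the file when the delimiter was actually found
    if m:
        data = m.group(1)
        data = re.sub('LOAD-DATE:', '', data)
        with open(f, 'w') as fout:
            fout.writelines(data)  # data is a single string, so fout.write(data) works just as well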
I hope this helps.
Upvotes: 0
Reputation: 131
Try going through the file line by line. Something like:
import glob
import re

files = glob.glob("*.txt")
for f in files:
    with open(f, "r") as fin:
        data = []
        for line in fin:
            m = re.match('(.*LOAD-DATE:)', line)
            if m:
                line = m.group(1)
                line = re.sub('LOAD-DATE:', '', line)
            data.append(line)
    with open(f, 'w') as fout:
        fout.writelines(data)
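As a quick sanity check, here is the same line-by-line logic run against a single throwaway file built from the string in the question (sample.txt is just a hypothetical name for the demo):

import re

# build a one-line demo file from the question's sample string
with open("sample.txt", "w") as f:
    f.write("A little bit of text. LOAD-DATE: And some redundant text\n")

data = []
with open("sample.txt", "r") as fin:
    for line in fin:
        m = re.match('(.*LOAD-DATE:)', line)
        if m:
            line = m.group(1)                       # keep everything up to and including the delimiter
            line = re.sub('LOAD-DATE:', '', line)   # then strip the delimiter itself
        data.append(line)

with open("sample.txt", "w") as fout:
    fout.writelines(data)

with open("sample.txt") as f:
    print(repr(f.read()))   # 'A little bit of text. '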
Upvotes: 1