Python: using regex and a loop to clean multiple text files

Question

I'm cleaning newspaper articles stored in separate text files.

In one of the cleaning stages, I want to remove all the text in a file that comes after the delimiter 'LOAD-DATE:'. I have a small piece of code that does the job when applied to a single string. See below.

line = 'A little bit of text. LOAD-DATE: And some redundant text'

import re
m = re.match('(.*LOAD-DATE:)', line)
if m:
    line = m.group(1)
    line = re.sub('LOAD-DATE:', '', line)   
    print(line)

A little bit of text.

However, when I translate the code into a loop to clean a whole bunch of separate text files (which works fine in other stages of the script), it produces gigantic, identical text files that don't look at all like the original newspaper articles. See the loop:

files = glob.glob("*.txt")

for f in files:
    with open(f, "r") as fin: 
        try:
            import re
            m = re.match('(.*LOAD-DATE:)', fin)
            if m:
                data = m.group(1)
                data = re.sub('LOAD-DATE:', '', data)   
        except:
            pass

    with open(f, 'w') as fout:
        fout.writelines(data)

Something clearly goes wrong in the loop, but I have no idea what.

andrewlamb · Accepted Answer

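The loop fails because re.match expects a string but is given the file object fin. That raises a TypeError, which the bare except silently swallows, so data is never updated and every file is overwritten with whatever data already held (presumably left over from an earlier stage of the script). Try going through the file line by line instead. Something like: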

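import glob
import re

files = glob.glob("*.txt")

for f in files:
    with open(f, "r") as fin:
        data = []

        # Apply the regex to each line (a string), not to the file object
        for line in fin:
            m = re.match('(.*LOAD-DATE:)', line)
            if m:
                # Keep the text up to the marker, then drop the marker itself
                line = m.group(1)
                line = re.sub('LOAD-DATE:', '', line)
            data.append(line)

    # Overwrite the file with the cleaned lines
    with open(f, 'w') as fout:
        fout.writelines(data)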
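
Note that the line-by-line loop above trims only the line that contains the marker; later lines without the marker are written back unchanged. If the goal is to drop everything after the first 'LOAD-DATE:' in the whole file, one possible sketch is to read the file as a single string and cut it with str.partition instead of a regex. The helper names strip_after_delimiter and clean_files are hypothetical (not from the original post), and the sketch assumes each article fits comfortably in memory.

```python
import glob


def strip_after_delimiter(text, delimiter="LOAD-DATE:"):
    """Return the text before the first delimiter; unchanged if it is absent."""
    # partition splits at the first occurrence and returns (head, sep, tail);
    # if the delimiter is missing, head is the whole string.
    head, _sep, _tail = text.partition(delimiter)
    return head


def clean_files(pattern="*.txt", delimiter="LOAD-DATE:"):
    """Rewrite each matching file, dropping the delimiter and all text after it."""
    for path in glob.glob(pattern):
        # Read the whole file into one string first ...
        with open(path, "r") as fin:
            text = fin.read()
        # ... then overwrite it with the truncated text.
        with open(path, "w") as fout:
            fout.write(strip_after_delimiter(text, delimiter))
```

Calling clean_files() from the directory that holds the articles would rewrite them in place, so it is probably worth trying it on a copy of the files first.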
