Rens

Reputation: 512

Python: using regex and a loop to clean multiple text files

I'm cleaning newspaper articles stored in separate text files.

In one of the cleaning stages, I want to remove all the text in a file that comes after the delimiter 'LOAD-DATE:'. I use a small piece of code that does the job when applied to a single string. See below.

line = 'A little bit of text. LOAD-DATE: And some redundant text'

import re
m = re.match('(.*LOAD-DATE:)', line)
if m:
    line = m.group(1)
    line = re.sub('LOAD-DATE:', '', line)   
    print(line)   

A little bit of text.

However, when I put the code in a loop to clean a whole bunch of separate text files (a loop that works fine in other stages of the script), it produces gigantic, identical text files that look nothing like the original newspaper articles. See the loop:

files = glob.glob("*.txt")

for f in files:
    with open(f, "r") as fin: 
        try:
            import re
            m = re.match('(.*LOAD-DATE:)', fin)
            if m:
                data = m.group(1)
                data = re.sub('LOAD-DATE:', '', data)   
        except:
            pass

    with open(f, 'w') as fout:
        fout.writelines(data) 

Something clearly goes wrong in the loop, but I have no idea what.

Upvotes: 0

Views: 488

Answers (2)

johnashu

Reputation: 2211

I made 10 txt files all containing the string:

 A little bit of text. LOAD-DATE: And some redundant text

I changed the m variable as patrick suggested, so that the contents of the file are read into a string before matching.

   m = re.match('(.*LOAD-DATE:)', fin.read())

But I also found that I had to put the writelines call inside the if statement, so that a file is only rewritten when a match was actually found:

        if m:
            data = m.group(1)
            data = re.sub('LOAD-DATE:', '', data)   
            with open(f, 'w') as fout:
                fout.writelines(data) 

It changed them all no problem and very quickly.
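
A likely reason the original loop misbehaves: re.match needs a string, and passing the file object fin raises a TypeError, which the bare except then swallows silently. In that case data is never assigned inside the loop, so whatever data held from an earlier stage of the script would presumably be written to every file. Here is a minimal sketch of that behaviour. Using io.StringIO as a stand-in for an opened file is my assumption, not something from the original script:

```python
import io
import re

# Stand-in for an opened text file (assumption: the real code uses open(f, "r"))
fin = io.StringIO('A little bit of text. LOAD-DATE: And some redundant text')

try:
    # Passing the file object itself, as in the question's loop
    re.match('(.*LOAD-DATE:)', fin)
except TypeError as exc:
    # A bare "except: pass" would hide this error completely
    error = exc
else:
    error = None

# Reading the contents first gives re.match the string it expects
m = re.match('(.*LOAD-DATE:)', fin.read())
cleaned = re.sub('LOAD-DATE:', '', m.group(1))

print(type(error).__name__)
print(cleaned)
```

Catching a specific exception, or dropping the try/except altogether, would have surfaced the problem right away.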

I hope this helps.

Upvotes: 0

andrewlamb

Reputation: 131

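Try going through the file line by line. Something like:

import glob
import re

files = glob.glob("*.txt")

for f in files:
    # Read the file line by line, collecting the cleaned lines
    with open(f, "r") as fin:
        data = []

        for line in fin:
            m = re.match('(.*LOAD-DATE:)', line)
            if m:
                # Keep only the part of the line up to the delimiter,
                # then drop the delimiter itself
                line = m.group(1)
                line = re.sub('LOAD-DATE:', '', line)
            data.append(line)

    # Overwrite the file with the cleaned lines
    with open(f, 'w') as fout:
        fout.writelines(data)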
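
Note that the line-by-line version only trims the line containing the delimiter; any lines after it are kept. If the goal is to drop everything in the file after 'LOAD-DATE:', a rough sketch could look like the one below. It swaps the regex for str.partition, and the names strip_after_delimiter and clean_folder, as well as the sample article text, are made up for illustration:

```python
import tempfile
from pathlib import Path

DELIMITER = 'LOAD-DATE:'


def strip_after_delimiter(text, delimiter=DELIMITER):
    """Return the text before the first delimiter (hypothetical helper).

    If the delimiter is absent, the text comes back unchanged.
    """
    head, _sep, _tail = text.partition(delimiter)
    return head


def clean_folder(folder):
    """Truncate every .txt file in folder at the delimiter (hypothetical helper)."""
    for path in Path(folder).glob('*.txt'):
        original = path.read_text()
        cleaned = strip_after_delimiter(original)
        # Only rewrite files that actually contained the delimiter
        if cleaned != original:
            path.write_text(cleaned)


# Small demo on a temporary folder so no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'article.txt'
    sample.write_text('Headline\nBody text.\nLOAD-DATE: May 1\nLANGUAGE: ENGLISH\n')
    clean_folder(tmp)
    result = sample.read_text()

print(repr(result))
```

Because the whole file is read into one string, the delimiter can sit on any line, and files without it are left alone.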

Upvotes: 1
