MYR

Reputation: 87

Parse a 90MB data archive in Python

I'm a newbie to Python, trying to build a program that parses several hundred documents into speakers and their speech (the data is hearing transcripts with a semi-regular structure). After parsing, I write the results to a .csv file, then split the speech into paragraphs and write those to a second .csv. Here is the code (acknowledgements to my colleague for his part in developing this, which was massive):

import os
import re
import csv
from bs4 import BeautifulSoup

path = "path in computer"
os.chdir(path)


with open('hearing_name.htm', 'r') as f:
        hearing = f.read()

Hearing = BeautifulSoup(hearing)
Hearing = Hearing.get_text()
Hearing = Hearing.split("erroneous text")


speakers = re.findall("\\n    Mr. [A-Z][a-z]+\.|\\n    Ms. [A-Z][a-z]+\.|\\n    Congressman [A-Z][a-z]+\.|\\n   Congresswoman [A-Z][a-z]+\.|\\n   Chairwoman [A-Z][a-z]+\.|\\n   Chairman [A-Z][a-z]+\.", hearing)
speakers = list(set(speakers))

print speakers

position = []
for speaker in speakers:
        x = hearing.find(speaker)
        position.append(x)

def find_speaker(hearing, speakers):
        position = []
        for speaker in speakers:
                x = hearing.find(speaker)
                if x==-1:
                        x += 1000000
                position.append(x)
        first = min(position)
        name = speakers[position.index(min(position))]
        name_length = len(name)
        chunk = [name, hearing[0:first], hearing[first+name_length:]]
        return chunk

chunks = []

print hearing
names = []
while len(hearing)>10:
        chunk_try = find_speaker(hearing, speakers)
        hearing = chunk_try[2]
        chunks.append(chunk_try[1])
        names.append(chunk_try[0].strip())

print len(hearing)#0

#print dialogue[0:5]

chunks.append(hearing)        
chunks = chunks[1:]
print len(names) #138
print len(chunks) #138

data = zip(names, chunks)


with open('filename.csv','wb') as f:
    w=csv.writer(f)
    w.writerow(['Speaker','Speech'])
    for row in data:
        w.writerow(row)


paragraphs = str(chunks)
print (paragraphs)


Paragraphs = paragraphs.split("\\n")

data1 = zip(Paragraphs)

with open('Paragraphs.csv','wb') as f:
    w=csv.writer(f)
    w.writerow(['Paragraphs'])
    for row in data1:
        w.writerow(row)

Obviously, the code above can do what I need one hearing at a time, but my question is: how can I automate this to the point where I can process either large batches or all of the files at once (578 hearings in total)? I've tried the code below (which has worked for me in the past when compiling large sets of data), but this time I get no results (a memory leak?).

Tested Compiling Code:

hearing = [filename for filename in os.listdir(path)]

hearings = []

#compile hearings
for file in hearing:
    input = open(file, 'r')
    hearings.append(input.read())

Thanks in advance for your help.

Upvotes: 1

Views: 156

Answers (2)

martineau

Reputation: 123473

First you need to take the first set of code, generalize it, and make it into a giant function. This will involve replacing any hardcoded paths and file names in it with appropriately named variables.

Give the new driver function arguments that correspond to each of the path(s) and file name(s) you replaced. Calling this function will perform all the steps needed to process one input file and produce all the output files that result from doing that.

You can test whether you've done this correctly by calling the driver function with the file names that used to be hardcoded and seeing whether it produces the same output as before.

Once that is done, import the file the function is in (which makes it a module) into your batch-processing script and invoke the new driver function multiple times, passing different input and output file names to it each time, as sketched after the code below.

I've done the first step for you (and fixed the mixed indenting). Note, however, that it's untested, since that's impossible for me to actually do:

import os
import re
import csv
from bs4 import BeautifulSoup

def driver(folder, input_filename, output_filename1, output_filename2):
    os.chdir(folder)
    with open(input_filename, 'r') as f:
        hearing = f.read()

    Hearing = BeautifulSoup(hearing)
    Hearing = Hearing.get_text()
    Hearing = Hearing.split("erroneous text")

    speakers = re.findall("\\n    Mr. [A-Z][a-z]+\.|\\n    Ms. [A-Z][a-z]+\.|\\n    Congressman [A-Z][a-z]+\.|\\n   Congresswoman [A-Z][a-z]+\.|\\n   Chairwoman [A-Z][a-z]+\.|\\n   Chairman [A-Z][a-z]+\.", hearing)
    speakers = list(set(speakers))

    print speakers

    position = []
    for speaker in speakers:
        x = hearing.find(speaker)
        position.append(x)

    def find_speaker(hearing, speakers):
        position = []
        for speaker in speakers:
            x = hearing.find(speaker)
            if x==-1:
                x += 1000000
            position.append(x)
        first = min(position)
        name = speakers[position.index(min(position))]
        name_length = len(name)
        chunk = [name, hearing[0:first], hearing[first+name_length:]]
        return chunk

    chunks = []

    print hearing
    names = []
    while len(hearing)>10:
        chunk_try = find_speaker(hearing, speakers)
        hearing = chunk_try[2]
        chunks.append(chunk_try[1])
        names.append(chunk_try[0].strip())

    print len(hearing)#0

    #print dialogue[0:5]

    chunks.append(hearing)
    chunks = chunks[1:]
    print len(names) #138
    print len(chunks) #138

    data = zip(names, chunks)

    with open(output_filename1,'wb') as f:
        w=csv.writer(f)
        w.writerow(['Speaker','Speech'])
        for row in data:
            w.writerow(row)

    paragraphs = str(chunks)
    print (paragraphs)

    Paragraphs = paragraphs.split("\\n")

    data1 = zip(Paragraphs)

    with open(output_filename2,'wb') as f:
        w=csv.writer(f)
        w.writerow(['Paragraphs'])
        for row in data1:
            w.writerow(row)

    return True  # success

if __name__ == '__main__':
    driver('path in computer', 'hearing_name.htm', 'filename.csv', 'Paragraphs.csv')
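
For the second step, a minimal sketch of the batch-processing script might look like this. It assumes the code above has been saved as hearing_parser.py (a hypothetical module name), that folder is an absolute path (since driver() calls os.chdir()), and that the transcripts all end in .htm. Deriving the output names from each input name is my own choice, so that one hearing's results don't overwrite another's:

import os
import hearing_parser  # hypothetical: the module containing driver() above

folder = 'path in computer'  # must be absolute, since driver() calls os.chdir()

for filename in os.listdir(folder):
    # skip anything that isn't a transcript, e.g. the CSVs written so far
    if not filename.endswith('.htm'):
        continue
    base = os.path.splitext(filename)[0]
    hearing_parser.driver(folder, filename,
                          base + '_speakers.csv',
                          base + '_paragraphs.csv')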

Upvotes: 1

Tim Wilder

Reputation: 1647

You can use a lot less memory, with no real downside, if you process these files individually: rather than reading every file up front and adding its contents to a list for later processing, process one file, then move on to the next.
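
A minimal sketch of that pattern (the parsing itself is elided here; the point is that only one transcript is ever held in memory at a time):

import os

path = 'path in computer'  # the folder of transcripts

for filename in os.listdir(path):
    if not filename.endswith('.htm'):
        continue
    with open(os.path.join(path, filename), 'r') as f:
        text = f.read()
    # ... parse `text` and write its CSVs here ...
    # `text` is rebound on the next iteration, so only one
    # transcript is ever held in memory at a time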

As for the lack of results, I'm not totally sure. Are you not getting any errors?

Upvotes: 0
