Zenvega

Reputation: 2054

Python: counting unique instance of words across several lines

I have a text file with several observations, one observation per line. I would like to detect the unique occurrence of each word in a line: if the same word occurs twice or more on the same line, it is still counted only once. However, I would like to count the frequency of each word across all observations. This means that if a word occurs in two or more lines, I would like to count the number of lines it occurred in. I also remove certain words from the file by referencing another file. Here is the program I wrote; it is really slow when processing a large number of files. Please offer suggestions on how to improve its speed. Thank you.
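For example (made-up lines, not my real data), this is the counting I am after:

```python
lines = ["the cat sat on the mat",
         "the dog sat"]

counts = {}
for line in lines:
    for word in set(line.split()):   # duplicates within a line count once
        counts[word] = counts.get(word, 0) + 1

# "the" occurs twice in line 1 but is counted once there, plus once in line 2
assert counts["the"] == 2
assert counts["sat"] == 2
assert counts["cat"] == 1
```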

import re, string
from itertools import chain, tee, izip
from collections import defaultdict

def count_words(in_file="",del_file="",out_file=""):

    d_list = re.split('\n', file(del_file).read().lower())
    d_list = [x.strip(' ') for x in d_list] 

    dict2={}
    f1 = open(in_file,'r')
    lines = map(string.strip,map(str.lower,f1.readlines()))

    for line in lines:
        dict1={}
        new_list = []
        for char in line:
            new_list.append(re.sub(r'[0-9#$?*_><@\(\)&;:,.!-+%=\[\]\-\/\^]', "_", char))
        s=''.join(new_list)
        for word in d_list:
            s = s.replace(word,"")
        for word in s.split():
            dict1[word] = 1
        for word in dict1.keys():
            try:
                dict2[word] += 1
            except:
                dict2[word] = 1
    freq_list = dict2.items()
    freq_list.sort()
    f1.close()

    word_count_handle = open(out_file,'w+')
    for word, freq  in freq_list:
        print>>word_count_handle,word, freq
    word_count_handle.close()
    return dict2

word_counts = count_words("in_file.txt","delete_words.txt","out_file.txt")

Upvotes: 0

Views: 546

Answers (3)

pyInTheSky

Reputation: 1469

# buff holds the whole input file's text, e.g. buff = open(in_file).read()

u_words        = set()
u_words_in_lns = []
wordcount      = {}
words          = []

# get unique words per line
for line in buff.split('\n'):
    u_words_in_lns.append(set(line.split(' ')))

# create a set of all unique words
for line_words in u_words_in_lns:
    u_words.update(line_words)

# flatten the sets into a single list of words again
for line_words in u_words_in_lns:
    words.extend(line_words)

# count everything up: the flattened list holds each word once per
# line it occurred in, so list.count gives the number of lines
for word in u_words:
    wordcount[word] = words.count(word)

Upvotes: 0

Matt Fenwick

Reputation: 49095

Without having done any performance testing, the following come to mind:

1) you're using regexes -- why? Are you just trying to get rid of certain characters?

2) you're using exceptions for flow control -- although it can be pythonic (better to ask forgiveness than permission), throwing exceptions can often be slow, as seen here:

    for word in dict1.keys():
        try:
            dict2[word] += 1
        except:
            dict2[word] = 1
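A defaultdict (which the question already imports) removes the need for either branch; a minimal sketch:

```python
from collections import defaultdict

dict2 = defaultdict(int)          # missing keys start at 0
for word in ("apple", "pear", "apple"):
    dict2[word] += 1              # no try/except required
```

After the loop, dict2 maps "apple" to 2 and "pear" to 1.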

3) turn d_list into a set, and use Python's in operator to test for membership, and simultaneously ...

4) avoid heavy use of replace method on strings -- I believe you're using this to filter out the words that appear in d_list. This could be accomplished instead by avoiding replace, and just filtering the words in the line, either with a list comprehension:

[word for word in words if word not in del_words]

or with a filter (not very pythonic):

filter(lambda word: word not in del_words, words)
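Putting suggestions 2)-4) together, a rough sketch (function and variable names are mine, not from the question):

```python
def count_lines_per_word(lines, del_words):
    del_words = set(del_words)            # O(1) membership tests
    counts = {}
    for line in lines:
        # unique words in this line, minus the delete-list
        kept = set(w for w in line.split() if w not in del_words)
        for word in kept:
            counts[word] = counts.get(word, 0) + 1   # no exceptions
    return counts
```

For example, count_lines_per_word(["a b a", "b c"], ["c"]) returns {'a': 1, 'b': 2}.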

Upvotes: 1

MRAB

Reputation: 20654

You're running re.sub on each character of the line, one at a time. That's slow. Do it on the whole line:

s = re.sub(r'[0-9#$?*_><@\(\)&;:,.!-+%=\[\]\-\/\^]', "_", line)

Also, have a look at sets and the Counter class in the collections module. It may be faster if you just count and then discard those you don't want afterwards.
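A sketch of that count-then-discard idea with Counter (available since Python 2.7; names are illustrative):

```python
from collections import Counter

def line_frequencies(lines):
    counts = Counter()
    for line in lines:
        counts.update(set(line.split()))   # each word at most once per line
    return counts

freqs = line_frequencies(["spam eggs spam", "eggs ham"])
del freqs["ham"]            # discard unwanted words afterwards
# freqs now maps "eggs" to 2 and "spam" to 1
```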

Upvotes: 1
