Reputation: 2054
I have a text file with several observations, one observation per line. I would like to detect the unique occurrence of each word in a line; in other words, if the same word occurs twice or more on the same line, it is still counted once. However, I would like to count the frequency of occurrence of each word across all observations: if a word occurs in two or more lines, I would like to count the number of lines it occurred in. I also remove certain words from the file by referencing another file. Here is the program I wrote, and it is really slow when processing a large number of files. Please offer suggestions on how to improve its speed. Thank you.
import re, string
from itertools import chain, tee, izip
from collections import defaultdict

def count_words(in_file="", del_file="", out_file=""):
    d_list = re.split('\n', file(del_file).read().lower())
    d_list = [x.strip(' ') for x in d_list]
    dict2 = {}
    f1 = open(in_file, 'r')
    lines = map(string.strip, map(str.lower, f1.readlines()))
    for line in lines:
        dict1 = {}
        new_list = []
        for char in line:
            new_list.append(re.sub(r'[0-9#$?*_><@\(\)&;:,.!-+%=\[\]\-\/\^]', "_", char))
        s = ''.join(new_list)
        for word in d_list:
            s = s.replace(word, "")
        for word in s.split():
            dict1[word] = 1
        for word in dict1.keys():
            try:
                dict2[word] += 1
            except:
                dict2[word] = 1
    freq_list = dict2.items()
    freq_list.sort()
    f1.close()
    word_count_handle = open(out_file, 'w+')
    for word, freq in freq_list:
        print>>word_count_handle, word, freq
    word_count_handle.close()
    return dict2

dict = count_words("in_file.txt", "delete_words.txt", "out_file.txt")
Upvotes: 0
Views: 546
Reputation: 1469
u_words = set()
u_words_in_lns = []
wordcount = {}
words = []
# buff is assumed to already hold the contents of the input file
# get the set of unique words on each line
for line in buff.split('\n'):
    u_words_in_lns.append(set(line.split()))
# create a set of all unique words in the file
map(u_words.update, u_words_in_lns)
# flatten the per-line sets into a single list of words again
map(words.extend, u_words_in_lns)
# count everything up: each word appears in the flattened list
# exactly once per line it occurred in
for word in u_words:
    wordcount[word] = words.count(word)
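For completeness, a minimal sketch of how buff might be populated (hypothetical setup; the file name and lower-casing are carried over from the question):

    # read the whole input file into buff
    buff = open('in_file.txt').read().lower()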
Upvotes: 0
Reputation: 49095
Without having done any performance testing, the following come to mind:
1) you're using regexes -- why? Are you just trying to get rid of certain characters?
2) you're using exceptions for flow control -- although it can be pythonic (better to ask forgiveness than permission), throwing exceptions can often be slow. As seen here:
for word in dict1.keys():
    try:
        dict2[word] += 1
    except:
        dict2[word] = 1
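For instance, a collections.defaultdict (which the question already imports) avoids the exception entirely; a minimal sketch:

    from collections import defaultdict

    dict2 = defaultdict(int)  # missing keys start at 0
    for word in dict1.keys():
        dict2[word] += 1      # no try/except needed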
3) turn d_list into a set, and use Python's in to test for membership, and simultaneously ...
4) avoid heavy use of the replace method on strings -- I believe you're using this to filter out the words that appear in d_list. This could be accomplished instead by avoiding replace and just filtering the words in the line, either with a list comprehension:

    [word for word in words if word not in del_words]

or with a filter (not very pythonic):

    filter(lambda word: word not in del_words, words)
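Putting 3) and 4) together, a quick sketch (d_list and lines are as in the question; del_words is just its set form):

    del_words = set(d_list)  # set membership tests are O(1) on average
    for line in lines:
        kept = [word for word in line.split() if word not in del_words]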
Upvotes: 1
Reputation: 20654
You're running re.sub on each character of the line, one at a time. That's slow. Do it on the whole line:
s = re.sub(r'[0-9#$?*_><@\(\)&;:,.!-+%=\[\]\-\/\^]', "_", line)
Also, have a look at sets and the Counter class in the collections module. It may be faster if you just count everything and then discard the words you don't want afterwards.
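A rough sketch of that approach (Counter requires Python 2.7+; the file name is taken from the question):

    from collections import Counter

    counts = Counter()
    for line in open('in_file.txt'):
        # a set makes repeated words count only once per line
        counts.update(set(line.lower().split()))
    # discard the unwanted words afterwards
    for word in d_list:
        counts.pop(word, None)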
Upvotes: 1