Reputation: 241
I am trying to write a program that counts the most frequently used words in one file, excluding any words that appear in a second file. So basically I am reading data from test.txt and finding the most frequent word in it, but that word should not appear anywhere in test2.txt.
Below are sample data files, test.txt and test2.txt
test.txt:
The Project is for testing. doing some testing to find what's going on. the the the.
test2.txt:
a
about
above
across
after
afterwards
again
against
the
Below is my script, which parses files test.txt and test2.txt. It finds the most frequently used words from test.txt, excluding words found in test2.txt.
I thought I was doing everything right, but when I execute my script it gives "the" as the most frequent word. The result should actually be "testing", since "the" appears in test2.txt and "testing" does not.
from collections import Counter
import re

dgWords = re.findall(r'\w+', open('test.txt').read().lower())
f = open('test2.txt', 'rb')
sWords = [line.strip() for line in f]
print(len(dgWords))
for sWord in sWords:
    print(sWord)
    print(dgWords)
    while sWord in dgWords:
        dgWords.remove(sWord)
print(len(dgWords))
mostFrequentWord = Counter(dgWords).most_common(1)
print(mostFrequentWord)
Upvotes: 1
Views: 87
Reputation: 113975
import re
from collections import Counter

with open('test.txt') as testfile, open('test2.txt') as stopfile:
    stopwords = set(line.strip() for line in stopfile)
    words = Counter(re.findall(r'\w+', testfile.read().lower()))
for word in stopwords:
    if word in words:
        words.pop(word)
print("the most frequent word is", words.most_common(1))
Upvotes: 0
Reputation: 4648
I simply changed the following line of your original code
f = open('test2.txt', 'rb')
to
f = open('test2.txt', 'r')
and it worked. Read the file as text instead of binary: in 'rb' mode every line comes back as bytes, and a bytes value never compares equal to the str values that re.findall returns, so your removal loop never matches anything. Tested on Python 3.4, Eclipse PyDev, Win7 x64.
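A minimal sketch of the bytes/str mismatch (the values here are illustrative, not taken from the question's files):

```python
# In binary mode, file lines come back as bytes; re.findall on a str
# returns str. A bytes value never equals a str value, so membership
# tests silently fail instead of raising an error.
stop_word_bytes = b'the'         # what 'rb' mode yields (after strip)
words = ['the', 'testing']       # what re.findall(r'\w+', text) yields

print(stop_word_bytes in words)           # False: b'the' != 'the'
print(stop_word_bytes.decode() in words)  # True once decoded to str
```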
OFFTOPIC:
It's more pythonic to open files using with statements. In this case, write
with open('test2.txt', 'r') as f:
and indent the file-processing statements accordingly. That keeps you from forgetting to close the file stream.
Upvotes: 0
Reputation: 3585
Here's how I'd go about it - using sets:
import re
from collections import Counter

all_words = re.findall(r'\w+', open('test.txt').read().lower())
f = open('test2.txt', 'r')   # text mode, so the stop words are str, not bytes
stop_words = [line.strip() for line in f]

set_all = set(all_words)
set_stop = set(stop_words)
all_only = set_all - set_stop

print(Counter(w for w in all_words if w in all_only).most_common(1))
This should also be slightly faster, since the Counter only tallies the words that survive the stop-word filter.
Upvotes: 1