user3314492

Reputation: 241

Most frequent word in one file which is not found in another file using Python

I am trying to write a program that finds the most frequently used word in one file, excluding any word that also appears in a second file. Specifically, I read test.txt and count its words, but the word I report must not be present in test2.txt.

Below are sample data files, test.txt and test2.txt

test.txt:

The Project is for testing. doing some testing to find what's going on. the the the.

test2.txt:

a
about
above
across
after
afterwards
again
against
the

Below is my script, which parses test.txt and test2.txt. It is meant to find the most frequently used word in test.txt, excluding words found in test2.txt.

I thought I was doing everything right, but when I run the script it reports "the" as the most frequent word. The result should be "testing", because "the" is listed in test2.txt and "testing" is not.

from collections import Counter
import re

dgWords = re.findall(r'\w+', open('test.txt').read().lower())

f = open('test2.txt', 'rb')
sWords = [line.strip() for line in f]

print(len(dgWords));

for sWord in sWords:
    print (sWord)
    print (dgWords) 
    while sWord in dgWords: dgWords.remove(sWord)   

print(len(dgWords));
mostFrequentWord = Counter(dgWords).most_common(1)
print (mostFrequentWord)

Upvotes: 1

Views: 87

Answers (3)

inspectorG4dget

Reputation: 113975

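import re
from collections import Counter

with open('test.txt') as testfile, open('test2.txt') as stopfile:
    # Stop words, one per line, read in text mode so they are str objects.
    stopwords = set(line.strip() for line in stopfile)
    # Count every lowercased word in the text file (reuse the open handle).
    words = Counter(re.findall(r'\w+', testfile.read().lower()))

# Drop each stop word from the counts before picking the winner.
for word in stopwords:
    words.pop(word, None)
print("the most frequent word is", words.most_common(1))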

Upvotes: 0

Bill Huang

Reputation: 4648

I simply changed the following line of your original code

f = open('test2.txt', 'rb')

to

f = open('test2.txt', 'r')

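and it worked. Read the file as text instead of as binary. In Python 3, lines read in 'rb' mode are bytes objects, and bytes never compare equal to the str words returned by re.findall, so the test "sWord in dgWords" is always false and nothing is removed. Tested on Python 3.4, Eclipse PyDev, Windows 7 x64.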
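To see why the mode matters, here is a minimal sketch, assuming Python 3, of how a stop word read in binary mode compares against words extracted from a str. The list and the variable names are made up for illustration and are not part of the original script.

```python
# Words extracted by re.findall from a str are str objects.
words = ['the', 'testing', 'the']

stop_bytes = b'the\n'.strip()  # what a line read in 'rb' mode looks like
stop_text = 'the\n'.strip()    # what a line read in 'r' mode looks like

# bytes and str never compare equal in Python 3, so the membership test fails.
print(stop_bytes in words)  # False
print(stop_text in words)   # True
```

Because the membership test is false for every bytes stop word, the while loop in the question never runs and "the" stays in the list.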

OFFTOPIC:

It's more Pythonic to open files using a with statement. In this case, write

with open('test2.txt', 'r') as f:

and indent the file-processing statements under it. The file is then closed automatically when the block ends, so you cannot forget to close it.
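Combining the with statement with the text-mode fix, the whole script could look roughly like the sketch below. The function name most_frequent_excluding and the use of a temporary directory are assumptions made for illustration; the sample data is copied from the question.

```python
import os
import re
import tempfile
from collections import Counter


def most_frequent_excluding(text_path, stop_path):
    """Return the most common word in text_path that is not listed in stop_path."""
    with open(text_path, 'r') as text_file:
        words = re.findall(r'\w+', text_file.read().lower())
    with open(stop_path, 'r') as stop_file:
        stop_words = {line.strip() for line in stop_file}
    # Count only the words that are not stop words.
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(1)


# Recreate the question's sample files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, 'test.txt')
    stop_path = os.path.join(tmp, 'test2.txt')
    with open(text_path, 'w') as f:
        f.write("The Project is for testing. doing some testing "
                "to find what's going on. the the the.")
    with open(stop_path, 'w') as f:
        f.write("a\nabout\nabove\nacross\nafter\nafterwards\nagain\nagainst\nthe\n")
    result = most_frequent_excluding(text_path, stop_path)

print(result)
```

With the sample data this prints [('testing', 2)], which is the result the question expects.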

Upvotes: 0

gabhijit

Reputation: 3585

Here's how I'd go about it, using sets:

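import re
from collections import Counter

with open('test.txt') as text_file:
    all_words = re.findall(r'\w+', text_file.read().lower())

# Text mode, so the stop words are str and comparable with all_words.
with open('test2.txt') as stop_file:
    stop_words = [line.strip() for line in stop_file]

set_all = set(all_words)
set_stop = set(stop_words)

# Words that occur in test.txt but not in test2.txt.
all_only = set_all - set_stop

print(Counter(w for w in all_words if w in all_only).most_common(1))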

This should also be slightly faster, since the Counter only has to count the words in all_only.

Upvotes: 1
