Reputation: 9
I have the following program, and I want to find, for example, the string 'light pink' in my text file.
I use word==' '.join(['light','pink'])
and it doesn't work.
from operator import itemgetter

def mmetric1(file):
    words_gen = (word.lower() for line in open("test.txt")
                 for word in line.split())
    words = {}
    for word in words_gen:
        if (word=='aqua')or(word=='azure')or(word=='black')or(word=='light pink'):
            words[word] = words.get(word, 0) + 1
    top_words = sorted(words.items(), key=itemgetter(1))
    for word, frequency in top_words:
        print ("%s : %d" % (word, frequency))
Upvotes: 0
Views: 2627
Reputation: 43437
Your entire approach is wrong.
It seems to me you want to check whether any of a set of strings occurs in your file. For that, you should use regular expressions.
Here:
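from collections import Counter
import re

def mmetric1(file_path, desired):
    # escape each phrase on its own, then join them as alternatives;
    # escaping the joined pattern would turn '|' into a literal character
    finder = re.compile('|'.join(re.escape(d) for d in desired))
    with open(file_path) as f:
        # findall needs a string, so read the file contents first
        return Counter(finder.findall(f.read()))

# have a list of the strings you want to find
desired = ['aqua', 'azure', 'black', 'light pink']
# run the method (test.txt is the file from the question)
file_path = 'test.txt'
mmetric1(file_path, desired)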
If you are worried about large files and performance, you can iterate over the lines in the file instead of reading it all at once:
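def mmetric1(file_path, desired):
    results = Counter()
    # escape each phrase on its own, then join them as alternatives
    finder = re.compile('|'.join(re.escape(d) for d in desired))
    with open(file_path) as f:
        for line in f:
            # update the results instance, not the Counter class
            results.update(finder.findall(line))
    return results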
To print the results in the same format as your own code:
for word, frequency in mmetric1(file_path, desired).items():
    print ("%s : %d" % (word, frequency))
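One caveat: a plain alternation also matches inside longer words, so 'black' would be counted in 'blackboard', and matching is case-sensitive unless you ask otherwise. A minimal sketch of a stricter variant, assuming you want whole-word, case-insensitive matches; count_phrases and the sample text are hypothetical and not taken from the question:

```python
from collections import Counter
import re

def count_phrases(text, desired):
    # Hypothetical helper: counts whole-word, case-insensitive matches.
    # Longer phrases go first so they win over any shorter overlap.
    ordered = sorted(desired, key=len, reverse=True)
    pattern = r'\b(?:%s)\b' % '|'.join(re.escape(d) for d in ordered)
    finder = re.compile(pattern, re.IGNORECASE)
    # Normalise case so 'Aqua' and 'aqua' share one counter key.
    return Counter(m.lower() for m in finder.findall(text))

sample = "Aqua and light pink on a blackboard, black ink"
print(count_phrases(sample, ['aqua', 'azure', 'black', 'light pink']))
```

The \b anchors are what keep 'blackboard' out of the count; drop them if substring matches are what you actually want.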
Upvotes: 1
Reputation: 63707
When you split a string without arguments, it splits on whitespace, which includes the space character.
So afterwards there is no way to match consecutive words in the manner you propose, unless you pull the next word from the generator yourself whenever you see 'light'.
Example Code
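try:
    while True:
        word = next(words_gen)
        # peek at the following word; note that it is consumed
        # even when it turns out not to be 'pink'
        if word == 'light' and next(words_gen) == 'pink':
            word = 'light pink'
        if word in ('aqua', 'azure', 'black', 'light pink'):
            words[word] = words.get(word, 0) + 1
except StopIteration:
    pass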
This is not a good option if you are searching a huge file.
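If you would rather keep the word-by-word approach, one alternative is to look at each word together with its successor, so no word is consumed and lost. A rough sketch, assuming phrases of at most two words; count_colours, its parameters, and the sample words are made-up names for illustration:

```python
from itertools import chain, tee

def count_colours(words, singles, pairs):
    # words: iterable of lowercase words
    # singles: set of one-word colours
    # pairs: set of (first, second) tuples for two-word colours
    counts = {}
    current, following = tee(words)
    next(following, None)  # advance the second iterator by one word
    # Pad with None so the last word is still paired with something.
    for word, nxt in zip(current, chain(following, [None])):
        if (word, nxt) in pairs:
            key = word + ' ' + nxt
        elif word in singles:
            key = word
        else:
            continue
        counts[key] = counts.get(key, 0) + 1
    return counts

words = "light blue black light pink aqua pink".split()
print(count_colours(words, {'aqua', 'azure', 'black'}, {('light', 'pink')}))
```

Because the two iterators stay one word apart, tee only buffers a single word at a time, so this also works on a generator over a large file.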
Upvotes: 0
Reputation: 60681
You have already split the entire line into separate words:
for word in line.split()
So there is no single word in words_gen which contains the text 'light pink'. It instead contains 'light' and 'pink' as two separate words, along with all the other words on that line.
You need a different approach; have a look at regular expressions.
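To see the difference concretely, here is a small illustration on a made-up sample line: after splitting, no element equals 'light pink', whereas a regular expression searches the unsplit text and can match across the space.

```python
import re

line = "she wore a light pink dress"  # made-up sample text

# Splitting breaks the phrase into two separate words.
tokens = line.split()
print('light pink' in tokens)   # False
print(tokens[3:5])              # ['light', 'pink']

# A regular expression runs on the whole line, so the space is still there.
print(re.findall(r'light pink', line))  # ['light pink']
```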
Upvotes: 1