user2314768

Reputation: 9

Count occurrences of strings in a text file

I have the following program and I want to find, for example, the string 'light pink' in my text file. I tried word == ' '.join(['light', 'pink']) and it doesn't work.

from operator import itemgetter

def mmetric1(file):
    words_gen = (word.lower() for line in open("test.txt")
                                             for word in line.split())
    words = {}

    for word in words_gen:
        if (word=='aqua')or(word=='azure')or(word=='black')or(word=='light pink'):
            words[word] = words.get(word, 0) + 1

    top_words = sorted(words.items(), key=itemgetter(1))

    for word, frequency in top_words:
       print ("%s : %d" % (word, frequency))

Upvotes: 0

Views: 2627

Answers (3)

Inbar Rose

Reputation: 43437

Your entire approach is wrong.

It seems to me you want to check whether any of a set of strings occur in your file, and how often. You should use regular expressions.

Here:

from collections import Counter
import re

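def mmetric1(file_path, desired):
    # escape each string on its own, then join them into one alternation
    finder = re.compile('(%s)' % '|'.join(re.escape(d) for d in desired))
    with open(file_path) as f:
        # findall needs a string, so read the file contents first
        return Counter(finder.findall(f.read()))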

# have a list of the strings you want to find
desired = ['aqua', 'azure', 'black', 'light pink']
# run the method
mmetric1(file_path, desired)

If you are worried about large files, and performance, you can iterate over the lines in the file:

def mmetric1(file_path, desired):
    results = Counter()
    finder = re.compile('(%s)' % '|'.join(re.escape(d) for d in desired))
    with open(file_path) as f:
        for line in f:
            # update the Counter instance, not the Counter class
            results.update(finder.findall(line))
    return results

To print these results as you have your own:

for word, frequency in mmetric1(file_path, desired).items():
    print ("%s : %d" % (word, frequency))
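One thing to keep in mind with a regex like this: it matches substrings and is case-sensitive, so 'black' would also be found inside 'blackboard', and 'Black' would be missed. If you want whole words regardless of case, something along these lines might do. This is only a sketch; the name count_whole_words and the sample text are made up for illustration.

```python
import re
from collections import Counter


def count_whole_words(text, desired):
    # Escape each phrase separately, then join them into one alternation.
    # \b on both sides keeps 'black' from matching inside 'blackboard'.
    pattern = r'\b(?:%s)\b' % '|'.join(re.escape(d) for d in desired)
    finder = re.compile(pattern, re.IGNORECASE)
    # Lower-case the matches so 'Light pink' and 'light pink' share a key.
    return Counter(m.lower() for m in finder.findall(text))


# Hypothetical sample text, not taken from the question.
sample = "Light pink, black and blackboard; light pink again."
result = count_whole_words(sample, ['aqua', 'azure', 'black', 'light pink'])
print(result)
```

Note that this still expects exactly one space between 'light' and 'pink'.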

Upvotes: 1

Abhijit

Reputation: 63707

When you split a string with split(), it splits on whitespace, which includes the space character.

So afterwards there is no way to match consecutive words in the way you are attempting, unless you do one of the following:

  • Modify your loop

Example Code

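try:
    while True:
        word = next(words_gen)
        if word in ('aqua', 'azure', 'black'):
            words[word] = words.get(word, 0) + 1
        # note: next() consumes the word after 'light' even when it is not 'pink'
        elif word == 'light' and next(words_gen) == 'pink':
            # count the two-word colour under a single key
            words['light pink'] = words.get('light pink', 0) + 1
except StopIteration:
    pass

  • Use regular expressions

Not a good option if you are searching a huge file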

  • Use some other data structure, such as a prefix tree (trie)
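For the prefix tree option, here is a rough sketch of how it could look. It assumes the trie is keyed on whole words rather than characters and that the longest phrase wins at each position. The names build_trie, count_phrases and END, and the sample text, are invented for this example.

```python
from collections import Counter

END = object()  # marker key: a complete phrase ends at this node


def build_trie(phrases):
    """Build a nested-dict trie keyed on words rather than characters."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[END] = phrase
    return root


def count_phrases(tokens, trie):
    """Count the longest phrase that starts at each position in tokens."""
    counts = Counter()
    i = 0
    while i < len(tokens):
        node, match, match_end = trie, None, i
        j = i
        # Walk down the trie for as long as the next token continues a phrase.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                match, match_end = node[END], j
        if match is not None:
            counts[match] += 1
            i = match_end  # skip past the matched phrase
        else:
            i += 1
    return counts


# Hypothetical sample text, not taken from the question.
text = "Light pink and black, then light blue and light pink again"
tokens = [t.strip(",.") for t in text.lower().split()]
trie = build_trie(['aqua', 'azure', 'black', 'light pink'])
counts = count_phrases(tokens, trie)
print(counts)
```

Unlike the loop above, a failed match on 'light' does not swallow the following word, because the scan only advances past tokens that were part of a complete phrase.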

Upvotes: 0

Graham Borland

Reputation: 60681

You have already split the entire line into separate words:

for word in line.split()

So there is no single word in words_gen which contains the text light pink. It instead contains light and pink as two separate words, along with all the other words on that line.
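You can see this for yourself with a quick check (the sample line here is made up):

```python
# Hypothetical sample line, not from the asker's file.
line = "Aqua walls and a light pink door"

tokens = [word.lower() for word in line.split()]
print(tokens)

# 'light' and 'pink' are separate tokens, so no single token equals 'light pink'.
found = 'light pink' in tokens
print(found)
```

The join itself works fine and produces 'light pink'; the problem is that nothing on the other side of the comparison ever has that value.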

You need a different approach; have a look at regular expressions.

Upvotes: 1
