Reputation: 5044

Count occurrences of a couple of specific words

I have a list of words, lets say: ["foo", "bar", "baz"] and a large string in which these words may occur.

I now use for every word in the list the "string".count("word") method. This works OK, but seems rather inefficient. For every extra word added to the list the entire string must be iterated over an extra time.

Is their any better method to do this, or should I implement a custom method which iterates over the large string a single time, checking for each character if one of the words in the list has been reached?

To be clear:

I want the number of occurrences per word in the list.
The string to search in is different each time and consists of about 10000 chars
The list of words is constant
The words in the list of words can contain whitespace

Upvotes: 7

Answers (4)

Justas

Reputation: 159

The Counter method doesn't work well for large vocabularies. In example below CountVectorizer is many times faster:

import time
import random

longstring = ["foo", "bar", "baz", "qux", "thud"] * 100000
random.shuffle(longstring)
longstring = " ".join(longstring)
vocab = ["foo bar", "baz"] + ["nothing"+str(i) for i in range(100000)]

Testing:

import re
from collections import Counter

tic = time.time()
r = re.compile("|".join(r"\b%s\b" % w for w in vocab))
wordcount = Counter(re.findall(r, longstring))
print(time.time() - tic)

870 seconds

from sklearn.feature_extraction.text import CountVectorizer
from numpy import array

tic = time.time()
vectorized = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2)).fit([longstring])  # list strings contains 1 to 2 words
counts = vectorized.transform([longstring])
counts = array(counts.sum(axis=0))[0]
wordcount = {vocab[i]: counts[i] for i in range(len(vocab))}
print(time.time() - tic)

1.17 seconds

Upvotes: 1

Fred Foo

Reputation: 363838

Make a dict-typed frequency table for your words, then iterate over the words in your string.

vocab = ["foo", "bar", "baz"]
s = "foo bar baz bar quux foo bla bla"

wordcount = dict((x,0) for x in vocab)
for w in re.findall(r"\w+", s):
    if w in wordcount:
        wordcount[w] += 1

Edit: if the "words" in your list contain whitespace, you can instead build an RE out of them:

from collections import Counter

vocab = ["foo bar", "baz"]
r = re.compile("|".join(r"\b%s\b" % w for w in vocab))
wordcount = Counter(re.findall(r, s))

Explanation: this builds the RE r'\bfoo bar\b|\bbaz\b' from the vocabulary. findall then finds the list ['baz', 'foo bar'] and the Counter (Python 2.7+) counts the occurrence of each distinct element in it. Watch out that your list of words should not contain characters that are special to REs, such as ()[]\.

Upvotes: 9

Gareth Latty

Reputation: 89097

Presuming the words need to be found separately (that is, you want to count words as made by str.split()):

Edit: as suggested in the comments, a Counter is a good option here:

from collections import Counter

def count_many(needles, haystack):
    count = Counter(haystack.split())
    return {key: count[key] for key in count if key in needles}

Which runs as so:

count_many(["foo", "bar", "baz"], "testing somefoothing foo bar baz bax foo foo foo bar bar test bar test")
{'baz': 1, 'foo': 4, 'bar': 4}

Note that in Python <= 2.6(?) you will need to use return dict((key, count[key]) for key in count if key in needles) due to the lack of dict comprehensions.

Of course, another option is to simply return the whole Counter object and only get the values you need when you need them, as it may not be a problem to have the extra values, depending on the situation.

Old answer:

from collections import defaultdict

def count_many(needles, haystack):
    count = defaultdict(int)
    for word in haystack.split():
        if word in needles:
            count[word] += 1
    return count

Which results in:

count_many(["foo", "bar", "baz"], "testing somefoothing foo bar baz bax foo foo foo bar bar test bar test")
defaultdict(<class 'int'>, {'baz': 1, 'foo': 4, 'bar': 4})

If you greatly object to getting a defaultdict back (which you shouldn't, as it functions exactly the same as a dict when accessing), then you can do return dict(count) instead to get a normal dictionary.

Upvotes: 3

Senthil Kumaran

Reputation: 56951

How long is your string and I understand that it is not constantly changing as your list of string is?

A good idea is to iterate over the words in the string and have dictionary for the words and increment the count for each word. With this in place. You can then looking for the word in the list in the dictionary and output it's value which is the number of occurrence.