DharmaTurtle

Reputation: 8397

Counting occurrences of multiple strings in another string

In Python 2.7, given this string:

Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.

what would be the best way to find the total number of occurrences of "Spot", "brown", and "hair" in the string? In the example, it would return 8.

I'm looking for something like string.count("Spot","brown","hair"), but one that works with the "strings to be found" in a tuple or list.

Thanks!

Upvotes: 4

Views: 12901

Answers (2)

John La Rooy

Reputation: 304225

This does what you asked for, but notice that it will also count words like "hairy", "browner" etc.

>>> s = "Spot is a brown dog. Spot has brown hair. The hair of Spot is brown."
>>> sum(s.count(x) for x in ("Spot", "brown", "hair"))
8

You can also write it with map:

>>> sum(map(s.count, ("Spot", "brown", "hair")))
8

A more robust solution might use the nltk package

>>> import nltk  # Natural Language Toolkit
>>> sum(x in {"Spot", "brown", "hair"} for x in nltk.wordpunct_tokenize(s))
8

Upvotes: 14

mgilson

Reputation: 309939

I might use a Counter:

from collections import Counter

s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'
words_we_want = ("Spot", "brown", "hair")
data = Counter(s.split())  # count each whitespace-separated token
print(sum(data[word] for word in words_we_want))

Note that this will under-count by 2 (it prints 6, not 8), since 'brown.' and 'hair.' end up as separate Counter entries from 'brown' and 'hair'.
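One way to keep the Counter approach while avoiding the punctuation problem is to tokenize with a regex instead of `s.split()`. This is only a sketch: it assumes that "word" means a run of `\w` characters, which may not suit every input (for example, "don't" would be split in two).

```python
import re
from collections import Counter

s = "Spot is a brown dog. Spot has brown hair. The hair of Spot is brown."
words_we_want = ("Spot", "brown", "hair")

# \w+ matches runs of letters, digits and underscores, so "brown." yields "brown".
data = Counter(re.findall(r"\w+", s))
total = sum(data[word] for word in words_we_want)
print(total)  # 8
```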

A slightly less elegant solution that doesn't trip up on punctuation uses a regex:

>>> import re
>>> len(re.findall('Spot|brown|hair','Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'))
8

You can create the regex from a tuple simply by

'|'.join(re.escape(x) for x in words_we_want)
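Putting the two pieces together might look like the sketch below. The helper name `count_any` is hypothetical, chosen here for illustration. Note that `re.findall` returns non-overlapping matches, so the result can differ from summing `str.count` when one search string contains another.

```python
import re

def count_any(text, words):
    """Count non-overlapping occurrences of any string in `words` within `text`."""
    if not words:
        # An empty pattern would match at every position, so handle it explicitly.
        return 0
    # re.escape keeps characters such as "+" or "." from being treated as regex syntax.
    pattern = "|".join(re.escape(w) for w in words)
    return len(re.findall(pattern, text))

s = "Spot is a brown dog. Spot has brown hair. The hair of Spot is brown."
print(count_any(s, ("Spot", "brown", "hair")))  # 8
print(count_any("C++ and C#", ("C++",)))        # 1
print(count_any("ab", ("ab", "b")))             # 1, whereas summing str.count gives 2
```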

The nice thing about these solutions is that they go over the string once rather than once per search word, so they should have better algorithmic complexity than the solution by gnibbler. Of course, which one actually performs better on real-world data still needs to be measured by the OP (since the OP is the only one with the real-world data).

Upvotes: 4
