User54211

Reputation: 121

How to count the occurrences of one string (possibly multi-word) in another string (a sentence) in Python

I have to count the occurrences of a string (which can be one or more words) in another string (which is a sentence), and the match should not be case-sensitive.

For instance -

a = "Hi my name is Alex and hi to you as well. How high is the building? The highest floor is 18th. Highlights .... She said hi as well. Do you know highlights of the match ... hi."

b = "hi"  # word or phrase to count

I tried -

a.lower().count(b) 

which returns

>> 8 

while the required answer should be

>> 4.

For a multi-word string, this method seems to work, but I am not sure about the edge cases. How can I fix this?

Upvotes: 1

Views: 153

Answers (4)

Setop

Reputation: 2490

"Splitting" a sentence into words is not trivial.

There is a Python package for that: nltk.

First, install the package using pip or your system's package manager.

Then run ipython and use nltk.download() to download the "punkt" data: type d, then type punkt, then quit with q.

Then use

import nltk

tokens = nltk.word_tokenize(a)  # split the sentence into word and punctuation tokens
len(list(filter(lambda x: x.lower() == b, tokens)))

It returns 4.

Upvotes: 0

Prune

Reputation: 77837

The function works just fine: the sequence "hi" appears 8 times in the string. Since you want it only as words, you'll need to figure out how you can differentiate the word "hi" from the incidental appearance in other words, such as "chipper".
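To see where the eight matches come from, a quick check along these lines lists every whitespace-separated token that contains the sequence. Here a and b are the values from the question, and hits is just an illustrative name:

```python
a = ("Hi my name is Alex and hi to you as well. How high is the building? "
     "The highest floor is 18th. Highlights .... She said hi as well. "
     "Do you know highlights of the match ... hi.")
b = "hi"

# Every whitespace-separated token that contains the sequence "hi",
# whether it is the word itself or part of a longer word.
hits = [word for word in a.lower().split() if b in word]
print(hits)

# Number of such tokens, next to what str.count reports.
print(len(hits), a.lower().count(b))
```

Tokens such as "high", "highest" and "highlights" show up alongside the standalone "hi", which is why the plain substring count is too large.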

One common way is to use the re package (regular expressions), but that may be more learning than you want to do right now.

A better way at the moment would be to split the string into words before you check each:

word_list = a.lower().split()
b_count = word_list.count(b)

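Note that this considers only whitespace when dividing words. It still won't find "hi" in "hi-performance", for example, and a token with punctuation attached, such as the final "hi." in your sample, won't match either, so this counts 3 rather than 4 for your string. You'd need another split operation for other separators.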
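One possible way to handle the other separators, if you don't mind a small regular expression after all, is to pull out runs of word characters instead of splitting on whitespace. This is only a sketch using the a and b from the question, and it assumes b is a single word:

```python
import re

a = ("Hi my name is Alex and hi to you as well. How high is the building? "
     "The highest floor is 18th. Highlights .... She said hi as well. "
     "Do you know highlights of the match ... hi.")
b = "hi"

# \w+ matches runs of letters, digits and underscores, so punctuation and
# hyphens act as separators instead of staying attached to the words.
word_list = re.findall(r"\w+", a.lower())
b_count = word_list.count(b)
print(b_count)

# A hyphenated token is broken into its parts.
print(re.findall(r"\w+", "hi-performance"))
```

Because the text is reduced to single words, this does not count a multi-word b; that case needs a search over the original string.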

Upvotes: 0

Moses Koledoye

Reputation: 78556

You can use re.findall to search for the substring with leading and trailing word boundaries:

import re

# \b on each side is a word boundary; re.I makes the match case-insensitive.
# re.escape(b) keeps any regex metacharacters in b literal.
print(len(re.findall(r'\b{}\b'.format(re.escape(b)), a, re.I)))  # -> 4
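For the multi-word case from the question, the same idea could be wrapped in a small helper. This is a sketch rather than a definitive implementation: count_phrase is a hypothetical name, and it swaps \b for lookarounds so that a phrase beginning or ending with a non-word character (such as "c++") is still bounded correctly.

```python
import re

def count_phrase(text, phrase):
    """Count case-insensitive whole-word occurrences of phrase in text."""
    # re.escape keeps characters such as "+" or "." in the phrase literal.
    # The lookarounds require that no word character touches either end
    # of the match.
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

a = ("Hi my name is Alex and hi to you as well. How high is the building? "
     "The highest floor is 18th. Highlights .... She said hi as well. "
     "Do you know highlights of the match ... hi.")

print(count_phrase(a, "hi"))
print(count_phrase(a, "as well"))
print(count_phrase("C++ is fun. I like c++.", "c++"))
```

The phrase is matched literally, so the words in the text must be separated exactly as in the phrase (a single space here); a run of several spaces or a line break between them would not be counted.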

Upvotes: 3

Ajax1234

Reputation: 71451

Use str.split() and filter out punctuation with regex:

import re
a = "Hi my name is Alex and hi to you as well. How high is the building? The highest floor is 18th. Highlights .... She said hi as well. Do you know highlights of the match ... hi."
b = "hi"
# Strip non-word characters from each whitespace-separated token, then compare.
final_count = sum(re.sub(r"\W+", '', i.lower()) == b for i in a.split())

Output:

4

Upvotes: -1
