Reputation: 121
I have to count the occurrences of a string (one or more words) in another string (a sentence), and the match should not be case-sensitive.
For instance -
a = "Hi my name is Alex and hi to you as well. How high is the building? The highest floor is 18th. Highlights .... She said hi as well. Do you know highlights of the match ... hi."
b = "hi" #word/sentence to find count of
I tried -
a.lower().count(b)
which returns
>> 8
while the required answer should be
>> 4.
For multi-word strings, this method seems to work, but I am not sure about the edge cases. How can I fix this?
Upvotes: 1
Views: 153
Reputation: 2490
"Splitting" a sentence into words is not trivial.
There is a Python package for this: nltk.
First install it using pip or your system's package manager.
Then run ipython and call nltk.download() to fetch the "punkt" data: type d, then type punkt, then type q to quit.
Then use
import nltk
tokens = nltk.word_tokenize(a)
len(list(filter(lambda x: x.lower() == b, tokens)))
It returns 4.
Upvotes: 0
Reputation: 77837
The function works just fine: the sequence "hi" appears 8 times in the string. Since you want it only as words, you'll need to figure out how you can differentiate the word "hi" from the incidental appearance in other words, such as "chipper".
One common way is to use the re package (regular expressions), but that may be more learning than you want to do right now.
A better way at the moment would be to split the string into words before you check each:
word_list = a.lower().split()
b_count = word_list.count(b)
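Note that this splits only on whitespace, so punctuation stays attached to the words: the final "hi." in your example is not counted, and the result is 3 rather than 4. It also won't find "hi" in "hi-performance", for example. You'd need to strip punctuation or add another split operation for other separators.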
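One way to handle other separators without regular expressions is to turn punctuation into spaces before splitting. This is only a minimal sketch: the helper name count_word is made up for illustration, it assumes ASCII punctuation as listed in string.punctuation, and it works for a single word, not for a multi-word phrase.

```python
import string

def count_word(text, word):
    # Hypothetical helper, not part of any library.
    # Map every ASCII punctuation character to a space so that
    # "hi." and "hi-performance" both yield a standalone "hi".
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return words.count(word.lower())

a = "Hi my name is Alex and hi to you as well. How high is the building? The highest floor is 18th. Highlights .... She said hi as well. Do you know highlights of the match ... hi."
print(count_word(a, "hi"))  # -> 4
```

Because the text is split into individual words, a search string containing a space can never match, so multi-word phrases need a different approach.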
Upvotes: 0
Reputation: 78556
You can use re.findall to search for the substring with leading and trailing word boundaries:
import re
print(len(re.findall(r'\b{}\b'.format(b), a, re.I)))  # -> 4
# \b ... \b : word boundaries on both sides of the search string
# re.I      : case-insensitive match
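Since the search string may be a phrase or may contain characters that are special in a regular expression (such as "."), it can help to pass it through re.escape first. The following is a sketch under that assumption; the function name count_phrase is hypothetical and chosen only for this example.

```python
import re

def count_phrase(text, phrase):
    # Hypothetical helper: escape the phrase so characters like "." or "?"
    # are matched literally, then require word boundaries on both ends.
    pattern = r'\b{}\b'.format(re.escape(phrase))
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

a = "Hi my name is Alex and hi to you as well. How high is the building? The highest floor is 18th. Highlights .... She said hi as well. Do you know highlights of the match ... hi."
print(count_phrase(a, "hi"))       # -> 4
print(count_phrase(a, "as well"))  # -> 2
```

One caveat: \b only marks a boundary between a word character and a non-word character, so this behaves as expected when the phrase itself starts and ends with a letter, digit, or underscore.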
Upvotes: 3
Reputation: 71451
Use str.split() and filter out punctuation with regex:
import re
a = "Hi my name is Alex and hi to you as well. How high is the building? The highest floor is 18th. Highlights .... She said hi as well. Do you know highlights of the match ... hi."
b = "hi"
# Raw string avoids the invalid-escape warning for \W in newer Python versions.
final_count = sum(re.sub(r"\W+", '', i.lower()) == b for i in a.split())
Output:
4
Upvotes: -1