Reputation: 161
I am trying to identify a particular word and then count it. I need to save the count for each identifier.
For example, a document may contain the following:
risk risk risk free interest rate
asterisk risk risk
market risk risk [risk
I need to count 'risk' but not 'asterisk'. There could be other risk-related words, so the solution should not be tied to the example above. What I need to find is 'risk'. If 'risk' is preceded or followed by a character such as < [ ( . ! * > ] ), I need to count it as well. But if 'risk' is part of a larger word, like 'asterisk', it should not be counted.
Here is what I have so far. However, it reports counts for asterisk and [risk as well as for risk. I tried using a regular expression but kept getting errors. I am also a Python beginner. If anyone has any ideas, please help! Thanks.
from collections import defaultdict

word_dict = defaultdict(int)
for line in mylist:
    words = line.lower().split()  # converted all words to lower case
    for word in words:
        word_dict[word] += 1

# report every token that contains 'risk' (this is what over-counts)
for word in word_dict:
    if 'risk' in word:
        print(word, word_dict[word])
Upvotes: 1
Views: 187
Reputation: 10126
The regular expression (?<![a-zA-Z])risk(?![a-zA-Z]) should match "risk" if it's not preceded or followed by another letter. For example:
>>> import re
>>> len(re.findall(r'(?<![a-zA-Z])risk(?![a-zA-Z])', 'risk? 1risk asterisk risky'))
2
Here's the breakdown of this re:
(?<![a-zA-Z])
This negative lookbehind assertion says that the match will only happen if it is not preceded by a match for [a-zA-Z], which in turn just matches a letter.
risk
This is the central re that matches "risk"; nothing fancy here...
(?![a-zA-Z])
This is similar to the first part. It is a negative lookahead assertion that makes the match happen only if it is not followed by a letter.
So, say you also don't want to match things like "1risk" that have numbers before them. You can just change the [a-zA-Z] portion of the re to [a-zA-Z0-9]. E.g.:
>>> len(re.findall('(?<![a-zA-Z0-9])risk(?![a-zA-Z0-9])','risk? 1risk asterisk risky'))
1
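To tie this back to the loop in the question, here is a minimal sketch that applies the pattern to each line and stores the count per line. `mylist` is the name used in the question; its sample contents and the names `RISK_RE` and `counts` are assumptions made for illustration, and the line index merely stands in for whatever identifier you actually have.

```python
import re

# Assumed sample data, modeled on the example lines in the question.
mylist = [
    "risk risk risk free interest rate",
    "asterisk risk risk",
    "market risk risk [risk",
]

# "risk" not preceded or followed by a letter or digit.
# IGNORECASE replaces the .lower() call from the question.
RISK_RE = re.compile(r'(?<![a-zA-Z0-9])risk(?![a-zA-Z0-9])', re.IGNORECASE)

# Map each line index (a stand-in for an identifier) to its count.
counts = {}
for i, line in enumerate(mylist):
    counts[i] = len(RISK_RE.findall(line))

print(counts)  # {0: 3, 1: 2, 2: 3}
```

Because the pattern is applied to the whole line rather than to whitespace-split tokens, "[risk" is counted as "risk" and "asterisk" is skipped.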
Update: In response to your question "How to replace words, count a word, and save the count", I now get what you are asking for. You can use the same type of structure I have shown you, but modified to include all of these words.
There are a couple of ways to modify the original re; the most intuitive is probably to just use the re OR operator, |, and add \- to the negative lookahead to prevent matching on "risk-free" and such. For example:
>>> words = '|'.join(["risk","risked","riskier","riskiest","riskily","riskiness","risking","risks","risky"])
>>> len(re.findall(r'(?<![a-zA-Z])(%s)(?![a-zA-Z\-])' % words, 'risk? 1risk risky risk-free'))
3
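If you also need to know how often each individual form occurs, one option is to feed the matches into `collections.Counter`. This is only a sketch: the names `forms`, `pattern`, `text`, and `per_form` are made up for the example, and the sample text is an assumption.

```python
import re
from collections import Counter

forms = ["risk", "risked", "riskier", "riskiest", "riskily",
         "riskiness", "risking", "risks", "risky"]

# Same structure as above: one capturing group holding all forms,
# guarded by the lookbehind and the lookahead (which also rejects "-").
pattern = re.compile(r'(?<![a-zA-Z])(%s)(?![a-zA-Z\-])' % '|'.join(forms))

text = 'risk? 1risk risky risk-free risks'

# With a single group, findall returns the matched form for each hit.
per_form = Counter(pattern.findall(text))

print(per_form['risk'], per_form['risky'], per_form['risks'])  # 2 1 1
print(sum(per_form.values()))                                  # 4
```

The total is the same number `len(re.findall(...))` would give; the `Counter` just keeps the breakdown by form.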
Upvotes: 2
Reputation: 2297
It's actually quite easy to do this with regular expressions:
import re
haystack = "risk asterisk risk brisk risk"
prog = re.compile(r'\brisk\b')
result = prog.findall(haystack)
print(len(result))
This outputs "3".
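The \b regexp matches at a word boundary: the position between a word character (a letter, digit, or underscore) and a non-word character, including the beginning and end of the string.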
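One consequence worth knowing: because digits and underscores count as word characters, \b does not treat "1risk" or "risk_free" as containing a standalone "risk", whereas a letters-only lookaround does. A small comparison, with a made-up `sample` string:

```python
import re

sample = "risk 1risk risk_free [risk] asterisk"

# \b: digits and underscores are word characters, so no boundary there.
word_boundary = re.findall(r'\brisk\b', sample)

# Lookarounds that only exclude neighbouring letters.
letters_only = re.findall(r'(?<![a-zA-Z])risk(?![a-zA-Z])', sample)

print(len(word_boundary))  # 2  ("risk" and "[risk]")
print(len(letters_only))   # 4  (also "1risk" and "risk_free")
```

Which behaviour is right depends on whether tokens like "1risk" should count in your data.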
Upvotes: 2