Jimmy

Reputation: 161

How to find a particular type of word and count it

I am trying to identify a particular word and then count it. I need to save the count for each identifier.

For example, a document may contain lines like these:

risk risk risk free interest rate 

asterisk risk risk 

market risk risk [risk

I need to count 'risk' but not 'asterisk'. There could be other risk-related words, so the solution should not depend on this exact example. What I need to find is 'risk'. If 'risk' starts or ends with punctuation such as < [ ( . ! * > ] ), I need to count it as well. But if 'risk' is part of a longer word, like 'asterisk', it should not be counted.

Here is what I have so far. However, it returns a count for asterisk and [risk as well as risk. I tried to use a regular expression but kept getting errors. I am also a beginner in Python. If anyone has any idea, please help! Thanks.

from collections import defaultdict
word_dict=defaultdict(int)

for line in mylist:
    words=line.lower().split()  # converted all words to lower case
    for word in words:
        word_dict[word]+=1

for word in word_dict:
    if 'risk' in word:  # substring test, so 'asterisk' and '[risk' also match
        print(word, word_dict[word])

Upvotes: 1

Views: 187

Answers (3)

Matthew Adams

Reputation: 10126

The regular expression (?<![a-zA-Z])risk(?![a-zA-Z]) should match "risk" if it's not preceded or followed by another letter. For example:

>>> import re
>>> len(re.findall(r'(?<![a-zA-Z])risk(?![a-zA-Z])', 'risk? 1risk asterisk risky'))
2

Here's the breakdown of this re:

  • (?<![a-zA-Z]) This negative lookbehind assertion says that the match will only happen if it is not preceded by a match for [a-zA-Z], which in turn just matches a letter.
  • risk This is the central re that matches "risk"; nothing fancy here...
  • (?![a-zA-Z]) This is similar to the first part. It is a negative lookahead assertion that makes the match happen only if it is not followed by a letter.

Now suppose you also don't want to match things like "1risk" that have a digit in front of them. You can change the [a-zA-Z] portions of the re to [a-zA-Z0-9]. E.g.:

>>> len(re.findall('(?<![a-zA-Z0-9])risk(?![a-zA-Z0-9])','risk? 1risk asterisk risky'))
1
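
Applied to the sample lines from the question, the letter-only pattern could be wrapped in a small helper. This is only a sketch: the names count_risk and sample_lines are made up for illustration, and the re.IGNORECASE flag is an assumption based on the question lowercasing its input.

```python
import re

# Lookarounds reject "risk" when a letter sits directly before or after it.
RISK = re.compile(r'(?<![a-zA-Z])risk(?![a-zA-Z])', re.IGNORECASE)

def count_risk(line):
    """Return how many standalone occurrences of 'risk' appear in line."""
    return len(RISK.findall(line))

# Sample lines copied from the question (hypothetical variable name).
sample_lines = [
    "risk risk risk free interest rate",
    "asterisk risk risk",
    "market risk risk [risk",
]

counts = [count_risk(line) for line in sample_lines]
print(counts)  # [3, 2, 3]
```

Here "[risk" is counted because "[" is not a letter, while "asterisk" is skipped because "e" comes directly before "risk".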

Update: In response to your question "How to replace words, count a word, and save the count", I now get what you are asking for. You can use the same type of structure I have shown you, but modified to include all of these words:

  • risk
  • risked
  • riskier
  • riskiest
  • riskily
  • riskiness
  • risking
  • risks
  • risky

There are a couple of ways to modify the original re; the most intuitive is probably to use the re alternation operator | and add \- to the negative lookahead to prevent matching on "risk-free" and such. For example:

>>> words = '|'.join(["risk","risked","riskier","riskiest","riskily","riskiness","risking","risks","risky"])
>>> len(re.findall(r'(?<![a-zA-Z])(%s)(?![a-zA-Z\-])' % words, 'risk? 1risk risky risk-free'))
3
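
If a separate count per variant is wanted rather than one total, the matches could be fed to collections.Counter. This is a sketch under assumptions: count_variants is a hypothetical helper name, the sample text is invented, and lowercasing the input is carried over from the question's code.

```python
import re
from collections import Counter

variants = ["risk", "risked", "riskier", "riskiest", "riskily",
            "riskiness", "risking", "risks", "risky"]

# Same structure as above: no letter before, no letter or hyphen after.
pattern = re.compile(r'(?<![a-zA-Z])(%s)(?![a-zA-Z\-])' % '|'.join(variants))

def count_variants(text):
    """Return a Counter mapping each matched variant to its count."""
    # findall returns the text of the single capture group for each match.
    return Counter(pattern.findall(text.lower()))

tally = count_variants('Risk? 1risk risky risk-free risks')
print(tally['risk'], tally['risky'], tally['risks'])  # 2 1 1
```

The lookahead is what lets "risky" match as "risky" and not as "risk": the shorter alternative is tried first, fails because a letter follows it, and the engine moves on to the longer alternatives.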

Upvotes: 2

antun

Reputation: 2297

It's actually quite easy to do this with regular expressions:

import re
haystack = "risk asterisk risk brisk risk"
prog = re.compile(r'\brisk\b')
result = prog.findall(haystack)
print(len(result))

This outputs "3".

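The \b regexp matches a word boundary: the position between a word character (a letter, digit, or underscore) and a non-word character, including the beginning and end of the string. Note that this means "1risk" and "risk_free" are not counted, because digits and underscores are word characters.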
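
Because \b treats digits and underscores as word characters, it can give a different count from a letter-only lookaround. A short comparison, using a made-up sample string:

```python
import re

# Invented sample covering punctuation, a digit prefix, and an underscore.
text = "risk [risk risk. 1risk risk_free asterisk"

# \b needs a word/non-word transition on both sides of "risk".
word_boundary = re.findall(r'\brisk\b', text)

# The lookarounds only forbid ASCII letters on either side.
letters_only = re.findall(r'(?<![a-zA-Z])risk(?![a-zA-Z])', text)

print(len(word_boundary))  # 3
print(len(letters_only))   # 5
```

Which of the two is right depends on whether tokens like "1risk" should count as the word "risk".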

Upvotes: 2

PasteBT

Reputation: 2198

if 'risk' == word:  # exact match: skips 'asterisk', but also skips '[risk'
    print(word, word_dict[word])
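
An exact comparison only works if the surrounding punctuation has been removed from each token first. One possible way to do that, as a sketch, is to strip leading and trailing punctuation with string.punctuation before counting. The contents of mylist here are assumed to be the sample lines from the question.

```python
import string
from collections import defaultdict

# Assumed input: the sample lines from the question.
mylist = [
    "risk risk risk free interest rate",
    "asterisk risk risk",
    "market risk risk [risk",
]

word_dict = defaultdict(int)
for line in mylist:
    for word in line.lower().split():
        # Strip punctuation from both ends only, so "[risk" becomes "risk".
        word_dict[word.strip(string.punctuation)] += 1

print(word_dict['risk'])  # 8
```

Since only the ends of each token are stripped, something like "risk-free" stays a separate key and is not counted as "risk".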

Upvotes: 0