Reputation: 146
I really apologize if this has been answered before but I have been scouring SO and Google for a couple of hours now on how to properly do this. It should be easy and I know I am missing something simple.
I am trying to read from a file and count all occurrences of elements from a list. This list is not just whole words though. It has special characters and punctuation that I need to get as well.
This is what I have so far. I have been trying various approaches, and this post got me the closest: Python - Finding word frequencies of list of words in text file
So I have a file that contains a couple of paragraphs and my list of strings is:
listToCheck = ['the','The ','the,','the;','the!','the\'','the.','\'the']
My full code is:
#!/usr/bin/python
import re
from collections import Counter
f = open('text.txt','r')
wanted = ['the','The ','the,','the;','the!','the\'','the.','\'the']
words = re.findall(r'\w+', f.read().lower())
cnt = Counter()
for word in words:
    if word in wanted:
        print(word)
        cnt[word] += 1
print(cnt)
my output thus far looks like:
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
Counter({'the': 17})
It is counting my "the" strings with punctuation, but it is not counting them under separate keys. I know it is because of the \w+, which only matches word characters and so drops the punctuation. I am just not sure what the proper regex pattern is here, or whether I'm going about this the wrong way.
Upvotes: 0
Views: 1674
Reputation: 11347
The simplest option is to combine all "wanted" strings into one regular expression:
rr = '|'.join(map(re.escape, wanted))
and then find all matches in the text using re.findall.
To make sure longer strings match first, just sort the wanted list by length:
wanted.sort(key=len, reverse=True)
rr = '|'.join(map(re.escape, wanted))
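Putting the pieces together, a minimal sketch might look like the following. The sample text is made up for illustration and stands in for the contents of text.txt. Note that matches do not overlap, so each occurrence is counted under the first (here: longest) wanted string that matches at that position.

```python
import re
from collections import Counter

wanted = ['the', 'The ', 'the,', 'the;', 'the!', "the'", 'the.', "'the"]

# Longest alternatives first, so 'the,' is tried before plain 'the'.
# sorted() returns a new list and leaves `wanted` untouched.
ordered = sorted(wanted, key=len, reverse=True)
rr = '|'.join(map(re.escape, ordered))

# Hypothetical sample text (an assumption, not the asker's file).
text = "The cat saw the dog, the; bird and the. Then 'the end of the!"

# Counter accepts the list of matches directly.
cnt = Counter(re.findall(rr, text))
print(cnt)
```

No .lower() call is applied to the text here, because the wanted list itself contains the capitalised entry 'The '.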
Upvotes: 0
Reputation: 2464
I suspect there may be some extra details to your specific problem that you are not describing here for simplicity. However, I'll assume that what you are looking for is to find a given word, e.g. "the", which could have either an upper or lower case first letter, and can be preceded and followed either by a whitespace or by some punctuation characters such as ;,.!'. You want to count the number of all the distinct instances of this general pattern.
I would define a single (non-disjunctive) regular expression that defines this. Something like this:
import re
pattern = re.compile(r"[\s',;.!][Tt]he[\s.,;'!]")
(That might not be exactly what you are looking for in general. I am just assuming it is, based on what you stated above.)
Now, let's say our text is
text = '''
Foo and the foo and ;the, foo. The foo 'the and the;
and the' and the; and foo the, and the. foo.
'''
We could do
matches = pattern.findall(text)
where matches will be
[' the ',
';the,',
' The ',
"'the ",
' the;',
" the'",
' the;',
' the,',
' the.']
And then you just count.
from collections import Counter
count = Counter()
for match in matches:
    count[match] += 1
which in this case would lead to
Counter({' the;': 2, ' the.': 1, ' the,': 1, " the'": 1, ' The ': 1, "'the ": 1, ';the,': 1, ' the ': 1})
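For reference, the steps above can be collected into one runnable sketch. It uses the same pattern and sample text; the only change is that the matches are handed to Counter directly, which should be equivalent to the explicit loop.

```python
import re
from collections import Counter

pattern = re.compile(r"[\s',;.!][Tt]he[\s.,;'!]")

text = '''
Foo and the foo and ;the, foo. The foo 'the and the;
and the' and the; and foo the, and the. foo.
'''

# Counter accepts any iterable, so the explicit loop is optional.
count = Counter(pattern.findall(text))

print(count[' the;'])       # occurrences of " the;"
print(sum(count.values()))  # total number of matches
```

Keep in mind that each match consumes its surrounding delimiter characters, so two occurrences that share a single delimiter (for example "the the") would likely be counted only once with this pattern.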
As I said at the start, this might not be exactly what you want, but hopefully you could modify this to get what you want.
Just to add, a difficulty with using a disjunctive regular expression like
'the|the;|the,|the!'
is that the strings like "the," and "the;" will also match the first option, i.e. "the", and that will be returned as the match. Even though this problem could be avoided by more careful ordering of the options, I think it might not be easier in general.
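A small sketch of that ordering effect, using a made-up three-word text: Python's re module tries alternatives left to right and takes the first one that matches, not the longest.

```python
import re

text = "the, the; the!"

# Plain 'the' listed first: it wins at every position, so the
# punctuated alternatives are never reached.
naive = re.findall(r'the|the;|the,|the!', text)
print(naive)

# Longer alternatives listed first: each variant is matched as written.
ordered = re.findall(r'the;|the,|the!|the', text)
print(ordered)
```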
Upvotes: 1