Reputation: 163
from glob import glob

pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)

def countwords(fp):
    # Count the whitespace-separated tokens in one file.
    with open(fp) as fh:
        return len(fh.read().split())

total = sum(map(countwords, filelist))
print("There are %d words in the files. From directory %s" % (total, pattern))
import os
import re
import string

uniquewords = set()
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        # Add every whitespace-separated token; punctuation stays attached.
        with open(os.path.join(root, name)) as fh:
            uniquewords.update(fh.read().split())
wordlist = list(uniquewords)
This code counts the total number of words and collects the unique words. The problem is that len(uniquewords) returns an unreasonably large number, because 'shake', 'shake!', 'shake,' and 'shake?' are each treated as a different unique word. I've tried to remove the punctuation from uniquewords by converting it to a list and modifying that, but every attempt failed. Can anybody help me?
Upvotes: 0
Views: 243
Reputation: 15345
Use the \w+ pattern to match words and exclude punctuation, and collections.Counter to count them.
The example data for this code is appended at the end:
import re
from collections import Counter

pattern = re.compile(r'\w+')
with open('data') as f:
    text = f.read()
print(Counter(pattern.findall(text)))
gives:
Counter(
{'in': 4, 'the': 4, 'string': 3, 'matches': 3, 'are': 2,
'pattern': 2, '2': 2, 'and': 1, 'all': 1, 'finditer': 1,
'iterator': 1, 'over': 1, 'an': 1, 'instances': 1,
'scanned': 1, 'right': 1, 'RE': 1, 'another': 1, 'touch': 1,
'New': 1, 'to': 1, 'returned': 1, 'Return': 1, 'for': 1,
'0': 1, 're': 1, 'version': 1, 'Empty': 1, 'is': 1,
'match': 1, 'non': 1, 'unless': 1, 'overlapping': 1, 'they': 1,
'included': 1, 'The': 1, 'beginning': 1, 'MatchObject': 1,
'result': 1, 'of': 1, 'yielding': 1, 'flags': 1, 'found': 1,
'order': 1, 'left': 1})
data:
re.finditer(pattern, string, flags=0) Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match. New in version 2.2.
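Applied to the directory walk from the question, the same idea might look like the sketch below. The helper name count_words, the .txt filter and the lowercasing step are my assumptions, not part of the code above. Lowercasing merges 'The' and 'the', so drop it if case matters. Note that \w+ also splits on apostrophes, so "don't" becomes 'don' and 't'.

```python
import os
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")

def count_words(root_dir):
    """Return a Counter of lowercased words from all .txt files under root_dir."""
    counts = Counter()
    for root, dirs, files in os.walk(root_dir):
        for name in files:
            if not name.endswith(".txt"):
                continue  # assumption: only the .txt files are of interest
            with open(os.path.join(root, name)) as fh:
                counts.update(WORD_RE.findall(fh.read().lower()))
    return counts

# 'shake', 'shake!', 'shake,' and 'Shake?' all collapse to one word.
sample = Counter(WORD_RE.findall("shake shake! shake, Shake?".lower()))
print(sample)  # Counter({'shake': 4})
```

With counts = count_words("D:\\report\\shakeall"), len(counts) gives the number of unique words and sum(counts.values()) gives the total word count.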
Upvotes: 1