rocksland

Reputation: 163

Removing punctuation from list items using Python

import os
from glob import glob

pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)

def countwords(fp):
    # Count whitespace-separated tokens in one file
    with open(fp) as fh:
        return len(fh.read().split())

print("There are %d words in the files. From directory %s" % (sum(map(countwords, filelist)), pattern))

# Collect every distinct whitespace-separated token under the directory
uniquewords = set()
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        with open(os.path.join(root, name)) as fh:
            uniquewords.update(fh.read().split())
wordlist = list(uniquewords)

This code counts the total number of words and the number of unique words. The problem is that len(uniquewords) returns an unreasonably large number, because it treats, for example, 'shake', 'shake!', 'shake,' and 'shake?' as different words. I've tried to remove the punctuation from uniquewords by converting it to a list and modifying the items, but everything I tried failed. Can anybody help me?

Upvotes: 0

Views: 243

Answers (1)

tzelleke

Reputation: 15345

  1. Use a regex with the \w+ pattern to match words and exclude punctuation.
  2. When counting in Python, use collections.Counter.
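As a quick illustration of both points, here is a minimal sketch. The sample sentence is made up for this example and only reuses the 'shake' variants from the question; it is not taken from the asker's files.

```python
import re
from collections import Counter

# Made-up sample reusing the variants mentioned in the question
sample = "shake shake! shake, shake? Shake"

# Splitting on whitespace keeps the punctuation attached to each token
naive = set(sample.split())

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped
words = re.findall(r"\w+", sample)
counts = Counter(words)

print(len(naive))   # number of distinct tokens when splitting on whitespace
print(counts)       # per-word counts after matching with \w+
```

Note that matching is case-sensitive, so 'Shake' and 'shake' are still counted separately unless the text is lowercased first. \w+ also splits on apostrophes and hyphens, so "don't" becomes 'don' and 't'; whether that matters depends on the text.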

The example data for the code below is appended at the end:

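import re
from collections import Counter

# \w+ matches runs of letters, digits and underscores, so punctuation is skipped
pattern = re.compile(r'\w+')

with open('data') as f:
    text = f.read()

# Count how often each matched word occurs
print(Counter(pattern.findall(text)))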

gives:

Counter(
{'in': 4, 'the': 4, 'string': 3, 'matches': 3, 'are': 2,
'pattern': 2, '2': 2, 'and': 1, 'all': 1, 'finditer': 1,
'iterator': 1, 'over': 1, 'an': 1, 'instances': 1,
'scanned': 1, 'right': 1, 'RE': 1, 'another': 1, 'touch': 1,
'New': 1, 'to': 1, 'returned': 1, 'Return': 1, 'for': 1,
'0': 1, 're': 1, 'version': 1, 'Empty': 1, 'is': 1,
'match': 1, 'non': 1, 'unless': 1, 'overlapping': 1, 'they': 1,
'included': 1, 'The': 1, 'beginning': 1, 'MatchObject': 1,
'result': 1, 'of': 1, 'yielding': 1, 'flags': 1, 'found': 1,
'order': 1, 'left': 1})

data:

re.finditer(pattern, string, flags=0) Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match. New in version 2.2.
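Applied to the directory walk from the question, the same idea might look roughly like the sketch below. The function name count_words_in_tree and the lowercase option are hypothetical names introduced for this example, and lowercasing is an assumption about what should count as the same word; the directory path is the one from the question.

```python
import os
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")

def count_words_in_tree(top, lowercase=True):
    """Return a Counter of the words found in every file under `top`."""
    counts = Counter()
    for root, dirs, files in os.walk(top):
        for name in files:
            with open(os.path.join(root, name)) as fh:
                text = fh.read()
            if lowercase:
                # Assumption: 'Shake' and 'shake' should be the same word
                text = text.lower()
            counts.update(WORD_RE.findall(text))
    return counts

if __name__ == "__main__":
    counts = count_words_in_tree("D:\\report\\shakeall")
    print("total words: %d" % sum(counts.values()))
    print("unique words: %d" % len(counts))
```

A single Counter gives both numbers: the sum of its values is the total word count and its length is the number of unique words, so the separate glob pass is not needed.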

Upvotes: 1
