Shifu

Reputation: 2185

Removing Punctuation and Capitalization from TXT file

I've got a minor problem in Python. I have this script:

import nltk
def analyzer():
    inputfile=raw_input("Which file?: ")
    review=open(inputfile,'r')
    review=review.read()
    tokens=review.split()

    for token in tokens:
        if token in string.punctuation:         
            tokens.remove(token)
        token=tokens.lower()

It is supposed to read a txt file, split it into words, then remove punctuation and convert everything to lowercase. Shouldn't be difficult, right? But the output still has the punctuation and the uppercase intact. There is no error message; it just seems to ignore part of the code.

Any help would be much appreciated.

Upvotes: 0

Views: 8103

Answers (2)

user1149862

Reputation:

There are several problems in your code:

First, split() only splits on whitespace, so punctuation stays attached to the words; a token like 'word.' will never be found in string.punctuation.

Second, in for token in tokens, the name token is bound to each element of tokens in turn, so rebinding token does not change tokens. Calling tokens.remove() while iterating over the list also skips elements. Both pitfalls are shown in the sketch below.
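A minimal sketch of the second pitfall, using a made-up token list:

words = [',', '!', 'Hi']
for w in words:
    w = w.lower()        # rebinds the local name w only; words is untouched
print words              # [',', '!', 'Hi']

for w in words:
    if w in ',!':
        words.remove(w)  # mutating while iterating skips the next element
print words              # ['!', 'Hi'] -- the '!' survives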

Try this:

import string
import re

def analyzer():
    inputfile = raw_input("Which file?: ")
    review = open(inputfile, 'r')
    review = review.read()
    # re.split(r"(\W+)") splits on runs of non-word characters; the capturing
    # group keeps the separators in the result. Strip whitespace from every
    # piece, drop empty strings and pure punctuation, and lowercase the rest.
    tokens = [e.lower()
              for e in map(string.strip, re.split(r"(\W+)", review))
              if len(e) > 0 and not re.match(r"\W", e)]

    print tokens

analyzer()

The pattern [FUNC(x) for x in LIST if COND] builds a list from FUNC(x) for each element x of LIST for which COND is true. You can read up on filter and map for the functional equivalents, and on the re module for the regex part.
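For instance, here is the same comprehension run on a made-up sample string instead of a file:

import re
import string

review = "Hello, World! Test.me"
pieces = map(string.strip, re.split(r"(\W+)", review))
tokens = [e.lower() for e in pieces if len(e) > 0 and not re.match(r"\W", e)]
print tokens  # ['hello', 'world', 'test', 'me']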

Upvotes: 2

rgrinberg

Reputation: 9878

I'm assuming you have the string module imported. Replace the lines

if token in string.punctuation:         
     tokens.remove(token)
     token=tokens.lower()

with

token = token.translate(None,string.punctuation).lower()
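For a quick check of what that line does on a single token (Python 2, with a made-up sample string):

import string
print 'Test.Me!'.translate(None, string.punctuation).lower()  # prints: testme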

Also, strings are immutable in Python, so assigning to one just rebinds the name; it does not change the original elements of tokens. If you'd like to change the tokens, you can do the following:

tokens = [token.translate(None,string.punctuation).lower() for token in tokens]

Personally I would clean up the whole thing like this:

def read_tokens(path):
    import string
    with open(path) as f:   # the with block closes the file for you
        tokens = f.read().split()
        # strip all punctuation from each token, then lowercase it
        return [token.translate(None, string.punctuation).lower() for token in tokens]

print read_tokens(raw_input("which file?"))

Note that this is just a faithful translation of your original intentions, which means that a "word" like 'test.me' would turn into ['testme'] rather than ['test', 'me'].
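If you'd rather have 'test.me' split into two words, a regex variant could pull out runs of word characters instead (a sketch under that assumption; read_words is a hypothetical name):

import re

def read_words(path):
    with open(path) as f:
        # \w+ matches runs of word characters, so 'test.me' -> ['test', 'me']
        return [w.lower() for w in re.findall(r"\w+", f.read())]

print read_words(raw_input("which file?"))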

Upvotes: 2
