Shifu

Reputation: 2185

Removing Punctuation and Capitalization from TXT file

I've got a minor problem in Python. I have this script:

import nltk
def analyzer():
    inputfile=raw_input("Which file?: ")
    review=open(inputfile,'r')
    review=review.read()
    tokens=review.split()

    for token in tokens:
        if token in string.punctuation:         
            tokens.remove(token)
        token=tokens.lower()

It is supposed to read a txt file, split it into words, then remove punctuation and convert everything to lowercase. Shouldn't be difficult, right? But the output still has the punctuation and the uppercase intact. There is no error message; it just seems to ignore part of the code.

Any help would be much appreciated.

Upvotes: 0

Views: 8103

Answers (2)

user1149862

Reputation:

There are several problems in your code:

First, split() only splits on whitespace, so punctuation stays attached to the words; a token like 'word.' will never be found in string.punctuation.

Second, in for token in tokens, the name token is bound to each element of tokens in turn, so rebinding token does not change tokens. Calling tokens.remove() while iterating over the list also skips elements. Both pitfalls are shown in the sketch below.
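A minimal sketch of the second pitfall, using a made-up token list:

words = [',', '!', 'Hi']
for w in words:
    w = w.lower()        # rebinds the local name w only; words is untouched
print words              # [',', '!', 'Hi']

for w in words:
    if w in ',!':
        words.remove(w)  # mutating while iterating skips the next element
print words              # ['!', 'Hi'] -- the '!' survives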

Try this:

import string
import re

def analyzer():
    inputfile = raw_input("Which file?: ")
    review = open(inputfile, 'r')
    review = review.read()
    # re.split(r"(\W+)") splits on runs of non-word characters; the capturing
    # group keeps the separators in the result. Strip whitespace from every
    # piece, drop empty strings and pure punctuation, and lowercase the rest.
    tokens = [e.lower()
              for e in map(string.strip, re.split(r"(\W+)", review))
              if len(e) > 0 and not re.match(r"\W", e)]

    print tokens

analyzer()

The pattern [FUNC(x) for x in LIST if COND] builds a list from FUNC(x) for each element x of LIST for which COND is true. You can read up on filter and map for the functional equivalents, and on the re module for the regex part.
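For instance, here is the same comprehension run on a made-up sample string instead of a file:

import re
import string

review = "Hello, World! Test.me"
pieces = map(string.strip, re.split(r"(\W+)", review))
tokens = [e.lower() for e in pieces if len(e) > 0 and not re.match(r"\W", e)]
print tokens  # ['hello', 'world', 'test', 'me']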

Upvotes: 2

rgrinberg

Reputation: 9878

I'm assuming you have the string module imported. Replace the lines

if token in string.punctuation:         
     tokens.remove(token)
     token=tokens.lower()

with

token = token.translate(None,string.punctuation).lower()
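For a quick check of what that line does on a single token (Python 2, with a made-up sample string):

import string
print 'Test.Me!'.translate(None, string.punctuation).lower()  # prints: testme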

Also, strings are immutable in Python, so assigning to one just rebinds the name; it does not change the original elements of tokens. If you'd like to change the tokens, you can do the following:

tokens = [token.translate(None,string.punctuation).lower() for token in tokens]

Personally I would clean up the whole thing like this:

def read_tokens(path):
    import string
    with open(path) as f:   # the with block closes the file for you
        tokens = f.read().split()
        # strip all punctuation from each token, then lowercase it
        return [token.translate(None, string.punctuation).lower() for token in tokens]

print read_tokens(raw_input("which file?"))

Note that this is just a faithful translation of your original intentions, which means that a "word" like 'test.me' would turn into ['testme'] rather than ['test', 'me'].
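If you'd rather have 'test.me' split into two words, a regex variant could pull out runs of word characters instead (a sketch under that assumption; read_words is a hypothetical name):

import re

def read_words(path):
    with open(path) as f:
        # \w+ matches runs of word characters, so 'test.me' -> ['test', 'me']
        return [w.lower() for w in re.findall(r"\w+", f.read())]

print read_words(raw_input("which file?"))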

Upvotes: 2
