Reputation: 2185
I've got a minor problem in python. I have the script:
import nltk
def analyzer():
    inputfile=raw_input("Which file?: ")
    review=open(inputfile,'r')
    review=review.read()
    tokens=review.split()
    for token in tokens:
        if token in string.punctuation:
            tokens.remove(token)
        token=tokens.lower()
It is supposed to read a txt file, split it into words, remove the punctuation, and convert everything to lowercase. Shouldn't be difficult, right? But it returns the tokens with the punctuation and uppercase intact. There is no error message; it just seems to ignore part of the code.
Any help would be much appreciated.
Upvotes: 0
Views: 8103
Reputation:
There are several problems in your code:

First, split() only splits on whitespace, so punctuation stays attached to the words.

Second, in for token in tokens, the loop variable token is just a name bound to each element, so reassigning token doesn't change tokens. (Calling tokens.remove() while iterating over the same list is also unreliable: the iterator skips elements after a removal.)
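To see why the loop has no effect, here is a small demonstration (the token values are made up for illustration):

```python
import string

tokens = ["Hello,", "World"]
for token in tokens:
    if token in string.punctuation:   # "Hello," is not a substring of the punctuation set
        tokens.remove(token)          # so this branch never fires
    token = token.lower()             # rebinds the loop variable only

print(tokens)                         # the list is unchanged: ['Hello,', 'World']
```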
Try this:
import string
import re

def analyzer():
    inputfile = raw_input("Which file?: ")
    review = open(inputfile, 'r')
    review = review.read()
    tokens = [e.lower() for e in map(string.strip, re.split(r"(\W+)", review))
              if len(e) > 0 and not re.match(r"\W", e)]
    print tokens

analyzer()
The pattern [FUNC(x) for x in LIST if COND] gives a list constructed from FUNC(x), where x is each element of LIST for which COND is true. You could also look at filter and map. For the regex part, see the re module.
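For example, on a made-up review string, the split-and-filter step behaves like this (the filter drops empty strings and the separator pieces):

```python
import re

review = "Great movie, really!"
parts = re.split(r"(\W+)", review)   # the capturing group keeps the separators
print(parts)                         # ['Great', ' ', 'movie', ', ', 'really', '!', '']

tokens = [e.lower() for e in parts if e and not re.match(r"\W", e)]
print(tokens)                        # ['great', 'movie', 'really']
```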
Upvotes: 2
Reputation: 9878
I'm assuming you have the string module imported. Replace the lines

    if token in string.punctuation:
        tokens.remove(token)
    token = tokens.lower()
with
    token = token.translate(None, string.punctuation).lower()
Also, strings are immutable in Python, so assigning to token just rebinds the name; it does not change the original tokens. If you'd like to change all the tokens, you can do the following:
    tokens = [token.translate(None, string.punctuation).lower() for token in tokens]
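Note that this is the Python 2 form: on Python 3, str.translate no longer accepts a deletechars argument, and the equivalent spelling builds a deletion table with str.maketrans (sample tokens below are made up):

```python
import string

# Python 3 spelling of the same cleanup: a table that deletes all punctuation
table = str.maketrans('', '', string.punctuation)
tokens = ["Great", "movie,", "really!"]
cleaned = [t.translate(table).lower() for t in tokens]
print(cleaned)   # ['great', 'movie', 'really']
```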
Personally I would clean up the whole thing like this:
def read_tokens(path):
    import string
    with open(path) as f:
        tokens = f.read().split()
    return [token.translate(None, string.punctuation).lower() for token in tokens]

read_tokens(raw_input("which file?"))
Note that this is just a faithful translation of your original intentions, which means that a "word" like 'test.me' would turn into ['testme'] rather than ['test', 'me'].
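A quick comparison of the two behaviours on that example (Python 3 spelling of translate shown):

```python
import re
import string

word = "test.me"
# deleting punctuation glues the halves together
glued = word.translate(str.maketrans('', '', string.punctuation))
print(glued)                    # 'testme'
# splitting on non-word characters keeps them apart
print(re.split(r"\W+", word))   # ['test', 'me']
```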
Upvotes: 2