Reputation: 305
I've tried multiple times and in multiple ways to remove the extra punctuation from the string.
import string

class NLP:
    def __init__(self, sentence):
        self.sentence = sentence.lower()
        self.tokenList = []

    # problem: the punctuation is still included in the word
    def tokenize(self, sentence):
        for word in sentence.split():
            self.tokenList.append(word)
            for i in string.punctuation:
                if i in word:
                    word.strip(i)
                    self.tokenList.append(i)
A quick explanation of the code: what it is supposed to do is split each word and each punctuation mark apart and store them in a list. But when I have punctuation next to a word, it stays attached to the word. Below is an example where a comma remains grouped with the word 'hello':
['hello,', ',', 'my', 'name', 'is', 'freddy']
#      ^
#      there's the problem
Upvotes: 0
Views: 71
Reputation: 881675
A Python string is immutable. Therefore, word.strip(i) does not "change word in place" as you seem to assume; rather, it returns a copy of word, modified by the .strip(i) operation -- which removes characters only from the ends of the string, so that's not what you want either (unless you know the punctuation occurs in the word in a peculiar order).
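For instance, a quick illustration of both points (a minimal sketch, not code from the question):

word = "hello,world,"
word.strip(",")          # returns a new string; word itself is unchanged
print(word)              # hello,world,
print(word.strip(","))   # hello,world -- only the trailing comma is removed;
                         # the comma in the middle stays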
def tokenize(self, sentence):
    for word in sentence.split():
        punc = []
        for i in string.punctuation:
            howmany = word.count(i)
            if not howmany: continue
            # remove every occurrence of this punctuation character from the word
            word = word.replace(i, '')
            # record one list item per occurrence
            punc.extend(howmany * [i])
        self.tokenList.append(word)
        self.tokenList.extend(punc)
This assumes it's OK to have all the punctuation, one per item, after the cleaned-up word, independently of where within the word the punctuation appeared.
For example, should the sentence be (here), the list would be ['here', '(', ')'].
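For instance, the same logic as a standalone function (the class plumbing from your question is omitted here, and the lowercasing from your __init__ is assumed) run on your example sentence:

import string

def tokenize(sentence):
    tokens = []
    for word in sentence.lower().split():
        punc = []
        for i in string.punctuation:
            howmany = word.count(i)
            if not howmany: continue
            word = word.replace(i, '')
            punc.extend(howmany * [i])
        tokens.append(word)
        tokens.extend(punc)
    return tokens

print(tokenize("Hello, my name is Freddy"))
# ['hello', ',', 'my', 'name', 'is', 'freddy']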
If there are stricter constraints on the ordering of things in the list, please edit your Q to express them clearly -- ideally with examples of desired input and output, too!
Upvotes: 2
Reputation: 20695
I'd suggest a different approach:
import string
import itertools

def tokenize(s):
    tokens = []
    # group consecutive characters by whether they are punctuation or not
    for k, v in itertools.groupby(s, lambda c: c in string.punctuation):
        # join each run back into a string and split it on whitespace
        tokens.extend("".join(v).split())
    return tokens
A test:
>>> tokenize("this is, a test, you know")
['this', 'is', ',', 'a', 'test', ',', 'you', 'know']
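For intuition, itertools.groupby yields one run of consecutive characters each time the key (punctuation vs. not) changes; a minimal sketch of those runs, assuming the same imports:

import string
import itertools

for is_punct, chars in itertools.groupby("hello, world!",
                                         lambda c: c in string.punctuation):
    print(is_punct, repr("".join(chars)))
# False 'hello'
# True ','
# False ' world'
# True '!'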
Upvotes: 1