Reputation: 305
I've tried multiple times and in multiple ways to remove the extra punctuation from the string.
import string

class NLP:
    def __init__(self, sentence):
        self.sentence = sentence.lower()
        self.tokenList = []

    # problem: the punctuation is still included in the word
    def tokenize(self, sentence):
        for word in sentence.split():
            self.tokenList.append(word)
            for i in string.punctuation:
                if i in word:
                    word.strip(i)
                    self.tokenList.append(i)
A quick explanation of the code: what it is supposed to do is split each word and each punctuation mark apart and store them in a list. But when I have punctuation next to a word, it stays attached to the word. Below is an example where a comma remains grouped with the word 'hello':
['hello,', ',', 'my', 'name', 'is', 'freddy']
#      ^
#      there's the problem
Upvotes: 0
Views: 71
Reputation: 881675
A Python string is immutable. Therefore, word.strip(i) does not "change word in place" as you seem to assume; rather, it returns a copy of word, modified by the .strip(i) operation -- which removes characters only from the ends of the string, so that's not what you want either (unless you know the punctuation occurs in the word in a peculiar order).
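For instance, a quick illustration of both points (a minimal sketch, not code from the question):

word = "hello,world,"
word.strip(",")          # returns a new string; word itself is unchanged
print(word)              # hello,world,
print(word.strip(","))   # hello,world -- only the trailing comma is removed;
                         # the comma in the middle stays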
def tokenize(self, sentence):
    for word in sentence.split():
        punc = []
        for i in string.punctuation:
            howmany = word.count(i)
            if not howmany: continue
            # remove every occurrence of this punctuation character from the word
            word = word.replace(i, '')
            # record one list item per occurrence
            punc.extend(howmany * [i])
        self.tokenList.append(word)
        self.tokenList.extend(punc)
This assumes it's OK to have all the punctuation, one per item, after the cleaned-up word, independently of where within the word the punctuation appeared.
For example, should the sentence be (here), the list would be ['here', '(', ')'].
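For instance, the same logic as a standalone function (the class plumbing from your question is omitted here, and the lowercasing from your __init__ is assumed) run on your example sentence:

import string

def tokenize(sentence):
    tokens = []
    for word in sentence.lower().split():
        punc = []
        for i in string.punctuation:
            howmany = word.count(i)
            if not howmany: continue
            word = word.replace(i, '')
            punc.extend(howmany * [i])
        tokens.append(word)
        tokens.extend(punc)
    return tokens

print(tokenize("Hello, my name is Freddy"))
# ['hello', ',', 'my', 'name', 'is', 'freddy']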
If there are stricter constraints on the ordering of things in the list, please edit your Q to express them clearly -- ideally with examples of desired input and output, too!
Upvotes: 2
Reputation: 20695
I'd suggest a different approach:
import string
import itertools

def tokenize(s):
    tokens = []
    # group consecutive characters by whether they are punctuation or not
    for k, v in itertools.groupby(s, lambda c: c in string.punctuation):
        # join each run back into a string and split it on whitespace
        tokens.extend("".join(v).split())
    return tokens
A test:
>>> tokenize("this is, a test, you know")
['this', 'is', ',', 'a', 'test', ',', 'you', 'know']
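For intuition, itertools.groupby yields one run of consecutive characters each time the key (punctuation vs. not) changes; a minimal sketch of those runs, assuming the same imports:

import string
import itertools

for is_punct, chars in itertools.groupby("hello, world!",
                                         lambda c: c in string.punctuation):
    print(is_punct, repr("".join(chars)))
# False 'hello'
# True ','
# False ' world'
# True '!'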
Upvotes: 1