Freddy-FazBear

Reputation: 305

Python Not Removing Char From String

I've tried several ways of removing the extra punctuation from the string, but none of them work.

import string

class NLP:

    def __init__(self,sentence):

        self.sentence  = sentence.lower()

        self.tokenList = []


    # problem: the punctuation is still included in the word
    def tokenize(self, sentence):

        for word in sentence.split():
            self.tokenList.append(word)

            for i in string.punctuation:
                if(i in word):
                    word.strip(i)
                    self.tokenList.append(i)

A quick explanation of the code: it is supposed to split the sentence into words and punctuation marks and store each one as a separate item in a list. But when punctuation is next to a word, it stays attached to the word. Below is an example where a comma remains grouped with the word 'hello':

['hello,' , ',' , 'my' , 'name' , 'is' , 'freddy']
      #^
     #there's the problem

Upvotes: 0

Views: 71

Answers (2)

Alex Martelli

Reputation: 881675

A Python string is immutable. Therefore, word.strip(i) does not "change word in place" as you seem to assume; rather, it returns a copy of word, modified by the .strip(i) operation -- which removes only from the ends of the string, so that's not what you want either (unless you know the punctuation occurs in the word in a peculiar order).
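To make both points concrete, here is a minimal sketch (the variable names are only for illustration) showing that strip returns a new string instead of changing the original, and that it only trims characters at the ends:

```python
word = "hello,"

# strip() returns a new string; the original is left untouched.
stripped = word.strip(",")
print(word)      # hello,
print(stripped)  # hello

# strip() only trims the ends, so punctuation inside the word survives.
inner = "don't!".strip("'")
print(inner)     # don't!
```

So a call like word.strip(i) on its own line has no visible effect unless the result is assigned back to a name.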

def tokenize(self, sentence):
    for word in sentence.split():
        punc = []
        for i in string.punctuation:
            howmany = word.count(i)
            if not howmany: continue
            word = word.replace(i, '')
            punc.extend(howmany*[i])
        self.tokenList.append(word)
        self.tokenList.extend(punc)

This assumes it's OK to have all the punctuation, one per item, after the cleaned-up word, independently of where within the word the punctuation appeared.

For example, should the sentence be (here), the list would be ['here', '(', ')'].
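As a quick check of that behaviour, here is a standalone sketch of the same logic written as a plain function. The name tokenize_punct_last is made up for this example, and it returns the list instead of storing it on self:

```python
import string

def tokenize_punct_last(sentence):
    """Split on whitespace, then move each word's punctuation after it."""
    tokens = []
    for word in sentence.split():
        punc = []
        for ch in string.punctuation:
            count = word.count(ch)
            if count:
                # Strings are immutable, so rebind word to the cleaned copy.
                word = word.replace(ch, "")
                punc.extend([ch] * count)
        tokens.append(word)
        tokens.extend(punc)
    return tokens

print(tokenize_punct_last("(here)"))
# ['here', '(', ')']
print(tokenize_punct_last("hello, my name is freddy"))
# ['hello', ',', 'my', 'name', 'is', 'freddy']
```

One thing to watch for: a "word" made up entirely of punctuation (a lone dash, say) leaves an empty string in the list, so you may want to skip empty words before appending.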

If there are stricter constraints on the ordering of things in the list, please edit your Q to express them clearly -- ideally with examples of desired input and output, too!

Upvotes: 2

jme

Reputation: 20695

I'd suggest a different approach:

import string
import itertools

def tokenize(s):
    tokens = []
    for k,v in itertools.groupby(s, lambda c: c in string.punctuation):
        tokens.extend("".join(v).split())
    return tokens

A test:

>>> tokenize("this is, a test, you know")
['this', 'is', ',', 'a', 'test', ',', 'you', 'know']
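If it helps to see why this works, the sketch below prints the runs that groupby produces before they are split on whitespace. The names s and runs are only for illustration:

```python
import itertools
import string

s = "this is, a test"

# groupby yields consecutive runs of characters that share the same key,
# here "is this character punctuation?".
runs = [
    (is_punct, "".join(chars))
    for is_punct, chars in itertools.groupby(s, lambda c: c in string.punctuation)
]
print(runs)
# [(False, 'this is'), (True, ','), (False, ' a test')]
```

Each run is then passed through split(), which breaks the non-punctuation runs into words and leaves a punctuation run as a single token. That also means consecutive punctuation such as "..." comes out as one token rather than three.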

Upvotes: 1
