Reputation: 1012
I have email messages in a pandas data frame. Before applying sent_tokenize, I could remove the punctuation like this.
def removePunctuation(fullCorpus):
    # regex=True is required in recent pandas versions, where
    # str.replace no longer treats the pattern as a regex by default
    punctuationRemoved = fullCorpus['text'].str.replace(r'[^\w\s]+', '', regex=True)
    return punctuationRemoved
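For reference, a minimal self-contained sketch of that pre-tokenization approach (the column name 'text' and the sample strings are taken from the question; the tiny DataFrame here is made up for illustration):

```python
import pandas as pd

def removePunctuation(fullCorpus):
    # strip every character that is neither a word character nor whitespace;
    # regex=True makes the pattern an explicit regular expression
    return fullCorpus['text'].str.replace(r'[^\w\s]+', '', regex=True)

corpus = pd.DataFrame({'text': ["WINNER!!", "I HAVE A DATE ON SUNDAY WITH WILL!"]})
print(removePunctuation(corpus).tolist())
# → ['WINNER', 'I HAVE A DATE ON SUNDAY WITH WILL']
```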
After applying sent_tokenize, the data frame looks like below. How can I remove the punctuation while keeping the sentences tokenized in the lists?
def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    return sent_tokenized
Sample of data frame after tokenizing into sentences
[Nah I don't think he goes to usf, he lives around here though]
[Even my brother is not like to speak with me., They treat me like aids patent.]
[I HAVE A DATE ON SUNDAY WITH WILL!, !]
[As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers., Press *9 to copy your friends Callertune]
[WINNER!!, As a valued network customer you have been selected to receivea £900 prize reward!, To claim call 09061701461., Claim code KL341., Valid 12 hours only.]
Upvotes: 1
Views: 604
Reputation: 18218
You can try the following function: apply iterates over each sentence, a generator walks the sentence's characters and keeps only those not in string.punctuation, and ''.join reassembles the cleaned sentence. You also need map, since the cleaning function must be applied to every sentence in each row's list:
def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    # remove punctuation characters from a single sentence
    f = lambda sent: ''.join(ch for w in sent for ch in w
                             if ch not in string.punctuation)
    # apply the cleaner to every sentence in each row's list
    sent_tokenized = sent_tokenized.apply(lambda row: list(map(f, row)))
    return sent_tokenized
Note: you will need import string for string.punctuation.
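As a quick check, here is the inner per-sentence cleaner in isolation, run on sample sentences from the question (this avoids the nltk dependency, since the list stands in for what sent_tokenize would produce):

```python
import string

# per-sentence cleaner: drop every character found in string.punctuation
f = lambda sent: ''.join(ch for ch in sent if ch not in string.punctuation)

row = ["WINNER!!", "To claim call 09061701461.", "Claim code KL341."]
print(list(map(f, row)))
# → ['WINNER', 'To claim call 09061701461', 'Claim code KL341']
```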
Upvotes: 1