Kabilesh

Reputation: 1012

Remove punctuation from list of sentences in a pandas data frame

I have email messages in a pandas data frame. Before applying sent_tokenize, I could remove the punctuation like this:

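def removePunctuation(fullCorpus):
    # regex=True is needed in pandas 2.0+, where str.replace treats the pattern literally by default
    punctuationRemoved = fullCorpus['text'].str.replace(r'[^\w\s]+', '', regex=True)
    return punctuationRemoved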

After applying sent_tokenize, the data frame looks like the sample below. How can I remove the punctuation while keeping the sentences tokenized in their lists?


from nltk.tokenize import sent_tokenize

def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    return sent_tokenized

Sample of data frame after tokenizing into sentences

[Nah I don't think he goes to usf, he lives around here though]                                                                                                                                                                                                                          

[Even my brother is not like to speak with me., They treat me like aids patent.]                                                                                                                                                                                                         

[I HAVE A DATE ON SUNDAY WITH WILL!, !]                                                                                                                                                                                                                                                  

[As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers., Press *9 to copy your friends Callertune]                                                                                                                      

[WINNER!!, As a valued network customer you have been selected to receivea £900 prize reward!, To claim call 09061701461., Claim code KL341., Valid 12 hours only.]

Upvotes: 1

Views: 604

Answers (1)

niraj

Reputation: 18218

You can try the following function. The outer apply processes each row (a list of sentences), and map applies a helper to every sentence in that list. The helper iterates over the characters of a sentence, drops any that appear in string.punctuation, and rejoins the rest with ''.join:

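import string
from nltk.tokenize import sent_tokenize

def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    # Remove every punctuation character from a single sentence string
    f = lambda sent: ''.join(ch for ch in sent if ch not in string.punctuation)

    # Apply f to each sentence in each row, keeping the list structure
    sent_tokenized = sent_tokenized.apply(lambda row: list(map(f, row)))
    return sent_tokenized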

Note that you need to import string in order to use string.punctuation.
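
One thing to watch for: a sentence that consists only of punctuation, such as the "!" in the third sample row, becomes an empty string after cleaning. Also, string.punctuation covers ASCII punctuation only, so a character like "£" is kept, whereas the regex from the question removes it. Below is a minimal sketch of a variant that reuses the question's regex on an already tokenized list and optionally drops empty results. The names strip_punctuation and drop_empty are made up for illustration and are not part of pandas or NLTK.

```python
import re

# Same pattern as in the question: anything that is neither a word character nor whitespace
PUNCT_RE = re.compile(r'[^\w\s]+')

def strip_punctuation(sentences, drop_empty=True):
    """Remove punctuation from each sentence in a list, keeping the list structure."""
    cleaned = [PUNCT_RE.sub('', s) for s in sentences]
    if drop_empty:
        # Discard sentences that were nothing but punctuation (hypothetical option)
        cleaned = [s for s in cleaned if s.strip()]
    return cleaned

row = ["I HAVE A DATE ON SUNDAY WITH WILL!", "!"]
print(strip_punctuation(row))  # ['I HAVE A DATE ON SUNDAY WITH WILL']
```

Assuming the tokenized column is a Series of lists, this could be applied per row with sent_tokenized.apply(strip_punctuation).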

Upvotes: 1
