Reputation: 1012
I have email messages in a pandas data frame. Before applying sent_tokenize, I could remove the punctuation like this.
def removePunctuation(fullCorpus):
    # regex=True is required in recent pandas versions, where
    # str.replace no longer treats the pattern as a regex by default
    punctuationRemoved = fullCorpus['text'].str.replace(r'[^\w\s]+', '', regex=True)
    return punctuationRemoved
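For reference, a minimal self-contained sketch of that pre-tokenization approach (the column name 'text' and the sample strings are taken from the question; the tiny DataFrame here is made up for illustration):

```python
import pandas as pd

def removePunctuation(fullCorpus):
    # strip every character that is neither a word character nor whitespace;
    # regex=True makes the pattern an explicit regular expression
    return fullCorpus['text'].str.replace(r'[^\w\s]+', '', regex=True)

corpus = pd.DataFrame({'text': ["WINNER!!", "I HAVE A DATE ON SUNDAY WITH WILL!"]})
print(removePunctuation(corpus).tolist())
# → ['WINNER', 'I HAVE A DATE ON SUNDAY WITH WILL']
```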
After applying sent_tokenize, the data frame looks like below. How can I remove the punctuation while keeping the sentences tokenized in the lists?
def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    return sent_tokenized
Sample of data frame after tokenizing into sentences
[Nah I don't think he goes to usf, he lives around here though]
[Even my brother is not like to speak with me., They treat me like aids patent.]
[I HAVE A DATE ON SUNDAY WITH WILL!, !]
[As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers., Press *9 to copy your friends Callertune]
[WINNER!!, As a valued network customer you have been selected to receivea £900 prize reward!, To claim call 09061701461., Claim code KL341., Valid 12 hours only.]
Upvotes: 1
Views: 604
Reputation: 18218
You can try the following function: apply iterates over each sentence, a generator walks the sentence's characters and keeps only those not in string.punctuation, and ''.join reassembles the cleaned sentence. You also need map, since the cleaning function must be applied to every sentence in each row's list:
def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    # remove punctuation characters from a single sentence
    f = lambda sent: ''.join(ch for w in sent for ch in w
                             if ch not in string.punctuation)
    # apply the cleaner to every sentence in each row's list
    sent_tokenized = sent_tokenized.apply(lambda row: list(map(f, row)))
    return sent_tokenized
Note: you will need import string for string.punctuation.
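As a quick check, here is the inner per-sentence cleaner in isolation, run on sample sentences from the question (this avoids the nltk dependency, since the list stands in for what sent_tokenize would produce):

```python
import string

# per-sentence cleaner: drop every character found in string.punctuation
f = lambda sent: ''.join(ch for ch in sent if ch not in string.punctuation)

row = ["WINNER!!", "To claim call 09061701461.", "Claim code KL341."]
print(list(map(f, row)))
# → ['WINNER', 'To claim call 09061701461', 'Claim code KL341']
```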
Upvotes: 1