Mike_H

Reputation: 1455

How to convert text in pandas dataframe (delete punctuation, split text into one word per entry)

I am cleaning data from a .txt source. Every line of the file holds a WhatsApp message, including a date and time stamp. I have already split this into one column holding the date and time information, df['text'], and one column holding all the text data, df['text_new']. Based on this I want to create a word cloud, which is why I need every word from the conversations as a single entry in a separate pandas data frame.

I need your help with further cleaning and transformation of this data.

Let's suppose the data frame column df['text_new'] is this:

0    How are you? 
1    I am fine, we should meet this afternoon!
2    Okay let us do that. 😋

What do I want to do?

  1. Remove all punctuation from the text.
  2. Split the messages into separate words, so that each dataframe entry holds exactly one word.
  3. If possible, each smiley should be treated as a single word. If this is not possible, how can I remove them?
  4. Convert all text to lower case. There is already a solution for that, but it would be really nice to include it in the "cleaning code".

Now that you know the four steps I want to run, maybe someone has a clean and neat way to do that.

Thank you all in advance!

Upvotes: 0

Views: 1488

Answers (1)

jezrael

Reputation: 863301

Use:

import re

#https://stackoverflow.com/a/49146722
emoji_pattern = re.compile("["
                       u"\U0001F600-\U0001F64F"  # emoticons
                       u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                       u"\U0001F680-\U0001F6FF"  # transport & map symbols
                       u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                       u"\U00002702-\U000027B0"
                       u"\U000024C2-\U0001F251"
                       "]+", flags=re.UNICODE)

df['new'] = (df['text_new'].str.lower() # lowercase
                           .str.replace(r'[^\w\s]+', '', regex=True) # remove punctuation
                           .str.replace(emoji_pattern, '', regex=True) # remove emoji
                           .str.strip() # remove leading/trailing whitespace
                           .str.split()) # split on whitespace

Sample:

import re

import pandas as pd

df = pd.DataFrame({'text_new':['How are you?',
                               'I am fine, we should meet this afternoon!',
                               'Okay let us do that. \U0001f602']})


emoji_pattern = re.compile("["
                       u"\U0001F600-\U0001F64F"  # emoticons
                       u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                       u"\U0001F680-\U0001F6FF"  # transport & map symbols
                       u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                       u"\U00002702-\U000027B0"
                       u"\U000024C2-\U0001F251"
                       "]+", flags=re.UNICODE)

# regex=True is required on recent pandas versions, where str.replace
# treats the pattern as a literal string by default
df['new'] = (df['text_new'].str.lower()
                           .str.replace(r'[^\w\s]+', '', regex=True)
                           .str.replace(emoji_pattern, '', regex=True)
                           .str.strip()
                           .str.split())
print (df)
                                    text_new  \
0                               How are you?   
1  I am fine, we should meet this afternoon!   
2                     Okay let us do that. 😂   

                                                new  
0                                   [how, are, you]  
1  [i, am, fine, we, should, meet, this, afternoon]  
2                         [okay, let, us, do, that] 
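
The output above holds one list of words per row, while the question asks for one word per dataframe entry. A minimal sketch of that last step, assuming a pandas version that provides Series.explode (0.25 or later); the variable name words is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'text_new': ['How are you?',
                                'I am fine, we should meet this afternoon!',
                                'Okay let us do that. \U0001f602']})

words = (df['text_new'].str.lower()
                       # emoji are not word characters, so this pattern drops them too
                       .str.replace(r'[^\w\s]+', '', regex=True)
                       .str.split()               # one list of words per message
                       .explode()                 # one word per row
                       .reset_index(drop=True))   # fresh 0..n-1 index
print(words)
```

Calling words.value_counts() on the result should give the word frequencies that a word cloud is typically built from.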

EDIT: To keep each message as one cleaned string instead of a list of words, omit the split:

df['new'] = (df['text_new'].str.lower()
                           .str.replace(r'[^\w\s]+', '', regex=True)
                           .str.replace(emoji_pattern, '', regex=True)
                           .str.strip())
print (df)
                                    text_new  \
0                               How are you?   
1  I am fine, we should meet this afternoon!   
2                     Okay let us do that. 😂   

                                       new  
0                              how are you  
1  i am fine we should meet this afternoon  
2                      okay let us do that 
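
The code above removes emoji, whereas point 3 of the question prefers to keep each smiley as its own word. One possible sketch, not part of the original solution: the EMOJI constant is a hypothetical helper that covers only three of the ranges from emoji_pattern and will miss other symbols, so extend it as needed.

```python
import pandas as pd

# Assumed subset of emoji ranges: pictographs, emoticons, transport symbols
EMOJI = '\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF'

df = pd.DataFrame({'text_new': ['How are you?',
                                'Okay let us do that.\U0001f602\U0001f602']})

tokens = (df['text_new'].str.lower()
          # drop punctuation but keep word characters, whitespace and emoji
          .str.replace(rf'[^\w\s{EMOJI}]+', '', regex=True)
          # a token is either a run of word characters or a single emoji
          .str.findall(rf'\w+|[{EMOJI}]')
          .explode()
          .reset_index(drop=True))
print(tokens)
```

Because each emoji is matched on its own, a smiley glued to a word or repeated several times still ends up as separate entries.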

Upvotes: 2
