Reputation: 125
I'm working on an assignment and I'm having trouble with Pandas, which I find very different from the MATLAB I'm used to. I have a dataframe called 'main_DF' with multiple columns (Series), one of which is called 'text-message'. I would like to take the text-message of each row, tokenize it into individual words, and place the resulting list of words into another column called 'text-message-tokens'. This is what I have at the moment:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# creating the dataframe and filling it with data
# (DataFrame.append was removed in pandas 2.0, so the rows go to the constructor)
main_DF = pd.DataFrame([
    {'text-message': "I am happy", 'Label': "happy"},
    {'text-message': "I am sad", 'Label': "sad"},
])
# print
print(main_DF)
# Tokenizing text-message
tokenize = CountVectorizer().build_tokenizer()
# Add tokenized message to main_DF
main_DF['text-message-tokens'] = tokenize(main_DF['text-message'][0]) # tokenize first row
main_DF['text-message-tokens'] = tokenize(main_DF['text-message'][1]) # tokenize second row
# print
print(main_DF)
This results in the following:
I would like it to look like this in the end:
Upvotes: 0
Views: 5617
Reputation: 4060
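You can use apply to call tokenize on each text message. In the question, each assignment writes the token list of a single row into the entire column (one list element per row), so the second assignment simply overwrites the first. With apply, every row gets its own list: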
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# creating the dataframe and filling it with data
# (DataFrame.append was removed in pandas 2.0, so the rows go to the constructor)
main_DF = pd.DataFrame([
    {'text-message': "I am happy", 'Label': "happy"},
    {'text-message': "I am sad", 'Label': "sad"},
])
# print
print(main_DF)
# Tokenizing text-message
tokenize = CountVectorizer().build_tokenizer()
# Add tokenized message to main_DF
main_DF["text-message-tokens"] = main_DF["text-message"].apply(tokenize)
# print
print(main_DF)
Output:
  text-message  Label text-message-tokens
0   I am happy  happy         [am, happy]
1     I am sad    sad           [am, sad]
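Note that "I" is missing from the token lists above. This is because the default token_pattern of CountVectorizer, the regular expression (?u)\b\w\w+\b, only matches words of two or more characters. The sketch below reproduces that behaviour with the standard re module alone, so it runs without pandas or scikit-learn. The second pattern and the names DEFAULT_PATTERN, KEEP_SINGLE_PATTERN, and messages are assumptions made for illustration and are not part of the original answer.

```python
import re

# The default CountVectorizer token_pattern: words of two or more characters.
DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical alternative that also keeps one-character words such as "I".
KEEP_SINGLE_PATTERN = re.compile(r"(?u)\b\w+\b")

# Same example messages as in the question.
messages = ["I am happy", "I am sad"]

# findall returns the list of matching tokens for each message.
default_tokens = [DEFAULT_PATTERN.findall(m) for m in messages]
all_tokens = [KEEP_SINGLE_PATTERN.findall(m) for m in messages]

print(default_tokens)  # [['am', 'happy'], ['am', 'sad']]
print(all_tokens)      # [['I', 'am', 'happy'], ['I', 'am', 'sad']]
```

If one-character words matter for your task, the same alternative pattern could probably be passed as CountVectorizer(token_pattern=r"(?u)\b\w+\b") before calling build_tokenizer(), and the apply line would stay unchanged.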
Upvotes: 2