AlfroJang80

Reputation: 125

Pandas Dataframe - multiple values inside one cell?

I'm working on an assignment and having trouble with Pandas (I'm finding it very different from the MATLAB I'm used to). I have a dataframe called 'main_DF' with multiple columns (or series), one of which is called 'text-message'. I would like to take the text-message of each row, tokenize it into individual words, and then place the list of those words into another column called 'text-message-tokens'. This is what I have at the moment:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# creating the dataframe and filling it with data
main_DF = pd.DataFrame({
    'text-message': ["I am happy", "I am sad"],
    'Label': ["happy", "sad"],
})

# print
print(main_DF)

# Tokenizing text-message
tokenize = CountVectorizer().build_tokenizer()

# Add tokenized message to main_DF
main_DF['text-message-tokens'] = tokenize(main_DF['text-message'][0]) # tokenize first row
main_DF['text-message-tokens'] = tokenize(main_DF['text-message'][1]) # tokenize second row

# print
print(main_DF)

This results in the following:

[image: resulting dataframe]

I would like it to look like this in the end:

[image: desired dataframe]

Upvotes: 0

Views: 5617

Answers (1)

Kent Shikama

Reputation: 4060

You can use apply to call tokenize on each text message, which stores one list of tokens per row:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# creating the dataframe and filling it with data
main_DF = pd.DataFrame({
    'text-message': ["I am happy", "I am sad"],
    'Label': ["happy", "sad"],
})

# print
print(main_DF)

# Tokenizing text-message
tokenize = CountVectorizer().build_tokenizer()

# Add tokenized message to main_DF
main_DF["text-message-tokens"] = main_DF["text-message"].apply(tokenize)

# print
print(main_DF)

Output

  text-message  Label text-message-tokens
0   I am happy  happy         [am, happy]
1     I am sad    sad           [am, sad]
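
Two details of this result may be worth spelling out: why assigning a single tokenized list to the column did not work, and why "I" is missing from the tokens. The sketch below is only an illustration under stated assumptions. It uses a plain regular expression as a stand-in for the tokenizer, on the assumption that it mirrors CountVectorizer's default token_pattern (tokens of two or more word characters); the column names 'spread', 'tokens' and 'tokens-all' are made up for the example.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "text-message": ["I am happy", "I am sad"],
    "Label": ["happy", "sad"],
})

# Stand-in for CountVectorizer().build_tokenizer() (assumption: same
# pattern as the default token_pattern, which needs 2+ word characters).
default_tokenize = re.compile(r"(?u)\b\w\w+\b").findall

# Assigning one list to a column spreads its elements over the rows,
# one element per row, instead of putting the whole list in each cell.
df["spread"] = default_tokenize(df["text-message"][0])

# apply() calls the tokenizer once per row and stores each list in a cell.
df["tokens"] = df["text-message"].apply(default_tokenize)

# A looser pattern (1+ word characters) keeps single-letter words like "I".
keep_short = re.compile(r"(?u)\b\w+\b").findall
df["tokens-all"] = df["text-message"].apply(keep_short)

print(df)
```

If the single-letter tokens are wanted with scikit-learn itself, the same looser pattern can be passed as CountVectorizer(token_pattern=r"(?u)\b\w+\b") before calling build_tokenizer().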

Upvotes: 2
