Split multi-word strings into individual words for Pandas series containing list of strings

Question

I have a Pandas DataFrame whose column values are lists of strings. Each list may contain one or more strings. For strings that contain more than one word, I'd like to split them into individual words, so that each list contains only single words. In the following DataFrame, only the sent_tags column has lists containing strings of variable length.

DataFrame:

import pandas as pd
# Show full column contents (None replaces the deprecated -1 in pandas >= 1.0)
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame({"fruit_tags": [["'apples'", "'oranges'", "'pears'"], ["'melons'", "'peaches'", "'kiwis'"]], "sent_tags":[["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"], ["'melons'", "'sweeter than peaches'", "'kiwis sweeter than melons'"]]})
print(df)  

    fruit_tags                        sent_tags
0   ['apples', 'oranges', 'pears']  ['apples', 'sweeter than oranges', 'pears sweeter than apples']
1   ['melons', 'peaches', 'kiwis']  ['melons', 'sweeter than peaches', 'kiwis sweeter than melons']

My attempt:

I decided to use word_tokenize from the NLTK library to break such strings into individual words. I do get the tokenized words for one selected element of each list, but I cannot combine them back into a single list for each row:

from nltk.tokenize import word_tokenize
df['sent_tags'].str[1].str.strip("'").apply(lambda x:word_tokenize(x.lower()))
#Output
0    [sweeter, than, oranges]
1    [sweeter, than, peaches]
Name: sent_tags, dtype: object

Desired result:

    fruit_tags                        sent_tags
0   ['apples', 'oranges', 'pears']  ['apples', 'sweeter', 'than', 'oranges', 'pears', 'sweeter', 'than', 'apples']
1   ['melons', 'peaches', 'kiwis']  ['melons', 'sweeter', 'than', 'peaches', 'kiwis', 'sweeter', 'than', 'melons']

jezrael · Accepted Answer

Use a nested list comprehension to flatten each list, applying the string methods strip, lower and split to every element:

s = df['sent_tags'].apply(lambda x: [z for y in x for z in y.strip("'").lower().split()])

Or:

s = [[z for y in x for z in y.strip("'").lower().split()] for x in df['sent_tags']]

df['sent_tags'] = s

print(df) 
                       fruit_tags  \
0  ['apples', 'oranges', 'pears']   
1  ['melons', 'peaches', 'kiwis']   

                                                        sent_tags  
0  [apples, sweeter, than, oranges, pears, sweeter, than, apples]  
1  [melons, sweeter, than, peaches, kiwis, sweeter, than, melons]
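
For reference, the same flattening can be written as a small named function, which may be easier to read and test than the inline lambda. This is only a sketch: the helper name split_tags is made up for illustration, and it assumes every element of each list is a string wrapped in single quotes, as in the question's data.

```python
import pandas as pd


def split_tags(tags):
    """Flatten a list of quoted, possibly multi-word strings into lowercase words.

    split_tags is a hypothetical helper name, not part of pandas.
    """
    # Outer loop walks the strings in the list; inner loop walks the
    # whitespace-separated words of each string after removing the quotes.
    return [word for tag in tags for word in tag.strip("'").lower().split()]


df = pd.DataFrame({
    "sent_tags": [
        ["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"],
        ["'melons'", "'sweeter than peaches'", "'kiwis sweeter than melons'"],
    ]
})

# Apply the helper to every row; each result is one flat list of words.
result = df["sent_tags"].apply(split_tags)
print(result.tolist())
```

Note that str.split() only splits on whitespace, so punctuation stays attached to the words; if punctuation needs to be separated as well, word_tokenize from the question could be used in place of split() inside the comprehension.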
