amanb

Reputation: 5463

Split multi-word strings into individual words for Pandas series containing list of strings

I have a Pandas DataFrame whose column values are lists of strings. Each list may contain one or more strings. Where a string contains more than one word, I'd like to split it into individual words, so that each list contains only single words. In the DataFrame below, only the sent_tags column has lists containing multi-word strings.

DataFrame:

import pandas as pd
# None disables column-width truncation (the older -1 value is deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame({"fruit_tags": [["'apples'", "'oranges'", "'pears'"], ["'melons'", "'peaches'", "'kiwis'"]], "sent_tags":[["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"], ["'melons'", "'sweeter than peaches'", "'kiwis sweeter than melons'"]]})
print(df)

    fruit_tags                        sent_tags
0   ['apples', 'oranges', 'pears']  ['apples', 'sweeter than oranges', 'pears sweeter than apples']
1   ['melons', 'peaches', 'kiwis']  ['melons', 'sweeter than peaches', 'kiwis sweeter than melons']

My attempt:

I decided to use word_tokenize from the NLTK library to break such strings into individual words. This gives me the tokenized words for one selected position in each list, but I cannot work out how to combine the tokens from every string back into a single list per row:

from nltk.tokenize import word_tokenize
df['sent_tags'].str[1].str.strip("'").apply(lambda x:word_tokenize(x.lower()))
#Output
0    [sweeter, than, oranges]
1    [sweeter, than, peaches]
Name: sent_tags, dtype: object

Desired result:

    fruit_tags                        sent_tags
0   ['apples', 'oranges', 'pears']  ['apples', 'sweeter', 'than', 'oranges', 'pears', 'sweeter', 'than', 'apples']
1   ['melons', 'peaches', 'kiwis']  ['melons', 'sweeter', 'than', 'peaches', 'kiwis', 'sweeter', 'than', 'melons']

Upvotes: 4

Views: 3839

Answers (2)

Loochie

Reputation: 2472

Another option is a nested list comprehension that splits each string on whitespace and flattens the pieces:

df['sent_tags'].apply(lambda x: [item for elem in [y.split() for y in x] for item in elem])
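Note that splitting on whitespace alone leaves the embedded single quotes from the question's data attached to the first and last word of each string. The sketch below illustrates this on one row from the question using plain Python lists, and shows a variant that strips the quotes and lowercases first. The helper names split_words and split_words_clean are made up for illustration, and itertools.chain.from_iterable is swapped in for the nested comprehension; it is not part of the answer above.

```python
from itertools import chain

def split_words(tags):
    """Split each string on whitespace and flatten the pieces into one list."""
    return list(chain.from_iterable(tag.split() for tag in tags))

def split_words_clean(tags):
    """Same idea, but strip the surrounding single quotes and lowercase first."""
    return list(chain.from_iterable(tag.strip("'").lower().split() for tag in tags))

# One row of sent_tags from the question (the quotes are part of each string)
row = ["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"]

raw = split_words(row)
clean = split_words_clean(row)

print(raw)    # quotes stay attached to the outer words
print(clean)  # quotes removed
```

Either helper could be passed to df['sent_tags'].apply(...) to process every row of the column.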

Upvotes: 0

jezrael

Reputation: 862771

Use a list comprehension that flattens the nested lists while applying the text functions strip, lower and split:

s = df['sent_tags'].apply(lambda x: [z for y in x for z in y.strip("'").lower().split()])

Or:

s = [[z for y in x for z in y.strip("'").lower().split()] for x in df['sent_tags']]

df['sent_tags'] = s

print(df) 
                       fruit_tags  \
0  ['apples', 'oranges', 'pears']   
1  ['melons', 'peaches', 'kiwis']   

                                                        sent_tags  
0  [apples, sweeter, than, oranges, pears, sweeter, than, apples]  
1  [melons, sweeter, than, peaches, kiwis, sweeter, than, melons]  
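If the nested comprehension is hard to read, it may help to see it unrolled into explicit loops. The following is only a sketch of the same logic on plain Python lists; the function name flatten_tags is hypothetical, and the data is copied from the question.

```python
def flatten_tags(tags):
    """Unrolled form of [z for y in tags for z in y.strip("'").lower().split()]."""
    words = []
    for y in tags:                                   # each string in the row's list
        for z in y.strip("'").lower().split():       # each word in that string
            words.append(z)
    return words

# The two rows of sent_tags from the question
rows = [
    ["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"],
    ["'melons'", "'sweeter than peaches'", "'kiwis sweeter than melons'"],
]

result = [flatten_tags(x) for x in rows]
for words in result:
    print(words)
```

The outer loop corresponds to "for y in x" and the inner loop to "for z in ...", in the same left-to-right order as in the comprehension. On a DataFrame, df['sent_tags'].apply(flatten_tags) would do the same per row.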

Upvotes: 5
