Reputation: 5463
I have a Pandas DataFrame whose column values are lists of strings. Each list may contain one or more strings. For strings that contain more than one word, I'd like to split them into individual words, so that each list contains only single words. In the following DataFrame, only the sent_tags column has lists containing strings of variable length.
DataFrame:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
df = pd.DataFrame({"fruit_tags": [["'apples'", "'oranges'", "'pears'"], ["'melons'", "'peaches'", "'kiwis'"]], "sent_tags":[["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"], ["'melons'", "'sweeter than peaches'", "'kiwis sweeter than melons'"]]})
print(df)
fruit_tags sent_tags
0 ['apples', 'oranges', 'pears'] ['apples', 'sweeter than oranges', 'pears sweeter than apples']
1 ['melons', 'peaches', 'kiwis'] ['melons', 'sweeter than peaches', 'kiwis sweeter than melons']
My attempt:
I decided to use word_tokenize from the NLTK library to break such strings into individual words. I do get the tokenized words for one particular position inside the list, but I cannot combine them into a single list for each row:
from nltk.tokenize import word_tokenize
df['sent_tags'].str[1].str.strip("'").apply(lambda x:word_tokenize(x.lower()))
#Output
0 [sweeter, than, oranges]
1 [sweeter, than, peaches]
Name: sent_tags, dtype: object
Desired result:
fruit_tags sent_tags
0 ['apples', 'oranges', 'pears'] ['apples', 'sweeter', 'than', 'oranges', 'pears', 'sweeter', 'than', 'apples']
1 ['melons', 'peaches', 'kiwis'] ['melons', 'sweeter', 'than', 'peaches', 'kiwis', 'sweeter', 'than', 'melons']
Upvotes: 4
Views: 3839
Reputation: 2472
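Another possible method is to split each string on whitespace and flatten the result. Note that this keeps the quote characters embedded in the strings and does not lowercase anything: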
df['sent_tags'].apply(lambda x: [item for elem in [y.split() for y in x] for item in elem])
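To see what the split-and-flatten comprehension does to the embedded quotes, here is a minimal sketch on one row of sent_tags from the question, in plain Python without pandas. The variable names row, split_only and stripped are illustrative and not part of the original answer.

```python
# One row of sent_tags from the question, quote characters included.
row = ["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"]

# Split only: the quotes stay attached to the first and last word of each string.
split_only = [item for elem in [y.split() for y in row] for item in elem]

# Strip the surrounding quotes first, then split on whitespace.
stripped = [item for y in row for item in y.strip("'").split()]

print(split_only)
print(stripped)
```

If the quotes are unwanted, stripping them before the split, as in the second comprehension, is probably the simplest adjustment.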
Upvotes: 0
Reputation: 862771
Use a list comprehension with flattening, applying the text functions strip, lower and split to each string:
s = df['sent_tags'].apply(lambda x: [z for y in x for z in y.strip("'").lower().split()])
Or:
s = [[z for y in x for z in y.strip("'").lower().split()] for x in df['sent_tags']]
df['sent_tags'] = s
print(df)
fruit_tags \
0 ['apples', 'oranges', 'pears']
1 ['melons', 'peaches', 'kiwis']
sent_tags
0 [apples, sweeter, than, oranges, pears, sweeter, than, apples]
1 [melons, sweeter, than, peaches, kiwis, sweeter, than, melons]
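The same strip, lower and split steps can also be written as a small named function, which may be easier to test on its own. This is only a sketch: split_tags is a hypothetical helper name, and itertools.chain.from_iterable is swapped in for the nested comprehension to do the flattening. It runs on plain lists here so that it does not depend on pandas.

```python
from itertools import chain

def split_tags(tags):
    """Strip quotes, lowercase and split every string, returning one flat list."""
    return list(chain.from_iterable(t.strip("'").lower().split() for t in tags))

# The two rows of sent_tags from the question.
rows = [
    ["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"],
    ["'melons'", "'sweeter than peaches'", "'kiwis sweeter than melons'"],
]

result = [split_tags(r) for r in rows]
print(result)

# With the DataFrame from the question, the helper would be applied as:
# df['sent_tags'] = df['sent_tags'].apply(split_tags)
```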
Upvotes: 5