ah bon

Reputation: 10021

Filter a list-of-lists column, then split (explode) it row-wise in Python

Let's say I have a dataframe with one column that holds a list of lists:

   id                                                pos
0   1  [[['Malaysia','NR'], [':','PU'], ['Natural','JJ'], ['selling price','NN']]]
1   2  [[['Spot Price','NN'], [':','PU'], ['cotton','NN'], ['India', ' NR']]]

or in dictionary format:

[{'id': 1,
  'pos': "[[['Malaysia','NR'], [':','PU'], ['Natural','JJ'], ['selling price','NN']]]"},
 {'id': 2,
  'pos': "[[['Spot Price','NN'], [':','PU'], ['cotton','NN'], ['India', ' NR']]]"}]

How can I keep only the sublists whose second element is NR or NN, then split (explode) the pos column row-wise, as follows:

   id          words part_of_speech
0   1       Malaysia             NR
1   1  selling price             NN
2   2     Spot Price             NN
3   2         cotton             NN
4   2          India             NR

How could I achieve this in Python? Thanks.

Trial code:

l = [[['Malaysia','NR'], [':','PU'], ['Natural','JJ'], ['selling price','NN']]]
for elem in l[0]:
    print(elem[1])

Out:

NR
PU
JJ
NN

Upvotes: 4

Views: 221

Answers (2)

mozway

Reputation: 260735

Here is a working solution. It explodes first and filters afterwards, which I believe should be more efficient since it avoids an explicit Python loop:

# get rid of unnecessary level of nesting
df['pos'] = df['pos'].str[0]
# explode the list
df = df.explode('pos')
# split the two items to separate columns
df['words'] = df['pos'].str[0]
df['part_of_speech'] = df['pos'].str[1]
# filter output
df.drop('pos', axis=1)[df['part_of_speech'].isin(['NR', 'NN'])]

Output:

   id          words part_of_speech
0   1       Malaysia             NR
0   1  selling price             NN
1   2     Spot Price             NN
1   2         cotton             NN
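
Note that the 'India' row is missing from this output: in the sample data its tag is ' NR' with a leading space, so isin(['NR', 'NN']) does not match it. A minimal sketch of the same approach with the whitespace stripped before filtering, assuming the tags are plain strings (the name out is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'pos': [
        [[['Malaysia', 'NR'], [':', 'PU'], ['Natural', 'JJ'], ['selling price', 'NN']]],
        [[['Spot Price', 'NN'], [':', 'PU'], ['cotton', 'NN'], ['India', ' NR']]],
    ],
})

# remove the outer level of nesting, then get one row per [word, tag] pair
df['pos'] = df['pos'].str[0]
df = df.explode('pos')
df['words'] = df['pos'].str[0]
# strip stray whitespace so that ' NR' is treated as 'NR'
df['part_of_speech'] = df['pos'].str[1].str.strip()

# keep only the wanted tags and renumber the rows
out = (
    df.loc[df['part_of_speech'].isin(['NR', 'NN']), ['id', 'words', 'part_of_speech']]
      .reset_index(drop=True)
)
print(out)
```

With the tags stripped, all five rows from the expected output in the question should be kept, including 'India'.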

Upvotes: 1

U13-Forward

Reputation: 71580

You could try this with explode:

x = df.explode('pos').explode('pos')
x = x[['id']].reset_index(drop=True).join(pd.DataFrame(x['pos'].tolist()).set_axis(['words', 'part_of_speech'], axis=1))
x.loc[x['part_of_speech'].isin(['NN', 'NR'])]

   id          words part_of_speech
0   1       Malaysia             NR
3   1  selling price             NN
4   2     Spot Price             NN
6   2         cotton             NN
7   2          India             NR

This solution scales easily to dataframes of arbitrary length. Rather than assigning columns one by one, it assigns them all at once, so it would also work for sublists of arbitrary length.
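
In the dictionary form shown in the question, pos is a string rather than a nested list, so explode would have nothing to unpack. A minimal sketch of parsing it first with ast.literal_eval from the standard library, assuming every value is a valid Python literal (records and parsed are illustrative names):

```python
import ast
import pandas as pd

records = [
    {'id': 1, 'pos': "[[['Malaysia','NR'], [':','PU'], ['Natural','JJ'], ['selling price','NN']]]"},
    {'id': 2, 'pos': "[[['Spot Price','NN'], [':','PU'], ['cotton','NN'], ['India', ' NR']]]"},
]
df = pd.DataFrame(records)

# turn each string into real nested lists (literal_eval only accepts literals)
df['pos'] = df['pos'].apply(ast.literal_eval)

# two explodes: the first removes the outer list, the second gives one pair per row
parsed = df.explode('pos').explode('pos').reset_index(drop=True)
print(parsed)
```

After this step each cell of pos holds a single [word, tag] pair, so the join and filter above can be applied unchanged.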

Upvotes: 3
