Rory LM
Rory LM

Reputation: 180

Pandas working with 2D array inside dataframe

I have a Pandas DataFrame containing a 2D array as a column looking something like the following:

Name    2DValueList
item 1  [ [ 0.0, 1.0 ], [ 0.0, 6.0 ], [ 0.0, 2.0 ] ]
item 2  [ [ 0.0, 2.0 ], [ 0.0, 1.0 ], [ 0.0, 1.0 ] ]
item 3  [ [ 0.0, 1.0 ], [ 0.0, 3.0 ], [ 0.0, 5.0 ], [ 0.0, 1.0 ] ]
item 4  
item 5  [ [ 0.0, 4.0 ], [ 0.0, 1.0 ], [ 0.0, 2.0 ] ]

The first value isn't relative to this question so I've just made them all 0. I'm only interested in the second values. Also notice the amount of pairs can vary or be empty.

I want to be able to make a new dataframe that just contains the top (largest) n elements from the array.

It would look like this for the top 2 elements:

Name    2DValueList
item 1  [ [ 0.0, 6.0 ], [ 0.0, 2.0 ] ]
item 2  [ [ 0.0, 2.0 ], [ 0.0, 1.0 ] ]
item 3  [ [ 0.0, 5.0 ], [ 0.0, 3.0 ] ]
item 4
item 5  [ [ 0.0, 4.0 ], [ 0.0, 2.0 ] ]

I would use pandas nlargest, but I'm not sure how to make it accept a column that is a 2D array.

In reality, the 2D array holds thousands of value pairs and there are tens of thousands of rows. I'm open to better ways to hold this data that would be more versatile.

Upvotes: 1

Views: 766

Answers (2)

Andy L.
Andy L.

Reputation: 25239

If every cell of 2DValueList is list of lists, the efficient way is using heapq.nlargest with itemgetter together with list comprehension

from heapq import nlargest
from operator import itemgetter

df['new_list'] = [nlargest(2, x, key=itemgetter(1)) for x in df['2DValueList']]

Out[119]:
     Name               2DValueList            new_list
0  item 1  [[0, 1], [0, 6], [0, 2]]    [[0, 6], [0, 2]]
1  item 2  [[0, 2], [0, 1], [0, 1]]    [[0, 2], [0, 1]]
2  item 3  [[0, 1], [0, 3], [0, 5]]    [[0, 5], [0, 3]]
3  item 4  [[0, 4], [0, 1], [0, 2]]    [[0, 4], [0, 2]]

If each cell is a numpy 2darray, the above method still works fine. However, I think using numpy argsort would be better

df['new_list'] = [x[np.argsort(-x, axis=0)[:2,1]] for x in df['2DValueList']]

Out[128]:
     Name               2DValueList            new_list
0  item 1  [[0, 1], [0, 6], [0, 2]]    [[0, 6], [0, 2]]
1  item 2  [[0, 2], [0, 1], [0, 1]]    [[0, 2], [0, 1]]
2  item 3  [[0, 1], [0, 3], [0, 5]]    [[0, 5], [0, 3]]
3  item 4  [[0, 4], [0, 1], [0, 2]]    [[0, 4], [0, 2]]

Lastly, if you don't need the top n largest sub-array in sorted order, argpartition would be faster than argsort

Upvotes: 1

iamklaus
iamklaus

Reputation: 3770

import ast

df['2DValueList'] = df['2DValueList'].apply(ast.literal_eval).apply(lambda x: sorted(x,reverse=True)[:2])

     Name       2DValueList
0  item 1  [[0, 6], [0, 2]]
1  item 2  [[0, 2], [0, 1]]
2  item 3  [[0, 5], [0, 3]]
3  item 4  [[0, 4], [0, 2]]

Upvotes: 0

Related Questions