Reputation: 180
I have a Pandas DataFrame containing a 2D array as a column looking something like the following:
Name 2DValueList
item 1 [ [ 0.0, 1.0 ], [ 0.0, 6.0 ], [ 0.0, 2.0 ] ]
item 2 [ [ 0.0, 2.0 ], [ 0.0, 1.0 ], [ 0.0, 1.0 ] ]
item 3 [ [ 0.0, 1.0 ], [ 0.0, 3.0 ], [ 0.0, 5.0 ], [ 0.0, 1.0 ] ]
item 4
item 5 [ [ 0.0, 4.0 ], [ 0.0, 1.0 ], [ 0.0, 2.0 ] ]
The first value isn't relative to this question so I've just made them all 0. I'm only interested in the second values. Also notice the amount of pairs can vary or be empty.
I want to be able to make a new dataframe that just contains the top (largest) n elements from the array.
It would look like this for the top 2 elements:
Name 2DValueList
item 1 [ [ 0.0, 6.0 ], [ 0.0, 2.0 ] ]
item 2 [ [ 0.0, 2.0 ], [ 0.0, 1.0 ] ]
item 3 [ [ 0.0, 5.0 ], [ 0.0, 3.0 ] ]
item 4
item 5 [ [ 0.0, 4.0 ], [ 0.0, 2.0 ] ]
I would use pandas nlargest, but I'm not sure how to make it accept a column that is a 2D array.
In reality, the 2D array holds thousands of value pairs and there are tens of thousands of rows. I'm open to better ways to hold this data that would be more versatile.
Upvotes: 1
Views: 766
Reputation: 25239
If every cell of 2DValueList
is list of lists, the efficient way is using heapq.nlargest
with itemgetter
together with list comprehension
from heapq import nlargest
from operator import itemgetter
df['new_list'] = [nlargest(2, x, key=itemgetter(1)) for x in df['2DValueList']]
Out[119]:
Name 2DValueList new_list
0 item 1 [[0, 1], [0, 6], [0, 2]] [[0, 6], [0, 2]]
1 item 2 [[0, 2], [0, 1], [0, 1]] [[0, 2], [0, 1]]
2 item 3 [[0, 1], [0, 3], [0, 5]] [[0, 5], [0, 3]]
3 item 4 [[0, 4], [0, 1], [0, 2]] [[0, 4], [0, 2]]
If each cell is a numpy 2darray, the above method still works fine. However, I think using numpy argsort
would be better
df['new_list'] = [x[np.argsort(-x, axis=0)[:2,1]] for x in df['2DValueList']]
Out[128]:
Name 2DValueList new_list
0 item 1 [[0, 1], [0, 6], [0, 2]] [[0, 6], [0, 2]]
1 item 2 [[0, 2], [0, 1], [0, 1]] [[0, 2], [0, 1]]
2 item 3 [[0, 1], [0, 3], [0, 5]] [[0, 5], [0, 3]]
3 item 4 [[0, 4], [0, 1], [0, 2]] [[0, 4], [0, 2]]
Lastly, if you don't need the top n largest sub-array in sorted order, argpartition
would be faster than argsort
Upvotes: 1
Reputation: 3770
import ast
df['2DValueList'] = df['2DValueList'].apply(ast.literal_eval).apply(lambda x: sorted(x,reverse=True)[:2])
Name 2DValueList
0 item 1 [[0, 6], [0, 2]]
1 item 2 [[0, 2], [0, 1]]
2 item 3 [[0, 5], [0, 3]]
3 item 4 [[0, 4], [0, 2]]
Upvotes: 0