Reputation: 93
I have a pandas DataFrame that looks like this:
                                        values  max_val_idx
0  np.array([-0.649626, -0.662434, -0.611351])            2
1   np.array([-0.994942, -0.990448, -1.01574])            1
2       np.array([-1.012, -1.01034, -1.02732])            0
df['values'] contains NumPy arrays with a fixed length of 3 elements.
df['max_val_idx'] contains the index of the maximum value of the corresponding array.
Since the index of the maximum element for each array is already given, what is the most efficient way to extract the maximum for each entry?
I know the data is stored in a somewhat silly way, but I didn't create it myself. Since I have a lot of data to process (roughly 50 GB, spread over several hundred pickled databases stored in a similar way), I'd like to know the most time-efficient method.
So far I have tried looping over each element of df['max_val_idx'] and using it as an index into the corresponding array in df['values']:
max_val = []
for idx, values in enumerate(df['values']):
    # look up the stored index for this row and use it to pick the element
    max_val.append(values[int(df['max_val_idx'].iloc[idx])])
Is there any faster alternative to this?
Upvotes: 3
Views: 10446
Reputation: 30414
I would just forget the 'max_val_idx' column. I don't think it saves time, and it actually makes the syntax more awkward. Sample data:
df = pd.DataFrame({ 'x': range(3) }).applymap( lambda x: np.random.randn(3) )
x
0 [-1.17106202376, -1.61211460669, 0.0198122724315]
1 [0.806819945736, 1.49139051675, -0.21434675401]
2 [-0.427272615966, 0.0939459129359, 0.496474566...
You could extract the max like this:
df.applymap( lambda x: x.max() )
x
0 0.019812
1 1.491391
2 0.496475
But generally speaking, life is easier if you have one number per cell. If each cell has an array of length 3, you could rearrange like this:
for i, v in enumerate(list('abc')): df[v] = df.x.map( lambda x: x[i] )
df = df[list('abc')]
a b c
0 -1.171062 -1.612115 0.019812
1 0.806820 1.491391 -0.214347
2 -0.427273 0.093946 0.496475
And then do a standard pandas operation:
df.apply( max, axis=1 )
0 0.019812
1 1.491391
2 0.496475
Admittedly, this is not much easier than above, but overall the data will be much easier to work with in this form.
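The reshaping above can also be done in one step, without the explicit loop over column names. This is only a minimal sketch, assuming every array in the column has the same length (3 here); the hard-coded values and the column names a, b, c are just for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed values): one length-3 array per cell.
df = pd.DataFrame({"x": [np.array([-1.17, -1.61, 0.02]),
                         np.array([0.81, 1.49, -0.21]),
                         np.array([-0.43, 0.09, 0.50])]})

# Expand the array column into one numeric column per element.
wide = pd.DataFrame(df["x"].tolist(), index=df.index, columns=list("abc"))

# Row-wise maximum using the built-in vectorized reduction.
row_max = wide.max(axis=1)
```

Here wide.max(axis=1) should give the same result as apply(max, axis=1), but it uses the vectorized reduction instead of calling Python's max once per row.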
Upvotes: 4
Reputation: 1430
I don't know how the speed of this will compare, since I'm constructing a 2D matrix of all the rows, but here's a possible solution:
>>> np.choose(df['max_val_idx'], np.array(df['values'].tolist()).T)
0 -0.611351
1 -0.990448
2 -1.012000
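A variation on the same idea is to use plain NumPy integer-array indexing with the stored index instead of np.choose. This is a sketch rather than a benchmarked recommendation; it assumes all arrays have the same length and that 'max_val_idx' holds valid integer positions. The new column name max_val is my own choice:

```python
import numpy as np
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "values": [np.array([-0.649626, -0.662434, -0.611351]),
               np.array([-0.994942, -0.990448, -1.01574]),
               np.array([-1.012, -1.01034, -1.02732])],
    "max_val_idx": [2, 1, 0],
})

# Stack the per-row arrays into a single (n_rows, 3) array.
arr = np.stack(df["values"].tolist())
idx = df["max_val_idx"].to_numpy(dtype=int)

# Pick arr[i, idx[i]] for every row i in one vectorized step.
df["max_val"] = arr[np.arange(len(arr)), idx]
```

As with np.choose, the cost is building the 2D array once; after that the lookup itself does not loop in Python.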
Upvotes: 2