Adam
Adam

Reputation: 581

calculate minimum value from row calculation in pandas

I have a list of throughput values in one column of a pandas DataFrame. I want to calculate a value's variation from a threshold as a percent of that threshold.

So if I have thresholds at 2 and 7. I'd want to calculate the minimum of the following functions.

(df.throughput - 2)/2  
(df.throughput - 7)/7

x   throughtput
1   3
4   4
7   9

I tried using the following to create a new column, but I keep getting an error. I fell like there's something really obvious i'm missing here.

df['pct'] =  np.min(  (df.throughput-2)/2,  (df.throughput - 7)/7  )
df['pct'] =  np.min(  (df['throughput']-2)/2,  (df['throughput'] - 7)/7  )
'Series' objects are mutable, thus they cannot be hashed

Upvotes: 1

Views: 824

Answers (2)

jezrael
jezrael

Reputation: 863731

You can create new Series, compare them and use numpy.where for new column:

a = (df['throughtput'] - 2)/2
b = (df['throughtput'] - 7)/7
df['pct'] = np.where(a < b, a, b)
print (df)
   x  throughtput       pct
0  1            3 -0.571429
1  4            4 -0.428571
2  7            9  0.285714

Solution with concat and DataFrame.min:

a = (df['throughtput'] - 2)/2
b = (df['throughtput'] - 7)/7
df['pct'] = pd.concat([a,b], axis=1).min(axis=1)
print (df)
   x  throughtput       pct
0  1            3 -0.571429
1  4            4 -0.428571
2  7            9  0.285714

Or create 2d array by numpy.column_stack and get min by numpy.amin:

a = (df['throughtput'] - 2)/2
b = (df['throughtput'] - 7)/7
df['pct'] = np.amin(np.column_stack([a,b]), axis=1)
print (df)
   x  throughtput       pct
0  1            3 -0.571429
1  4            4 -0.428571
2  7            9  0.285714

a = (df['throughtput'].values - 2)/2
b = (df['throughtput'].values - 7)/7
df['pct'] = np.amin(np.column_stack([a,b]), axis=1)
print (df)
   x  throughtput       pct
0  1            3 -0.571429
1  4            4 -0.428571
2  7            9  0.285714

Timings:

N = 1000000
#N = 10
df = pd.DataFrame({'x': np.random.randint(10,size=N),
                   'throughtput':np.random.randint(10,size=N)})
print (df)

In [50]: %%timeit
    ...: a = (df['throughtput'] - 2)/2
    ...: b = (df['throughtput'] - 7)/7
    ...: df['pct'] = np.where(a < b, a, b)
    ...: 
10 loops, best of 3: 21.1 ms per loop

In [51]: %%timeit
    ...: a = (df['throughtput'] - 2)/2
    ...: b = (df['throughtput'] - 7)/7
    ...: df['pct'] = pd.concat([a,b], axis=1).min(axis=1)
    ...: 
10 loops, best of 3: 56.4 ms per loop

In [52]: %%timeit
    ...: a = (df['throughtput'] - 2)/2
    ...: b = (df['throughtput'] - 7)/7
    ...: df['pct'] = np.amin(np.column_stack([a,b]), axis=1)
    ...: 
10 loops, best of 3: 35.1 ms per loop


In [53]: %%timeit
    ...: a = (df['throughtput'].values - 2)/2
    ...: b = (df['throughtput'].values - 7)/7
    ...: df['pct'] = np.amin(np.column_stack([a,b]), axis=1)
    ...: 
10 loops, best of 3: 38.5 ms per loop

Another answer by Tiny.D:

In [54]: %%timeit
    ...: df['cal_1'] = (df.throughtput - 2)/2
    ...: df['cal_2'] = (df.throughtput - 7)/7
    ...: df['pct'] = df[['cal_1','cal_2']].min(axis=1)
    ...: df[['x','throughtput','pct']]
    ...: 
10 loops, best of 3: 73.7 ms per loop

In [55]: %%timeit
    ...: df['pct']=[min(i,j) for i,j in (zip((df.throughtput - 2)/2,(df.throughtput - 7)/7))]
    ...: 
1 loop, best of 3: 435 ms per loop

Upvotes: 1

Tiny.D
Tiny.D

Reputation: 6556

Try this:

df['pct'] = [min(i,j) for i,j in (zip((df.throughput - 2)/2,(df.throughput - 7)/7))]
df

output will be:

    throughput  x   pct
0   3           1   -0.571429
1   4           4   -0.428571
2   9           7   0.285714

or:

df['cal_1'] = (df.throughput - 2)/2
df['cal_2'] = (df.throughput - 7)/7
df['pct'] = df[['cal_1','cal_2']].min(axis=1)
df[['x','throughput','pct']]

output will be:

    x   throughput  pct
0   1   3           -0.571429
1   4   4           -0.428571
2   7   9           0.285714

Upvotes: 0

Related Questions