Reputation: 1325
I'd like to create a new column in a dataframe, which contains (in each cell) the first value in an array that is smaller than the value in each corresponding cell from an already existing column. Here's a quick explanation of how this should work, there are 3 scenarios that I have identified:
1. Create and array a
from 10 to 75 with a step of 5.
2a. If column c1
has values 0 through 10 in a cell, the result in the new column should be 0
(I found this to be tricky without adding a zero to array a
, which I'd rather not do). Column c1
is always non-negative.
2b. If c1
is 11 though 75, it should return the element of a
that is immediately smaller than the cell value.
2c. If c1
is greater than 75, it should return 75. (This is just an extension of 2b, really)
Here is my try - it does the job but I feel it's quite slow. I don't think I can use np.argmax
or np.argmin
because neither acomplishes points 2a/b/c above. Hopefully there's a faster solution out there.
import numpy as np
import pandas as pd
np.random.seed(42)
N = 10**6 #number of rows in df, change this to lower values for testing
df = pd.DataFrame({'c1': np.random.randint(1,100,N)})
a = np.arange(10,80,5)
def first_lower(value, arr):
if len(arr[arr < value]) > 0:
return arr[arr < value][-1]
else:
return 0
def do_stuff(input_df):
df = input_df.copy()
df['NewCol'] = df['c1'].apply(lambda x: first_lower(x, a))
return df
%timeit do_stuff(df)
# 11.4 s ± 881 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 0
Views: 58
Reputation: 51155
Setup
np.random.seed(1995)
df = pd.DataFrame({'c1': np.random.randint(1, 100, 10)})
a = np.arange(10,80,5)
Option 1
You can use np.select
:
c1 = df.c1.isin(range(0, 11))
c2 = df.c1.isin(range(11,76))
r1 = 0
r2 = a[np.searchsorted(a, df.c1, side='left')-1]
np.select([c1, c2], [r1, r2], 75)
Output:
array([35, 75, 35, 50, 65, 25, 75, 50, 25, 65])
Option 2
Using np.clip
:
s = np.clip(df.c1, 0, 75)
s[s.isin(range(11,75))] = a[np.searchsorted(a, df.c1)-1]
Timings:
df = pd.DataFrame({'c1': np.random.randint(1,100,10**6)})
%%timeit
c1 = df.c1.isin(range(0, 11))
c2 = df.c1.isin(range(11,76))
r1 = 0
r2 = a[np.searchsorted(a, df.c1, side='left')-1]
np.select([c1, c2], [r1, r2], 75)
# 104 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
s = np.clip(df.c1, 0, 75)
s[s.isin(range(11,75))] = a[np.searchsorted(a, df.c1)-1]
# 96 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 3
Reputation: 59711
Similar to @user3483203's but a bit more general:
import numpy as np
import pandas as pd
np.random.seed(42)
N = 20 # Just 20 for testing
df = pd.DataFrame({'c1': np.random.randint(1,100,N)})
a = np.arange(10,80,5)
idx = np.searchsorted(a, df.c1)
newcol = a[idx - 1]
newcol[idx == 0] = 0
df['newcol'] = newcol
print(df)
Output:
c1 newcol
0 52 50
1 93 75
2 15 10
3 72 70
4 61 60
5 21 20
6 83 75
7 87 75
8 75 70
9 75 70
10 88 75
11 24 20
12 3 0
13 22 20
14 53 50
15 2 0
16 88 75
17 30 25
18 38 35
19 2 0
Upvotes: 1