bogdanCsn
bogdanCsn

Reputation: 1325

Optimization - return first value in array that is smaller then cell value (for each row)

I'd like to create a new column in a dataframe, which contains (in each cell) the first value in an array that is smaller than the value in each corresponding cell from an already existing column. Here's a quick explanation of how this should work, there are 3 scenarios that I have identified:
1. Create and array a from 10 to 75 with a step of 5.
2a. If column c1 has values 0 through 10 in a cell, the result in the new column should be 0 (I found this to be tricky without adding a zero to array a, which I'd rather not do). Column c1 is always non-negative.
2b. If c1 is 11 though 75, it should return the element of a that is immediately smaller than the cell value.
2c. If c1 is greater than 75, it should return 75. (This is just an extension of 2b, really)

Here is my try - it does the job but I feel it's quite slow. I don't think I can use np.argmax or np.argmin because neither acomplishes points 2a/b/c above. Hopefully there's a faster solution out there.

import numpy as np
import pandas as pd

np.random.seed(42)
N = 10**6  #number of rows in df, change this to lower values for testing
df = pd.DataFrame({'c1': np.random.randint(1,100,N)}) 

a = np.arange(10,80,5)

def first_lower(value, arr):
    if len(arr[arr < value]) > 0:
        return arr[arr < value][-1]  
    else:
        return 0

def do_stuff(input_df):
    df = input_df.copy()
    df['NewCol'] = df['c1'].apply(lambda x: first_lower(x, a))
    return df

%timeit do_stuff(df)
# 11.4 s ± 881 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 0

Views: 58

Answers (2)

user3483203
user3483203

Reputation: 51155

Setup

np.random.seed(1995)
df = pd.DataFrame({'c1': np.random.randint(1, 100, 10)})
a = np.arange(10,80,5)

Option 1
You can use np.select:

c1 = df.c1.isin(range(0, 11))
c2 = df.c1.isin(range(11,76))

r1 = 0
r2 = a[np.searchsorted(a, df.c1, side='left')-1]

np.select([c1, c2], [r1, r2], 75)

Output:

array([35, 75, 35, 50, 65, 25, 75, 50, 25, 65])

Option 2
Using np.clip:

s = np.clip(df.c1, 0, 75)
s[s.isin(range(11,75))] = a[np.searchsorted(a, df.c1)-1]

Timings:

df = pd.DataFrame({'c1': np.random.randint(1,100,10**6)})
%%timeit
c1 = df.c1.isin(range(0, 11))
c2 = df.c1.isin(range(11,76))
r1 = 0
r2 = a[np.searchsorted(a, df.c1, side='left')-1]    
np.select([c1, c2], [r1, r2], 75)

# 104 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
s = np.clip(df.c1, 0, 75)
s[s.isin(range(11,75))] = a[np.searchsorted(a, df.c1)-1]

# 96 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 3

javidcf
javidcf

Reputation: 59711

Similar to @user3483203's but a bit more general:

import numpy as np
import pandas as pd

np.random.seed(42)
N = 20  # Just 20 for testing
df = pd.DataFrame({'c1': np.random.randint(1,100,N)}) 
a = np.arange(10,80,5)
idx = np.searchsorted(a, df.c1)
newcol = a[idx - 1]
newcol[idx == 0] = 0
df['newcol'] = newcol
print(df)

Output:

    c1  newcol
0   52      50
1   93      75
2   15      10
3   72      70
4   61      60
5   21      20
6   83      75
7   87      75
8   75      70
9   75      70
10  88      75
11  24      20
12   3       0
13  22      20
14  53      50
15   2       0
16  88      75
17  30      25
18  38      35
19   2       0

Upvotes: 1

Related Questions