Optimizing a conditional column update in pandas: is there a better way?

I am a newbie and I need some insight. Say I have a pandas dataframe as follows:

import numpy as np
import pandas as pd

temp = pd.DataFrame()
temp['A'] = np.random.rand(100)
temp['B'] = np.random.rand(100)
temp['C'] = np.random.rand(100)

I need to write a function that sets column "C" to 0 in every row where the value of "A" is bigger than 0.5. In all other rows, "C" should hold the product of "A" and "B" from that same row.

What I have so far:

A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B

It works just as I want it to, but I am not sure whether there is a faster way to implement this. I am especially skeptical about the slicing: it feels redundant to use that many slices. I couldn't find any other solution, though, since I still have to write 0 into the C rows where A is bigger than 0.5.

Or is there a way to slice only the part that is needed, perform the calculations, and then somehow remember the indices so the results can be put back into the original dataframe at the corresponding rows?

Upvotes: 1

Views: 42

Answers (1)

Chris

Reputation: 29732

One way using numpy.where:

temp["C"] = np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
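To make the behaviour of that one-liner easy to check by eye, here is a minimal sketch on a tiny hand-picked frame. The frame `demo` and its values are made up for illustration, and the second variant using `Series.where` is an alternative spelling that is not part of the timings in this answer.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with hand-picked values.
demo = pd.DataFrame({
    "A": [0.2, 0.7, 0.4, 0.9],
    "B": [0.5, 0.5, 0.25, 0.1],
    "C": [9.0, 9.0, 9.0, 9.0],
})

# Build the boolean mask once and reuse it.
mask = demo["A"] < 0.5

# Variant 1: numpy.where picks A*B where the mask is True, 0 elsewhere.
c_numpy = np.where(mask, demo["A"] * demo["B"], 0)

# Variant 2: Series.where keeps values where the mask is True
# and substitutes the second argument elsewhere.
c_pandas = (demo["A"] * demo["B"]).where(mask, 0)

demo["C"] = c_numpy
print(demo)
```

Rows 0 and 2 have A below 0.5, so they receive A*B; rows 1 and 3 receive 0.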

Benchmark (about 4.7x faster on the 100-row sample, and the gap widens as the data grows):

# With given sample of 100 rows

%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B

# 819 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)

# 174 µs ± 455 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Benchmark on larger data (1,000,000 rows, about 7x faster):

temp = pd.DataFrame()
temp['A'] = np.random.rand(1000000)
temp['B'] = np.random.rand(1000000)
temp['C'] = np.random.rand(1000000)

%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B

# 35.2 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)

# 5.16 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
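The %timeit and %%timeit magics above only run inside IPython or Jupyter. As a rough sketch for a plain script, a similar comparison can be set up with the standard `timeit` module. The helper names `make_frame`, `with_loc` and `with_where` are hypothetical, and the 10,000-row size and repeat count are arbitrary. Unlike the timing above, the `np.where` variant here also includes the assignment back to column C, and the `.loc` variant fills C with a float zero, so the numbers will not match the ones quoted in this answer.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n, seed=0):
    """Build an n-row frame of uniform random floats (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.random((n, 3)), columns=["A", "B", "C"])


def with_loc(df):
    """The question's approach: slice, multiply, write back with .loc."""
    a = df.loc[df["A"] < 0.5, "A"].values
    b = df.loc[df["A"] < 0.5, "B"].values
    df["C"] = 0.0  # float zero, so column C stays float64
    df.loc[df["A"] < 0.5, "C"] = a * b
    return df


def with_where(df):
    """The numpy.where approach, including the assignment to C."""
    df["C"] = np.where(df["A"] < 0.5, df["A"] * df["B"], 0)
    return df


runs = 20  # arbitrary repeat count
frame = make_frame(10_000)

t_loc = timeit.timeit(lambda: with_loc(frame), number=runs)
t_where = timeit.timeit(lambda: with_where(frame), number=runs)

print(f"loc:   {t_loc / runs * 1e3:.3f} ms per call")
print(f"where: {t_where / runs * 1e3:.3f} ms per call")
```

Absolute timings depend on the machine and the pandas and NumPy versions, so treat the output as a relative comparison only.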

Validation

A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
np.array_equal(temp["C"], np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0))
# True
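On the question's last point about slicing only the needed part and "remembering" the indices: a pandas Series carries its row labels with it, so a result computed on a subset can be written back by label. The following is a minimal sketch of that idea, assuming a unique index; the seed and the names `subset`, `product` and `expected` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
temp = pd.DataFrame(rng.random((100, 3)), columns=["A", "B", "C"])

mask = temp["A"] < 0.5

# Work on the subset only. The product is a Series that still
# carries the original row labels of the selected rows.
subset = temp.loc[mask]
product = subset["A"] * subset["B"]

temp["C"] = 0.0  # float zero keeps column C as float64

# The labels in product.index route each value back to its own row.
temp.loc[product.index, "C"] = product

# Cross-check against the numpy.where one-liner.
expected = np.where(mask, temp["A"] * temp["B"], 0)
print(np.array_equal(temp["C"].to_numpy(), expected))
```

This does the same job as the `np.where` one-liner in more steps, so it is mainly useful when the per-subset calculation is more involved than a single multiplication.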

Upvotes: 1
