MPa

Reputation: 1142

Applying function to each column of a DataFrame depending on a column-specific condition without loop

I want to apply a function to each column of a DataFrame.
The rows it applies to depend on a column-specific condition.
The parameter values to use also depend on the column.

Take this very simple DataFrame:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(data=np.arange(15).reshape(5, 3))
>>> df

    0   1   2
0   0   1   2
1   3   4   5
2   6   7   8
3   9   10  11
4   12  13  14

I want to apply a function to each column using column-specific values contained in an array, say:

>>> multiplier = np.array([0, 100, 1000]) # First column multiplied by 0, second by 100...

I also only want to multiply rows whose index is within a column-specific range, say below the values contained in this array:

>>> limit = np.array([2, 3, 4]) # Only first two elements in first column get multiplied, first three in second column...

What works is this:

>>> for i in range(limit.shape[0]):
...     df.loc[df.index < limit[i], i] = multiplier[i] * df.loc[:, i]
...
>>> df

    0   1   2
0   0   100 2000
1   0   400 5000
2   6   700 8000
3   9   10  11000
4   12  13  14

But this approach is way too slow for the large DataFrames I'm dealing with.

Is there some way to vectorize this?

Upvotes: 1

Views: 70

Answers (1)

Abhi

Reputation: 4233

You could take advantage of the underlying NumPy array.

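import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(15).reshape(5, 3))

multiplier = np.array([0, 100, 1000])
limit = np.array([2, 3, 4])

# Work on the underlying NumPy array instead of going through .loc
df1 = df.values

for i in np.arange(limit.size):
    # Multiply only the first limit[i] rows of column i, in place
    df1[: limit[i], i] = df1[: limit[i], i] * multiplier[i]

df2 = pd.DataFrame(df1)

print(df2)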


     0    1      2
0    0  100    2000
1    0  400    5000
2    6  700    8000
3    9   10   11000
4   12   13      14
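
If you want to drop the Python loop entirely, broadcasting can build the whole row/column mask at once. This is only a sketch under the assumption that the DataFrame has the default 0..n-1 index, so that "index below limit" is the same as "row position below limit". The names row_pos, mask and result are mine, not from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(15).reshape(5, 3))
multiplier = np.array([0, 100, 1000])
limit = np.array([2, 3, 4])

values = df.to_numpy()

# Row positions as a column vector, shape (5, 1). Comparing it with
# limit, shape (3,), broadcasts to a boolean mask of shape (5, 3).
# Assumption: the index is the default RangeIndex, so position == label.
row_pos = np.arange(values.shape[0])[:, None]
mask = row_pos < limit

# Multiply where the mask is True, keep the original value elsewhere.
result = pd.DataFrame(
    np.where(mask, values * multiplier, values),
    index=df.index,
    columns=df.columns,
)
print(result)
```

Unlike the in-place loop, this builds a new array and leaves df untouched. It also computes values * multiplier for every cell before masking, so it trades some extra memory for having no per-column loop.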

Performance:

# Your implementation
%timeit for i in range(limit.shape[0]): df.loc[df.index<limit[i], i] = multiplier[i] * df.loc[:, i]
3.92 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Numpy implementation (High Performance Gain)
%timeit for i in np.arange(limit.size): df1[: limit[i], i] = df1[: limit[i], i] * multiplier[i]
25 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
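
One caveat with timing in-place updates like this: every repetition multiplies the already-multiplied data again. If you want to check the gap on your own data, a rough harness along these lines runs each approach on a fresh copy. The sizes, the random inputs and the function names loc_loop and numpy_loop are made up for illustration, and the copy cost is included in both timings.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_cols = 10_000, 50  # hypothetical sizes, adjust to your data
base = rng.integers(0, 100, size=(n_rows, n_cols))
multiplier = rng.integers(1, 10, size=n_cols)
limit = rng.integers(0, n_rows, size=n_cols)


def loc_loop(values):
    # The .loc-based loop from the question, run on a fresh DataFrame.
    df = pd.DataFrame(values.copy())
    for i in range(limit.size):
        df.loc[df.index < limit[i], i] = multiplier[i] * df.loc[:, i]
    return df.to_numpy()


def numpy_loop(values):
    # The NumPy slicing loop, run on a fresh copy of the array.
    out = values.copy()
    for i in range(limit.size):
        out[: limit[i], i] *= multiplier[i]
    return out


# Both approaches should agree before their timings are compared.
assert np.array_equal(loc_loop(base), numpy_loop(base))

for fn in (loc_loop, numpy_loop):
    seconds = timeit.timeit(lambda: fn(base), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

The numbers will depend on your machine and on the shape of the data, so treat them as a relative comparison only.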

Upvotes: 1
