Reputation: 1142
I want to apply a function to each column of a DataFrame.
Which rows to apply this to depends on some column-specific condition.
The parameter values to use also depend on the column.
Take this very simple DataFrame:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(data=np.arange(15).reshape(5, 3))
>>> df
    0   1   2
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
I want to apply a function to each column using column-specific values contained in an array, say:
>>> multiplier = np.array([0, 100, 1000]) # First column multiplied by 0, second by 100...
I also only want to multiply rows whose index is within a column-specific range, say below the values contained in this array:
>>> limit = np.array([2, 3, 4]) # Only the first two elements in the first column get multiplied, the first three in the second column...
What works is this:
>>> for i in range(limit.shape[0]):
...     df.loc[df.index < limit[i], i] = multiplier[i] * df.loc[:, i]
...
>>> df
    0    1      2
0   0  100   2000
1   0  400   5000
2   6  700   8000
3   9   10  11000
4  12   13     14
But this approach is way too slow for the large DataFrames I'm dealing with.
Is there some way to vectorize this?
Upvotes: 1
Views: 70
Reputation: 4233
You could take advantage of the underlying NumPy array.
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(15).reshape(5, 3))
multiplier = np.array([0, 100, 1000])
limit = np.array([2, 3, 4])
df1 = df.to_numpy(copy=True)  # plain NumPy array, copied so it can be modified in place
for i in np.arange(limit.size):
    # The slice selects the first limit[i] rows of column i by position
    df1[: limit[i], i] = df1[: limit[i], i] * multiplier[i]
df2 = pd.DataFrame(df1)
print(df2)
    0    1      2
0   0  100   2000
1   0  400   5000
2   6  700   8000
3   9   10  11000
4  12   13     14
Performance:
# Your implementation
%timeit for i in range(limit.shape[0]): df.loc[df.index<limit[i], i] = multiplier[i] * df.loc[:, i]
3.92 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Numpy implementation (High Performance Gain)
%timeit for i in np.arange(limit.size): df1[: limit[i], i] = df1[: limit[i], i] * multiplier[i]
25 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
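If you want to drop the Python loop entirely, one possible sketch uses NumPy broadcasting to build a boolean mask and `np.where` to pick between the original and the multiplied values. The names `mask`, `values` and `result` are introduced here for illustration only, and this variant is not included in the timings above, so treat any speed-up as something to measure on your own data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(15).reshape(5, 3))
multiplier = np.array([0, 100, 1000])
limit = np.array([2, 3, 4])

# Boolean mask of shape (n_rows, n_cols): True where the row's index value
# is below that column's limit. A column vector of index values is
# broadcast against the row vector of limits.
mask = df.index.to_numpy()[:, None] < limit

values = df.to_numpy()
# Multiply each column by its multiplier (broadcast along rows), then keep
# the product only where the mask is True and the original value elsewhere.
result = pd.DataFrame(
    np.where(mask, values * multiplier, values),
    index=df.index,
    columns=df.columns,
)
print(result)
```

This computes the product for every cell and then discards the ones outside the mask, so it trades some extra arithmetic and memory for having no loop over columns.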
Upvotes: 1