Pandas Vectorization with Function on Parts of Column

Question

So I have a dataframe that looks something like this:

import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [5, 7, 8], [2, 5, 4]])
print(df1)
   0  1  2
0  1  2  3
1  5  7  8
2  2  5  4

I then have a function, add5, that adds 5 to a number. I'm trying to create a new column in df1 that adds 5 to every value in column 2 that is greater than 3. I want to use vectorization rather than apply, because this will be extended to a dataset with hundreds of thousands of entries and speed will matter. Without the greater-than-3 constraint, I can do it like this:

df1['3'] = add5(df1[2])

But my goal is to do something like this:

df1['3'] = add5(df1[2]) if df1[2] > 3

Hoping someone can point me in the right direction on this. Thanks!

jpp · Accepted Answer

With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
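To make the distinction concrete, here is a minimal sketch. The body of add5 is an assumption, since the question does not show it; it is taken to be plain arithmetic, which is what allows it to be called on a whole column at once.

```python
import pandas as pd

def add5(x):
    # Assumed body: the question only says add5 "adds 5 to a number".
    return x + 5

df1 = pd.DataFrame([[1, 2, 3], [5, 7, 8], [2, 5, 4]])

# Implicit Python-level loop: add5 is called once per element.
looped = df1[2].apply(add5)

# Vectorised: add5 is called once, on the whole Series.
vectorised = add5(df1[2])

print(looped.tolist())
print(vectorised.tolist())
```

Both calls produce the same values; the difference is only in where the looping happens.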

In this case, you can use numpy.where:

import numpy as np

df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
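If you want to keep calling add5 rather than inlining the + 5, the same pattern might look like the sketch below. The body of add5 is again an assumption, and this only works if add5 accepts a whole Series.

```python
import numpy as np
import pandas as pd

def add5(x):
    # Assumed body, as the question does not show it.
    return x + 5

df1 = pd.DataFrame([[1, 2, 3], [5, 7, 8], [2, 5, 4]])

# np.where(condition, value_if_true, value_if_false), element by element.
df1[3] = np.where(df1[2] > 3, add5(df1[2]), df1[2])

print(df1)
```

Note that add5(df1[2]) is evaluated for every row before np.where picks values, so the function still runs on rows that fail the condition; only its result is discarded there.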

Alternatively, you can use pd.DataFrame.loc in a couple of steps:

df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5

In each case, the term df1[2] > 3 creates a Boolean series, which is then used to mask another series.
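As a rough illustration of the masking step, the sketch below stores the Boolean series in a variable (mask is a name chosen here, not one from the question) and uses it on both sides of the assignment, so that add5 is only called on the selected rows. The body of add5 is an assumption.

```python
import pandas as pd

def add5(x):
    # Assumed body, as the question does not show it.
    return x + 5

df1 = pd.DataFrame([[1, 2, 3], [5, 7, 8], [2, 5, 4]])

mask = df1[2] > 3      # Boolean Series: one True/False per row
print(mask.tolist())   # [False, True, True]

df1[3] = df1[2]        # start from a copy of column 2
# Call add5 only on the selected rows and write back to those same rows.
df1.loc[mask, 3] = add5(df1.loc[mask, 2])

print(df1)
```

This variant may be preferable when the function is expensive or is not valid for rows that fail the condition.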

Result:

print(df1)

   0  1  2   3
0  1  2  3   3
1  5  7  8  13
2  2  5  4   9
