Reputation: 47
I want to model this excel formula for two dataframe columns, df1 and df2, row by row:
=IF(df1 > df2; df2; df2 - df1)
I know this can be done with map in Python, but don't know how. I can do it with a for loop and an if-else statement, but it makes my code harder to read.
Can you help me?
Upvotes: 1
Views: 287
Reputation: 18211
This can be achieved through
df2 - df1*(df2 > df1)
For example,
In [51]: df1
Out[51]:
0 0.588922
1 0.876270
2 0.276917
3 0.521514
dtype: float64
In [52]: df2
Out[52]:
0 0.628234
1 0.492461
2 0.782199
3 0.759758
dtype: float64
In [53]: df2 - df1*(df2 > df1)
Out[53]:
0 0.039312
1 0.492461
2 0.505282
3 0.238245
dtype: float64
Performance-wise, the three solutions posted so far behave as follows on a (100000, 2)-shaped DataFrame:
In [124]: df = pd.DataFrame()
In [125]: df['df1'] = np.random.uniform(size=10**5)
In [126]: df['df2'] = np.random.uniform(size=10**5)
In [127]: df1 = df.df1
In [128]: df2 = df.df2
In [130]: %timeit df.apply(lambda x : x["df2"] if x["df1"] > x["df2"] else x["df2"] - x["df1"], axis=1)
1 loop, best of 3: 8.97 s per loop
In [131]: %timeit df2.where(df1 > df2, df2 - df1)
100 loops, best of 3: 3.57 ms per loop
In [132]: %timeit df2 - df1*(df2 > df1)
1000 loops, best of 3: 1.61 ms per loop
Upvotes: 0
Reputation: 10843
I think what you are saying is create a new column that is a function of the columns df1
and df2
in the way in which you describe. (If not correct me if I'm wrong.)
Then this is a case of using the apply
function, something you should get to know well for using pandas. Assume your dataframe is called df
and your new column will be called df3
. Then:
df3 = df.apply(lambda x : x["df2"] if x["df1"] > x["df2"] else x["df2"] - x["df1"], axis=1)
There's a whole lot of stuff going on here, especially because you are applying to the whole DataFrame
(usually you see apply
used with just one column as input - technically, apply
to a Series
- and the syntax is simpler.) The axis=1
is required to explain to pandas that this should be applied to sets of columns (you would think that would be the default but it is not, so always remember to use it when you are applying to the whole DataFrame
.) The lambda
is used to access the two columns and carry out the function; here, x
is abstractly any row of the DataFrame
itself. Basically, this is a way of doing a vectorized mapping. It is one of the most common tools in pandas and you'll find yourself using it a lot.
A good basic apply
tutorial is here, showing mostly applications to a single column (with the simpler syntax I mentioned, where you don't need axis=1
.) Here also is another example of applying to multiple columns.
Upvotes: 1
Reputation:
That's actually a job for mask
or where
.
df2.mask(df1 <= df2, df2 - df1)
Or,
df2.where(df1 > df2, df2 - df1)
More generally, this is from numpy:
np.where(df1 > df2, df2, df2 - df1)
For these sample DataFrames:
df1
Out:
A B C
0 0.446762 -0.435975 0.109038
1 -0.729108 3.670354 0.761667
2 -0.244370 -0.256956 -1.831161
df2
Out:
A B C
0 -1.192108 0.074628 -0.087634
1 -0.324098 0.698479 -0.287896
2 1.807863 -2.564992 -2.361296
The first two yield
A B C
0 -1.192108 0.510603 -0.087634
1 0.405010 0.698479 -0.287896
2 2.052233 -2.564992 -2.361296
np.where
returns a numpy array so you might need to convert it back to a DataFrame:
array([[-1.19210755, 0.51060284, -0.08763422],
[ 0.40500973, 0.69847936, -0.28789618],
[ 2.05223294, -2.56499239, -2.36129577]])
Upvotes: 5