Javier Lorenzini
Javier Lorenzini

Reputation: 47

Learning to use map in Pandas

I want to model this excel formula for two dataframe columns, df1 and df2, row by row:

=IF(df1 > df2; df2; df2 - df1)

I know this can be done with map in Python, but don't know how. I can do it with a for loop and an if-else statement, but it makes my code harder to read.

Can you help me?

Upvotes: 1

Views: 287

Answers (3)

fuglede
fuglede

Reputation: 18211

This can be achieved through

df2 - df1*(df2 > df1)

For example,

In [51]: df1
Out[51]: 
0    0.588922
1    0.876270
2    0.276917
3    0.521514
dtype: float64

In [52]: df2
Out[52]: 
0    0.628234
1    0.492461
2    0.782199
3    0.759758
dtype: float64

In [53]: df2 - df1*(df2 > df1)
Out[53]: 
0    0.039312
1    0.492461
2    0.505282
3    0.238245
dtype: float64

Performance-wise, the three solutions posted so far behave as follows on a (100000, 2)-shaped DataFrame:

In [124]: df = pd.DataFrame()

In [125]: df['df1'] = np.random.uniform(size=10**5)

In [126]: df['df2'] = np.random.uniform(size=10**5)

In [127]: df1 = df.df1

In [128]: df2 = df.df2

In [130]: %timeit df.apply(lambda x : x["df2"] if x["df1"] > x["df2"] else x["df2"] - x["df1"], axis=1)
1 loop, best of 3: 8.97 s per loop

In [131]: %timeit df2.where(df1 > df2, df2 - df1)
100 loops, best of 3: 3.57 ms per loop

In [132]: %timeit df2 - df1*(df2 > df1)
1000 loops, best of 3: 1.61 ms per loop

Upvotes: 0

sparc_spread
sparc_spread

Reputation: 10843

I think what you are saying is create a new column that is a function of the columns df1 and df2 in the way in which you describe. (If not correct me if I'm wrong.)

Then this is a case of using the apply function, something you should get to know well for using pandas. Assume your dataframe is called df and your new column will be called df3. Then:

df3 = df.apply(lambda x : x["df2"] if x["df1"] > x["df2"] else x["df2"] - x["df1"], axis=1)

There's a whole lot of stuff going on here, especially because you are applying to the whole DataFrame (usually you see apply used with just one column as input - technically, apply to a Series - and the syntax is simpler.) The axis=1 is required to explain to pandas that this should be applied to sets of columns (you would think that would be the default but it is not, so always remember to use it when you are applying to the whole DataFrame.) The lambda is used to access the two columns and carry out the function; here, x is abstractly any row of the DataFrame itself. Basically, this is a way of doing a vectorized mapping. It is one of the most common tools in pandas and you'll find yourself using it a lot.

A good basic apply tutorial is here, showing mostly applications to a single column (with the simpler syntax I mentioned, where you don't need axis=1.) Here also is another example of applying to multiple columns.

Upvotes: 1

user2285236
user2285236

Reputation:

That's actually a job for mask or where.

df2.mask(df1 <= df2, df2 - df1)

Or,

df2.where(df1 > df2, df2 - df1)

More generally, this is from numpy:

np.where(df1 > df2, df2, df2 - df1)

For these sample DataFrames:

df1
Out: 
          A         B         C
0  0.446762 -0.435975  0.109038
1 -0.729108  3.670354  0.761667
2 -0.244370 -0.256956 -1.831161

df2
Out: 
          A         B         C
0 -1.192108  0.074628 -0.087634
1 -0.324098  0.698479 -0.287896
2  1.807863 -2.564992 -2.361296

The first two yield

          A         B         C
0 -1.192108  0.510603 -0.087634
1  0.405010  0.698479 -0.287896
2  2.052233 -2.564992 -2.361296

np.where returns a numpy array so you might need to convert it back to a DataFrame:

array([[-1.19210755,  0.51060284, -0.08763422],
       [ 0.40500973,  0.69847936, -0.28789618],
       [ 2.05223294, -2.56499239, -2.36129577]])

Upvotes: 5

Related Questions