Pythonista
Pythonista

Reputation: 79

Pandas: To create a new column based on two column values

Learning column comparison. How to create a new column based on two columns?

I can do two conditions fruit or vegetable. But for the third condition, couldn’t do. :(

df
    basket1     basket2
0   fruit       fruit
1   vegetable   vegetable 
2   vegetable   both
3   fruit       both

The result

Newdf

    basket1    basket2    total
0   fruit      fruit      fruit
1   vegetable  vegetable  vegetable  
2   vegetable  both       Unknown
3   fruit      both      fruit

Thanks a lot for your help!

Upvotes: 1

Views: 10202

Answers (4)

aydow
aydow

Reputation: 3801

Update

Revisiting this, DataFrame.apply is slow AF. Let's look at some other options and then compare.

Other options to DataFrame.apply

  1. numpy.where

This method can be applied when we only have two options. In your case, this is true as we return df.a when df.a == df.b or df.a == 'fruit' and df.b == 'both'. The syntax is np.where(condition, value_if_true, value_if_false).

In [42]: df['np_where'] = np.where(
    ...:     ((df.a == df.b) | ((df.a == 'fruit') & (df.b == 'both'))),
    ...:     df.a,
    ...:     'Unknown'
    ...: )
  1. numpy.select

You would use this option if you had multiple conditions. The syntax for this is np.select(condition, values, default) where default is an optional parameter.

In [43]: conditions = df.a == df.b, (df.a == 'fruit') & (df.b == 'both')

In [44]: choices = df['a'], df['a']

In [45]: df['np_select'] = np.select(conditions, choices, default='Unknown')

Note that for the purposes of demonstration I have created two conditions even if the outcomes yield the same result.

Comparing the options

As you can see, all three methods have the same outcome.

In [47]: df
Out[47]:
           a          b   np_where  np_select   df_apply
0      fruit      fruit      fruit      fruit      fruit
1  vegetable  vegetable  vegetable  vegetable  vegetable
2  vegetable       both    Unknown    Unknown    Unknown
3      fruit       both      fruit      fruit      fruit

But how do they compare in terms of speed? To check this, let's create a newer, larger DataFrame. We're doing this to see how our options perform with larger amounts of data.

In [48]: df_large = pd.DataFrame({
    ...:     'a': np.random.choice(['fruit', 'vegetable'], size=1_000_000),
    ...:     'b': np.random.choice(['fruit', 'vegetable', 'both'], size=1_000_000)
    ...: })

In [49]: %timeit df_large['np_where'] = np.where(((df_large.a == df_large.b) | ((df_large.a == 'fruit')
    ...:  & (df_large.b == 'both'))), df_large.a, 'Unknown')
379 ms ± 64.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [50]: %timeit df_large['np_select'] = np.select(((df_large.a == df_large.b), ((df_large.a == 'fruit'
    ...: ) & (df_large.b == 'both'))), (df_large.a, df_large.a), default='Unknown')
580 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [51]: %timeit df_large['df_apply'] = df_large.apply(total, axis=1)
40.5 s ± 6 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Wow! As you can see DataFrame.apply is much slower than our other two options and np.where edges out np.select.

Conclusions

  • Use np.where if you have only two choices
  • Use np.select if you have multiple choices
  • Don't use DataFrame.apply (especially for large data sets)!

Resources


Create your own function and use DataFrame.apply

In [104]: def total(r):
     ...:     if r.a == r.b:
     ...:         return r.a
     ...:     elif (r.a == 'fruit') and (r.b == 'both'):
     ...:         return r.a
     ...:     return 'Unknown'
     ...:

In [105]: df = pd.DataFrame({'a': ['fruit', 'vegetable', 'vegetable', 'fruit'], 'b': ['fruit', 'vegetable', 'both', 'both']})

In [106]: df
Out[106]:
           a          b
0      fruit      fruit
1  vegetable  vegetable
2  vegetable       both
3      fruit       both

In [107]: df['total'] = df.apply(total, axis=1)

In [108]: df
Out[108]:
           a          b      total
0      fruit      fruit      fruit
1  vegetable  vegetable  vegetable
2  vegetable       both    Unknown
3      fruit       both      fruit

Upvotes: 7

Srce Cde
Srce Cde

Reputation: 1824

df["total"] = df.apply(lambda x: x.a if (x.a == x.b) or ((x.a == 'fruit') and (x.b == 'both')) else 'Unkonwn', axis = 1)

Output

           a          b      total
0      fruit      fruit      fruit
1  vegetable  vegetable  vegetable
2  vegetable       both    Unkonwn
3      fruit       both      fruit

Upvotes: 3

Sociopath
Sociopath

Reputation: 13401

Here's solution using np.select

df['total'] = np.select([df['a']==df['b'], (df['a']=='fruit')&(df['b']=='both')], [df['a'], 'fruit'], 'Unkown')

Output:

    a          b           total
0   fruit      fruit       fruit
1   vegetable  vegetable   vegetable
2   vegetable  both        Unknown 
3   fruit      both        fruit

Upvotes: 1

Conner
Conner

Reputation: 31040

data.csv

basket1,basket2
fruit,fruit
vegetable,vegetable
vegetable,both
fruit,both

code.py

import pandas as pd

df = pd.read_csv('data.csv')
for i, r in df.iterrows():
    if df.at[i, 'basket1'] == df.at[i, 'basket2']:
        df.at[i, 'total'] = df.at[i, 'basket1']
    else:
        df.at[i, 'total'] = 'something else'

Output:

     basket1    basket2           total
0      fruit      fruit           fruit
1  vegetable  vegetable       vegetable
2  vegetable       both  something else
3      fruit       both  something else

Upvotes: 0

Related Questions