Reputation: 79
Learning column comparison. How to create a new column based on two columns?
I can do two conditions fruit or vegetable. But for the third condition, couldn’t do. :(
df
basket1 basket2
0 fruit fruit
1 vegetable vegetable
2 vegetable both
3 fruit both
The result
Newdf
basket1 basket2 total
0 fruit fruit fruit
1 vegetable vegetable vegetable
2 vegetable both Unknown
3 fruit both fruit
Thanks a lot for your help!
Upvotes: 1
Views: 10202
Reputation: 3801
Update
Revisiting this, DataFrame.apply
is slow AF. Let's look at some other options and then compare.
DataFrame.apply
numpy.where
This method can be applied when we only have two options. In your case, this is true as we return df.a
when df.a == df.b
or df.a == 'fruit' and df.b == 'both'
. The syntax is np.where(condition, value_if_true, value_if_false)
.
In [42]: df['np_where'] = np.where(
...: ((df.a == df.b) | ((df.a == 'fruit') & (df.b == 'both'))),
...: df.a,
...: 'Unknown'
...: )
numpy.select
You would use this option if you had multiple conditions. The syntax for this is np.select(condition, values, default)
where default
is an optional parameter.
In [43]: conditions = df.a == df.b, (df.a == 'fruit') & (df.b == 'both')
In [44]: choices = df['a'], df['a']
In [45]: df['np_select'] = np.select(conditions, choices, default='Unknown')
Note that for the purposes of demonstration I have created two conditions even if the outcomes yield the same result.
As you can see, all three methods have the same outcome.
In [47]: df
Out[47]:
a b np_where np_select df_apply
0 fruit fruit fruit fruit fruit
1 vegetable vegetable vegetable vegetable vegetable
2 vegetable both Unknown Unknown Unknown
3 fruit both fruit fruit fruit
But how do they compare in terms of speed? To check this, let's create a newer, larger DataFrame
. We're doing this to see how our options perform with larger amounts of data.
In [48]: df_large = pd.DataFrame({
...: 'a': np.random.choice(['fruit', 'vegetable'], size=1_000_000),
...: 'b': np.random.choice(['fruit', 'vegetable', 'both'], size=1_000_000)
...: })
In [49]: %timeit df_large['np_where'] = np.where(((df_large.a == df_large.b) | ((df_large.a == 'fruit')
...: & (df_large.b == 'both'))), df_large.a, 'Unknown')
379 ms ± 64.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [50]: %timeit df_large['np_select'] = np.select(((df_large.a == df_large.b), ((df_large.a == 'fruit'
...: ) & (df_large.b == 'both'))), (df_large.a, df_large.a), default='Unknown')
580 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [51]: %timeit df_large['df_apply'] = df_large.apply(total, axis=1)
40.5 s ± 6 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Wow! As you can see DataFrame.apply
is much slower than our other two options and np.where
edges out np.select
.
np.where
if you have only two choicesnp.select
if you have multiple choicesDataFrame.apply
(especially for large data sets)!Create your own function and use DataFrame.apply
In [104]: def total(r):
...: if r.a == r.b:
...: return r.a
...: elif (r.a == 'fruit') and (r.b == 'both'):
...: return r.a
...: return 'Unknown'
...:
In [105]: df = pd.DataFrame({'a': ['fruit', 'vegetable', 'vegetable', 'fruit'], 'b': ['fruit', 'vegetable', 'both', 'both']})
In [106]: df
Out[106]:
a b
0 fruit fruit
1 vegetable vegetable
2 vegetable both
3 fruit both
In [107]: df['total'] = df.apply(total, axis=1)
In [108]: df
Out[108]:
a b total
0 fruit fruit fruit
1 vegetable vegetable vegetable
2 vegetable both Unknown
3 fruit both fruit
Upvotes: 7
Reputation: 1824
df["total"] = df.apply(lambda x: x.a if (x.a == x.b) or ((x.a == 'fruit') and (x.b == 'both')) else 'Unkonwn', axis = 1)
Output
a b total
0 fruit fruit fruit
1 vegetable vegetable vegetable
2 vegetable both Unkonwn
3 fruit both fruit
Upvotes: 3
Reputation: 13401
Here's solution using np.select
df['total'] = np.select([df['a']==df['b'], (df['a']=='fruit')&(df['b']=='both')], [df['a'], 'fruit'], 'Unkown')
Output:
a b total
0 fruit fruit fruit
1 vegetable vegetable vegetable
2 vegetable both Unknown
3 fruit both fruit
Upvotes: 1
Reputation: 31040
data.csv
basket1,basket2
fruit,fruit
vegetable,vegetable
vegetable,both
fruit,both
code.py
import pandas as pd
df = pd.read_csv('data.csv')
for i, r in df.iterrows():
if df.at[i, 'basket1'] == df.at[i, 'basket2']:
df.at[i, 'total'] = df.at[i, 'basket1']
else:
df.at[i, 'total'] = 'something else'
Output:
basket1 basket2 total
0 fruit fruit fruit
1 vegetable vegetable vegetable
2 vegetable both something else
3 fruit both something else
Upvotes: 0