Reputation: 2543
I have written some pandas code that is equivalent to this toy example:
df_test = pd.DataFrame({'product': [0, 0, 1, 1], 'sold_for': [5000, 4500, 10000, 8000]})
def product0_makes_profit(row, product0_cost):
return row['sold_for'] > product0_cost
def product1_makes_profit(row, product1_cost):
return row['sold_for'] > product1_cost
df_test['made_profit'] = df_test[df_test['product']==0].apply(product0_makes_profit, args=[4000], axis=1, result_type="expand")
df_test['made_profit'] = df_test[df_test['product']==1].apply(product1_makes_profit, args=[9000], axis=1, result_type="expand")
df_test
I get the following result:
product sold_for made_profit
0 0 5000 NaN
1 0 4500 NaN
2 1 10000 True
3 1 8000 False
I would expect that column 'made_profit' is True for row 0 and 1, instead of NaN, but apparently the second apply() overwrites the made_profit column, made by the first apply().
How can I get the column I expect? I don't want to make a column 'product0_made_profit' with the first apply() and a column 'product1_made_profit' with the second apply(), so I can merge both columns into the one 'made_profit' column that I want to obtain, since in my real code, I have a lot of different values in the product column (meaning lots of different functions to apply).
EDIT
I made my toy example too simple, I actually create two new columns:
def product0_makes_profit(row, product0_cost):
return [row['sold_for'] > product0_cost, row['sold_for'] - product0_cost]
def product1_makes_profit(row, product1_cost):
return [row['sold_for'] > product1_cost, row['sold_for'] - product1_cost]
Using the current answer, I made this:
is_prod0 = (df_test['product']==0)
df_test.loc[is_prod0, ['made_profit', 'profit_amount']] = df_test[is_prod0].apply(product0_makes_profit, args=[4000], axis=1, result_type="expand")
is_prod1 = (df_test['product']==1)
df_test.loc[is_profd1, ['made_profit', 'profit_amount']] = df_test[is_prod1].apply(product1_makes_profit, args=[9000], axis=1, result_type="expand")
print(df_test)
But that gives me the following error (on the first use of .loc):
KeyError: "None of [Index(['made_profit', 'profit_amount'], dtype='object')] are in the [columns]"
I can make it work with the following code:
is_prod0 = (df_test['product']==0)
newdf = df_test[is_prod0].apply(product0_makes_profit, args=[4000], axis=1, result_type="expand")
is_prod1 = (df_test['product']==1)
newerdf = df_test[is_prod1].apply(product1_makes_profit, args=[9000], axis=1, result_type="expand")
newcols = pd.concat([newdf, newerdf])
newcols.columns = ['was_profit_made', 'profit_amount']
df_test.join(newcols)
However, this involves concat() and join() and as said above, gets a little tedious on the real-life code (but feasible by building a loop over all product values) - maybe there is an elegant solution for multiple columns too.
Upvotes: 2
Views: 104
Reputation: 862406
You need assign to filtered rows with same condition with loc
, so processes only rows if conditions is True
:
m1 = df_test['product']==0
m2 = df_test['product']==1
df_test.loc[m1, 'made_profit'] = df_test[m1].apply(product0_makes_profit, args=[4000], axis=1, result_type="expand")
df_test.loc[m2, 'made_profit'] = df_test[m2].apply(product1_makes_profit, args=[9000], axis=1, result_type="expand")
print (df_test)
product sold_for made_profit
0 0 5000 True
1 0 4500 True
2 1 10000 True
3 1 8000 False
EDIT:
If return multiple values from function
need return Series
with index by new columns names, also is necessary create new columns filled some default value (e.g. NaN
) before loc
:
cols = ['made_profit', 'profit_amount']
def product0_makes_profit(row, product0_cost):
return pd.Series([row['sold_for'] > product0_cost, row['sold_for'] - product0_cost], index=cols)
def product1_makes_profit(row, product1_cost):
return pd.Series([row['sold_for'] > product1_cost, row['sold_for'] - product1_cost], index=cols)
for c in cols:
df_test[c] = np.nan
is_prod0 = (df_test['product']==0)
df_test.loc[is_prod0, cols] = df_test[is_prod0].apply(product0_makes_profit, args=[4000], axis=1, result_type="expand")
is_prod1 = (df_test['product']==1)
df_test.loc[is_prod1, cols] = df_test[is_prod1].apply(product1_makes_profit, args=[9000], axis=1, result_type="expand")
print(df_test)
product sold_for made_profit profit_amount
0 0 5000 True 1000.0
1 0 4500 True 500.0
2 1 10000 True 1000.0
3 1 8000 False -1000.0
Upvotes: 2