Conditional creation of a DataFrame column, where the calculation of the column values changes based on row input

Question

I have a very long and wide dataframe. I'd like to create a new column in that dataframe, where the value depends on many other columns in the df. The calculation needed for the values in this new column ALSO changes, depending on a value in some other column.

The answers to this question and this question come close, but don't quite work out for me.

I'll eventually have about 30 different calculations that could be applied, so I'm not too keen on the np.where function, which is not very readable with that many conditions.

I've also been strongly advised against doing a for-loop over all rows in a dataframe, because it's supposed to be awful for performance (please correct me if I'm wrong there).

What I've tried to do instead:

import pandas as pd
import numpy as np

# Information in my columns looks something like this:
df = pd.DataFrame()
df['text'] = ['dab', 'def', 'bla', 'zdag', 'etc']
df['values1'] = [3 , 4, 2, 5, 2]
df['values2'] = [6, 3, 21, 44, 22]
df['values3'] = [103, 444, 33, 425, 200]

# lists to check against to decide upon which calculation is required
someList = ['dab', 'bla']
someOtherList = ['def', 'zdag']
someThirdList = ['etc']

conditions = [
    (df['text'] is None),
    (df['text'] in someList),
    (df['text'] in someOtherList),
    (df['text'] in someThirdList)]
choices = [0, 
           round(df['values2'] * 0.5 * df['values3'], 2), 
           df['values1'] + df['values2'] - df['values3'], 
           df['values1'] + 249]
df['mynewvalue'] = np.select(conditions, choices, default=0)
print(df)

I expect that, based on the row values in df['text'], the right calculation is applied to the same row of df['mynewvalue'].

Instead, I get the error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I program this instead, so that I can use this kind of condition to define the right calculation for this df['mynewvalue'] column?

Alexandre B. · Accepted Answer

The errors come from the conditions:

conditions = [
    ... ,
    (df['text'] in someList),
    (df['text'] in someOtherList),
    (df['text'] in someThirdList)]

Each of these asks whether a whole Series is in a list. Python's in operator compares the Series with each list element, and each comparison gives one boolean per row rather than a single True or False. As the error suggests, you have to decide whether the condition holds when at least one element satisfies it (any) or when all the elements satisfy it (all).
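A minimal sketch of that ambiguity, using a small made-up Series (the names text and mask are only for illustration): the comparison gives one boolean per row, any and all collapse it to a single value, and asking for its truth value directly raises the error from the question.

```python
import pandas as pd

# Illustrative data only, not the asker's full dataframe.
text = pd.Series(['dab', 'def', 'bla'])

# Element-wise comparison: one boolean per row.
mask = text == 'dab'
print(mask.tolist())   # [True, False, False]

# Collapsing the mask to a single value.
print(mask.any())      # True: at least one row matches
print(mask.all())      # False: not every row matches

# Using the Series itself as a single truth value is ambiguous.
try:
    bool(mask)
except ValueError as err:
    print(err)
```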

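One solution is to use isin (doc), which tests each row separately and returns a boolean Series, instead of collapsing everything to a single value with any or all (doc).

Here using isin: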

import pandas as pd
import numpy as np

# Information in my columns looks something like this:
df = pd.DataFrame()

df['text'] = ['dab', 'def', 'bla', 'zdag', 'etc']
df['values1'] = [3, 4, 2, 5, 2]
df['values2'] = [6, 3, 21, 44, 22]
df['values3'] = [103, 444, 33, 425, 200]

# lists to test against to decide which calculation is required
someList = ['dab', 'bla']
someOtherList = ['def', 'zdag']
someThirdList = ['etc']

conditions = [
    (df['text'].isna()),  # row-wise missing-value test; `is None` checks the Series object itself and is always False
    (df['text'].isin(someList)),
    (df['text'].isin(someOtherList)),
    (df['text'].isin(someThirdList))]
choices = [0,
           round(df['values2'] * 0.5 * df['values3'], 2),
           df['values1'] + df['values2'] - df['values3'],
           df['values1'] + 249]
df['mynewvalue'] = np.select(conditions, choices, default=0)
print(df)
#    text  values1  values2  values3  mynewvalue
# 0   dab        3        6      103       309.0
# 1   def        4        3      444      -437.0
# 2   bla        2       21       33       346.5
# 3  zdag        5       44      425      -376.0
# 4   etc        2       22      200       251.0
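With around 30 calculations, as mentioned in the question, it may be easier to keep each label list next to its calculation and build conditions and choices from that. The following is only a sketch of one possible layout: the rules list and the lambda helpers are hypothetical names, not part of the original answer, and it assumes each calculation can be written as a vectorized expression on the dataframe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'text': ['dab', 'def', 'bla', 'zdag', 'etc'],
    'values1': [3, 4, 2, 5, 2],
    'values2': [6, 3, 21, 44, 22],
    'values3': [103, 444, 33, 425, 200],
})

# Hypothetical rule table: (labels that trigger the rule, vectorized calculation).
rules = [
    (['dab', 'bla'], lambda d: (d['values2'] * 0.5 * d['values3']).round(2)),
    (['def', 'zdag'], lambda d: d['values1'] + d['values2'] - d['values3']),
    (['etc'], lambda d: d['values1'] + 249),
]

# One boolean mask and one candidate result column per rule.
conditions = [df['text'].isin(labels) for labels, _ in rules]
choices = [calc(df) for _, calc in rules]

# Rows matching no rule (including missing text) fall back to the default.
df['mynewvalue'] = np.select(conditions, choices, default=0)
print(df)
```

Each calculation is evaluated for every row and np.select then picks the matching one per row, so adding a rule means adding one line to the table.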
