Create a variable in a Pandas dataframe based on information in the dataframe

Question

I have a dataframe organized in the following way

    var1   var2   var3   var4
0   A      23     B      7
1   B      13     C      4
2   C      12     A      11
3   A      5      C      15

I now want to create a new variable (column), var5, which takes the value of var2 if var1 == A and the value of var4 if var3 == A. For simplicity, var1 and var3 can never both have the value A. If neither var1 or var3 takes value A, then I want NaN. That is, the outcome in this example would be:

    var1   var2   var3   var4  var5
0   A      23     B      7     23
1   B      13     C      4     NaN
2   C      12     A      11    11
3   A      5      C      15    5

How can this be achieved?

cs95 · Accepted Answer

Option 1
Sounds like you can use np.where for this -

i = df.var1 == 'A'
j = df.var3 == 'A'

df['var5'] = np.where(i, df.var2, np.where(j, df.var4, np.NaN))
df

  var1  var2 var3  var4  var5
0    A    23    B     7  23.0
1    B    13    C     4   NaN
2    C    12    A    11  11.0
3    A     5    C    15   5.0

Option 2
An alternative would be np.select -

df['var5'] = np.select([i, j], [df.var2, df.var4], default=np.nan)
df

  var1  var2 var3  var4  var5
0    A    23    B     7  23.0
1    B    13    C     4   NaN
2    C    12    A    11  11.0
3    A     5    C    15   5.0

Note, i and j are the same variables defined in the code listing for Option 1.

Option 3
pd.Series.mask/where

df.var2.mask(~i, df.var4.mask(~j, np.nan))

0    23.0
1     NaN
2    11.0
3     5.0
Name: var2, dtype: float64

Create a variable in a Pandas dataframe based on information in the dataframe

Answers (2)

Related Questions