matnor
matnor

Reputation: 511

Create a variable in a Pandas dataframe based on information in the dataframe

I have a dataframe organized in the following way

    var1   var2   var3   var4
0   A      23     B      7
1   B      13     C      4
2   C      12     A      11
3   A      5      C      15

I now want to create a new variable (column), var5, which takes the value of var2 if var1 == A and the value of var4 if var3 == A. For simplicity, var1 and var3 can never both have the value A. If neither var1 or var3 takes value A, then I want NaN. That is, the outcome in this example would be:

    var1   var2   var3   var4  var5
0   A      23     B      7     23
1   B      13     C      4     NaN
2   C      12     A      11    11
3   A      5      C      15    5

How can this be achieved?

Upvotes: 2

Views: 12242

Answers (2)

Tai
Tai

Reputation: 7994

Throw out my simple yet might not be fast answer. (See the comments and other answers if aiming for performance.

df = pd.DataFrame([['A', 23, 'B', 7], ['B', '13', 'C', 4], 
                   ['c', 12, 'A', 11], ['A', 5, 'C', 15]],
                   columns=['v1', 'v2', 'v3', 'v4'])

def get_val(row):
    if row.v1 == 'A':
        return row.v2
    elif row.v3 == 'A':
        return row.v4
    else:
        return np.nan

df["v5"] = df.apply(get_val, axis=1)

What the code did is it defines a function to return a value based on each row. Use apply with the function.

enter image description here

Upvotes: 1

cs95
cs95

Reputation: 402814

Option 1
Sounds like you can use np.where for this -

i = df.var1 == 'A'
j = df.var3 == 'A'
df['var5'] = np.where(i, df.var2, np.where(j, df.var4, np.NaN))
df

  var1  var2 var3  var4  var5
0    A    23    B     7  23.0
1    B    13    C     4   NaN
2    C    12    A    11  11.0
3    A     5    C    15   5.0

Option 2
An alternative would be np.select -

df['var5'] = np.select([i, j], [df.var2, df.var4], default=np.nan)
df

  var1  var2 var3  var4  var5
0    A    23    B     7  23.0
1    B    13    C     4   NaN
2    C    12    A    11  11.0
3    A     5    C    15   5.0

Note, i and j are the same variables defined in the code listing for Option 1.


Option 3
pd.Series.mask/where

df.var2.mask(~i, df.var4.mask(~j, np.nan))

0    23.0
1     NaN
2    11.0
3     5.0
Name: var2, dtype: float64

Upvotes: 4

Related Questions