Reputation: 511
I have a dataframe organized in the following way
  var1  var2 var3  var4
0    A    23    B     7
1    B    13    C     4
2    C    12    A    11
3    A     5    C    15
I now want to create a new variable (column), var5, which takes the value of var2 if var1 == A and the value of var4 if var3 == A. For simplicity, var1 and var3 can never both have the value A. If neither var1 nor var3 takes the value A, I want NaN. That is, the outcome in this example would be:
  var1  var2 var3  var4  var5
0    A    23    B     7    23
1    B    13    C     4   NaN
2    C    12    A    11    11
3    A     5    C    15     5
How can this be achieved?
Upvotes: 2
Views: 12242
Reputation: 7994
Throwing out my simple, though perhaps not fast, answer. (See the comments and the other answers if you are aiming for performance.)
import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 23, 'B', 7], ['B', 13, 'C', 4],
                   ['C', 12, 'A', 11], ['A', 5, 'C', 15]],
                  columns=['v1', 'v2', 'v3', 'v4'])

def get_val(row):
    # Take v2 when v1 is 'A', v4 when v3 is 'A', otherwise NaN.
    if row.v1 == 'A':
        return row.v2
    elif row.v3 == 'A':
        return row.v4
    else:
        return np.nan

df["v5"] = df.apply(get_val, axis=1)
The code defines a function that returns the desired value for a single row, then applies it to every row with apply(axis=1).
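For reference, running the above produces something like the following (a quick check under this answer's v1..v4 column names, not part of the original answer):

print(df)
#   v1  v2 v3  v4    v5
# 0  A  23  B   7  23.0
# 1  B  13  C   4   NaN
# 2  C  12  A  11  11.0
# 3  A   5  C  15   5.0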
Upvotes: 1
Reputation: 402814
Option 1
Sounds like you can use np.where for this -
i = df.var1 == 'A'
j = df.var3 == 'A'
df['var5'] = np.where(i, df.var2, np.where(j, df.var4, np.nan))
df
  var1  var2 var3  var4  var5
0    A    23    B     7  23.0
1    B    13    C     4   NaN
2    C    12    A    11  11.0
3    A     5    C    15   5.0
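A small aside (my observation, not from the original answer): because NaN is a float, var5 is upcast to float64 even though var2 and var4 hold integers -

df['var5'].dtype
# dtype('float64')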
Option 2
An alternative would be np.select -
df['var5'] = np.select([i, j], [df.var2, df.var4], default=np.nan)
df
  var1  var2 var3  var4  var5
0    A    23    B     7  23.0
1    B    13    C     4   NaN
2    C    12    A    11  11.0
3    A     5    C    15   5.0
Note, i and j are the same variables defined in the code listing for Option 1.
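For a self-contained version of Option 2, the conditions can be written inline (a sketch along the same lines, not a separate listing from the answer):

df['var5'] = np.select(
    [df.var1 == 'A', df.var3 == 'A'],  # conditions, checked in order
    [df.var2, df.var4],                # value to take for each condition
    default=np.nan                     # fallback when no condition matches
)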
Option 3
pd.Series.mask/where -
df.var2.mask(~i, df.var4.mask(~j, np.nan))
0    23.0
1     NaN
2    11.0
3     5.0
Name: var2, dtype: float64
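To attach the result back to the dataframe as var5, assign it as usual (my note; i and j are again the ones from Option 1):

df['var5'] = df.var2.mask(~i, df.var4.mask(~j, np.nan))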
Upvotes: 4