This particular dataset consists of 3 households and their members. Columns 3 and 4 indicate whether a member lives with their parents: the values identify the mother and the father by their number in the Member column, within the same household. For example, member 3 of household 1 lives with their mother (member 2) and father (member 1).
Household Member Lives_with_m Lives_with_f New_col
1 1 NaN NaN 1
1 2 NaN NaN 1
1 3 2 1 1
1 4 NaN NaN 1
1 5 3 4 NaN
2 1 3 NaN NaN
2 2 NaN NaN 1
2 3 2 NaN 2
2 4 3 NaN NaN
3 1 3 2 NaN
3 2 NaN NaN 1
3 3 NaN NaN 1
I've been trying to create a new column with the total number of kids each member has, which means matching the values of columns 3 and 4 against column 2, always within the same household.
I tried a for loop starting only with 'Lives_with_m', and it's quite slow. Most importantly, it doesn't produce the expected results:
for p in set(d['Household']):
    u = d['Member']
    v = d['Lives_with_m']
    if bool(set(u) & set(v)):
        d['New_col'] = 0
I'm not sure if I'm going in the right direction... Any help would be very much appreciated! Here is the data to reproduce the DataFrame:
d = {'group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
     'member': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3],
     'lives_with_m': [np.nan, np.nan, 2, np.nan, 3, 3, np.nan, 2, 3, 3, np.nan, np.nan],
     'lives_with_f': [np.nan, np.nan, 1, np.nan, 4, np.nan, np.nan, np.nan, np.nan, 2, np.nan, np.nan]}
pd.DataFrame(data=d)
Upvotes: 2
Views: 83
One way to go would be as follows:
import pandas as pd
import numpy as np
d = {'group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
     'member': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3],
     'lives_with_m': [np.nan, np.nan, 2, np.nan, 3, 3, np.nan, 2, 3, 3, np.nan, np.nan],
     'lives_with_f': [np.nan, np.nan, 1, np.nan, 4, np.nan, np.nan, np.nan, np.nan, 2, np.nan, np.nan]}
df = pd.DataFrame(data=d)
value_counts = (
    df.groupby('group')[['lives_with_m', 'lives_with_f']]
      .agg({i: 'value_counts' for i in ['lives_with_m', 'lives_with_f']})
      .apply(lambda x: np.nansum(x), axis=1)
)
value_counts.name = 'new'
df = df.merge(value_counts, how='left', left_on=['group','member'], right_index=True)
df
group member lives_with_m lives_with_f new
0 1 1 NaN NaN 1.0
1 1 2 NaN NaN 1.0
2 1 3 2.0 1.0 1.0
3 1 4 NaN NaN 1.0
4 1 5 3.0 4.0 NaN
5 2 1 3.0 NaN NaN
6 2 2 NaN NaN 1.0
7 2 3 2.0 NaN 2.0
8 2 4 3.0 NaN NaN
9 3 1 3.0 2.0 NaN
10 3 2 NaN NaN 1.0
11 3 3 NaN NaN 1.0
In short, the steps are:
- df.groupby to isolate ['lives_with_m', 'lives_with_f'] per group
- df.agg to retrieve value_counts for both columns
- np.nansum applied to each row, which leaves a pd.Series that looks like this:
print(value_counts)
group
1 2.0 1.0
3.0 1.0
2 3.0 2.0
2.0 1.0
3 3.0 1.0
1 1.0 1.0
4.0 1.0
3 2.0 1.0
Name: new, dtype: float64
At this stage, we just need to merge the original df with our series on the matching columns (left: ['group', 'member'], right: the index of value_counts).
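For what it's worth, the same counts can also be built without agg by melting both parent columns into one long frame and counting (group, parent) pairs. This is an illustrative alternative sketch, not the method above; it assumes the same column names as the example data:

```python
import numpy as np
import pandas as pd

d = {'group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
     'member': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3],
     'lives_with_m': [np.nan, np.nan, 2, np.nan, 3, 3, np.nan, 2, 3, 3, np.nan, np.nan],
     'lives_with_f': [np.nan, np.nan, 1, np.nan, 4, np.nan, np.nan, np.nan, np.nan, 2, np.nan, np.nan]}
df = pd.DataFrame(data=d)

# Stack both parent columns into one long column of (group, parent) pairs,
# dropping members who don't live with a parent.
parents = df.melt(id_vars='group',
                  value_vars=['lives_with_m', 'lives_with_f'],
                  value_name='parent').dropna(subset=['parent'])

# How many times each member is named as a parent within their group,
# i.e. how many kids they have.
kids = parents.groupby(['group', 'parent']).size()

# Look the count up for every (group, member) pair; NaN if never named.
df['new'] = [kids.get((g, float(m)), np.nan)
             for g, m in zip(df['group'], df['member'])]
```

The result matches the 'new' column shown above; the melt-based count avoids the dict-of-aggregations in agg at the cost of an explicit lookup at the end.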
Upvotes: 1