Reputation: 1480
I have a dataset of individuals organized by families. The families always (for this example) follow an order:father, mother, daughter, and son (the children can repeat). I would like to groupby data by family and create a dataset with only children with some information from the parents. Usually, I would use groupby and transform to create this new dataset. However, I don't have an ID for family and the final dataset will not be the same shape of the original dataset since it will exclude parents.
Here is a replicable minimal example,the desired output and what I have been attempting so far:
df=pd.DataFrame({'name':['Stephen Dedalus','Helena','Atena','Odisseu','Filoctetes','Medeia','Antígona','Tebas','Aqulies'],
'age':[30,32,15,16,45,43,12,13,6],
'can_read':['yes','no','yes']*3, 'nationality':['Irish','Greek','Brazilian']*3,
'parental_situation':['chefe', 'dependente-esposa','filha_dependente','filho_dependente',
'chefe', 'dependente-esposa','filha_dependente','filha_dependente',
'filho_dependente']})
df
So, my data is like:
name age can_read nationality parental_situation
0 Stephen Dedalus 30 yes Irish chefe
1 Helena 32 no Greek dependente-esposa
2 Atena 15 yes Brazilian filha_dependente
3 Odisseu 16 yes Irish filho_dependente
4 Filoctetes 45 no Greek chefe
5 Medeia 43 yes Brazilian dependente-esposa
6 Antígona 12 yes Irish filha_dependente
7 Tebas 13 no Greek filha_dependente
8 Aqulies 6 yes Brazilian filho_dependente
Desired output:
name age can_read nationality father_canread mother_canread father_nation mother_nation
Atena 15 yes Brazilian yes no Irish Greek
Odisseu 16 yes Irish yes no Irish Greek
Antígona 12 yes Irish no yes Greek Brazilian
Tebas 13 no Greek no yes Greek Brazilian
Aquiles 6 yes Brazilian no yes Greek Brazilian
What I have been trying
My solution to create the ID variable for each family is the folowing (very inneficient):
df['id']=2 #lastfamily
for k in df.index:
if list(df['parental_situation'][:k]).count('chefe')==1:
df['id'][k-1]=1
elif list(df['parental_situation'][:k]).count('chefe')==2:
df['id'][k-1]=2
Then I create a function to get the information from parents, and proceed to merge and append all sub-datasets:
def get_parents_info(family_id, father=True):
if father:
return df[(df['parental_situation']=='chefe') & (df['id']==family_id)].drop(['name','age','parental_situation'], axis=1).rename(columns={'can_read':'father_canread',
'nationality':'father_nationality'}).reset_index(drop=True)
elif father==False:
return df[(df['parental_situation']=='dependente-esposa') & (df['id']==family_id)].drop(['name','age',
'parental_situation'], axis=1).rename(columns={'can_read':'mother_canread',
'nationality':'mother_nationality'}).reset_index(drop=True)
condition_children = (df['parental_situation']=='filha_dependente') | (df['parental_situation']=='filho_dependente')
final=df.merge(get_parents_info(1,father=False)).merge(get_parents_info(1,father=True)).append(df.merge(get_parents_info(2,father=False)).merge(get_parents_info(2,father=True)))
final[condition_children]
Which gives me exactly what I want. But I think there must be a more efficient and elegant way to do that (my dataframe is rather a big one). Furthermore,it will be nice if there is a more flexible approach since in my real data the sequence of parental information is not so well behaved. The mother can be the family chief (if the man is dead, for example) or she can come first, even if there is a man as the family chief.
Any help will be appreciated!
P.S: sorry, chefe is equivalent to father, dependente-esposa is equivalent to mother. The remaining are all children
Upvotes: 1
Views: 164
Reputation: 142216
There's probably a more efficient way, but with some fiddling about to assign a unique group to each family and some pivoting and adding a column to the children you can kind of get what you want even though it's probably not as flat as you'd like...
# rename to mother/father and we can assume NaN to be child
df['category'] = df['parental_situation'].map({'chefe': 'father', 'dependente-esposa': 'mother'})
# assign a family group
is_parent = df['category'].notna()
df.loc[is_parent, 'group'] = (is_parent ^ is_parent.shift()).astype(int)
df['group'] = df['group'].fillna(0).cumsum()
# Pivot accordingly
parents = df[is_parent].pivot('group', 'category')
# Add extra level to child columns
children = df[~is_parent].set_index('group')
children.columns = pd.MultiIndex.from_product([children.columns, ['child']])
# Do a join...
new = children.join(parents)
This gives you:
name age can_read nationality parental_situation category name age can_read nationality parental_situation
child child child child child child father mother father mother father mother father mother father mother
group
1.0 Atena 15 yes Brazilian filha_dependente NaN Stephen Dedalus Helena 30 32 yes no Irish Greek chefe dependente-esposa
1.0 Odisseu 16 yes Irish filho_dependente NaN Stephen Dedalus Helena 30 32 yes no Irish Greek chefe dependente-esposa
2.0 Antígona 12 yes Irish filha_dependente NaN Filoctetes Medeia 45 43 no yes Greek Brazilian chefe dependente-esposa
2.0 Tebas 13 no Greek filha_dependente NaN Filoctetes Medeia 45 43 no yes Greek Brazilian chefe dependente-esposa
2.0 Aqulies 6 yes Brazilian filho_dependente NaN Filoctetes Medeia 45 43 no yes Greek Brazilian chefe dependente-esposa
You might want to do a swap level or drop some columns before joining but this should give you a general approach that gets you close.
Further thought is to restrict the values in the pivot and be explicit with the columns from the children frame, eg:
# Pivot accordingly
parents = df[is_parent].pivot('group', 'category', ['can_read', 'nationality'])
# Add extra level to child columns
children = df.loc[~is_parent, ['group', 'name', 'age', 'can_read', 'nationality']].set_index('group')
children.columns = pd.MultiIndex.from_product([children.columns, ['child']])
Will then give you:
name age can_read nationality can_read nationality
child child child child father mother father mother
group
1.0 Atena 15 yes Brazilian yes no Irish Greek
1.0 Odisseu 16 yes Irish yes no Irish Greek
2.0 Antígona 12 yes Irish no yes Greek Brazilian
2.0 Tebas 13 no Greek no yes Greek Brazilian
2.0 Aqulies 6 yes Brazilian no yes Greek Brazilian
... which is much closer to your desired result.
Upvotes: 2
Reputation: 294508
cumsum
Find where 'parental_situation'
is 'chefe'
and use cumsum
to create contiguous groups
grp_num = df.parental_situation.eq('chefe').cumsum()
We can see how this works by displaying the dataframe with this column
df.assign(Group=grp_num)
name age can_read nationality parental_situation Group
0 Stephen Dedalus 30 yes Irish chefe 1
1 Helena 32 no Greek dependente-esposa 1
2 Atena 15 yes Brazilian filha_dependente 1
3 Odisseu 16 yes Irish filho_dependente 1
4 Filoctetes 45 no Greek chefe 2
5 Medeia 43 yes Brazilian dependente-esposa 2
6 Antígona 12 yes Irish filha_dependente 2
7 Tebas 13 no Greek filha_dependente 2
8 Aqulies 6 yes Brazilian filho_dependente 2
Or we can create a groupby
object and a subsequent dictionary.
groups = df.groupby(grp_num)
families = {g: family for g, family in groups}
For the purposes of demonstration, I'll print them out
print(*families.values(), sep='\n\n')
name age can_read nationality parental_situation
0 Stephen Dedalus 30 yes Irish chefe
1 Helena 32 no Greek dependente-esposa
2 Atena 15 yes Brazilian filha_dependente
3 Odisseu 16 yes Irish filho_dependente
name age can_read nationality parental_situation
4 Filoctetes 45 no Greek chefe
5 Medeia 43 yes Brazilian dependente-esposa
6 Antígona 12 yes Irish filha_dependente
7 Tebas 13 no Greek filha_dependente
8 Aqulies 6 yes Brazilian filho_dependente
Upvotes: 1