How to groupby when there is a pattern by group but not an ID?

Question

I have a dataset of individuals organized by families. The families always (for this example) follow an order:father, mother, daughter, and son (the children can repeat). I would like to groupby data by family and create a dataset with only children with some information from the parents. Usually, I would use groupby and transform to create this new dataset. However, I don't have an ID for family and the final dataset will not be the same shape of the original dataset since it will exclude parents.

Here is a replicable minimal example,the desired output and what I have been attempting so far:

df=pd.DataFrame({'name':['Stephen Dedalus','Helena','Atena','Odisseu','Filoctetes','Medeia','Antígona','Tebas','Aqulies'],
                      'age':[30,32,15,16,45,43,12,13,6],
              'can_read':['yes','no','yes']*3, 'nationality':['Irish','Greek','Brazilian']*3, 
              'parental_situation':['chefe', 'dependente-esposa','filha_dependente','filho_dependente',
                                   'chefe', 'dependente-esposa','filha_dependente','filha_dependente',
                                   'filho_dependente']})
df

So, my data is like:

    name             age    can_read    nationality       parental_situation
0   Stephen Dedalus  30        yes            Irish           chefe
1   Helena           32        no             Greek        dependente-esposa
2   Atena            15        yes            Brazilian    filha_dependente
3   Odisseu          16        yes            Irish        filho_dependente
4   Filoctetes       45        no             Greek            chefe
5   Medeia           43        yes            Brazilian    dependente-esposa
6   Antígona         12        yes            Irish        filha_dependente
7   Tebas            13        no             Greek        filha_dependente
8   Aqulies          6         yes            Brazilian    filho_dependente

Desired output:

  name      age   can_read      nationality    father_canread   mother_canread father_nation    mother_nation
Atena       15      yes         Brazilian            yes            no              Irish   Greek
Odisseu     16      yes         Irish                yes            no              Irish   Greek
Antígona    12      yes         Irish                 no           yes              Greek   Brazilian
Tebas       13      no          Greek                 no           yes              Greek   Brazilian
Aquiles     6       yes         Brazilian             no           yes              Greek Brazilian

What I have been trying

My solution to create the ID variable for each family is the folowing (very inneficient):

df['id']=2 #lastfamily
for k in df.index:
    if list(df['parental_situation'][:k]).count('chefe')==1:
        df['id'][k-1]=1
    elif list(df['parental_situation'][:k]).count('chefe')==2:
        df['id'][k-1]=2

Then I create a function to get the information from parents, and proceed to merge and append all sub-datasets:

def get_parents_info(family_id, father=True):
    if father:
        return df[(df['parental_situation']=='chefe') & (df['id']==family_id)].drop(['name','age','parental_situation'], axis=1).rename(columns={'can_read':'father_canread',
                                                                   'nationality':'father_nationality'}).reset_index(drop=True)
    elif father==False:
        return df[(df['parental_situation']=='dependente-esposa') & (df['id']==family_id)].drop(['name','age',
                                                                                                'parental_situation'], axis=1).rename(columns={'can_read':'mother_canread',
                                                                   'nationality':'mother_nationality'}).reset_index(drop=True)

condition_children = (df['parental_situation']=='filha_dependente') | (df['parental_situation']=='filho_dependente')

final=df.merge(get_parents_info(1,father=False)).merge(get_parents_info(1,father=True)).append(df.merge(get_parents_info(2,father=False)).merge(get_parents_info(2,father=True)))

final[condition_children]

Which gives me exactly what I want. But I think there must be a more efficient and elegant way to do that (my dataframe is rather a big one). Furthermore,it will be nice if there is a more flexible approach since in my real data the sequence of parental information is not so well behaved. The mother can be the family chief (if the man is dead, for example) or she can come first, even if there is a man as the family chief.

Any help will be appreciated!

P.S: sorry, chefe is equivalent to father, dependente-esposa is equivalent to mother. The remaining are all children

Jon Clements · Accepted Answer

There's probably a more efficient way, but with some fiddling about to assign a unique group to each family and some pivoting and adding a column to the children you can kind of get what you want even though it's probably not as flat as you'd like...

# rename to mother/father and we can assume NaN to be child
df['category'] = df['parental_situation'].map({'chefe': 'father', 'dependente-esposa': 'mother'})

# assign a family group
is_parent = df['category'].notna()
df.loc[is_parent, 'group'] = (is_parent ^ is_parent.shift()).astype(int)
df['group'] = df['group'].fillna(0).cumsum()

# Pivot accordingly
parents = df[is_parent].pivot('group', 'category')
# Add extra level to child columns
children = df[~is_parent].set_index('group')
children.columns = pd.MultiIndex.from_product([children.columns, ['child']])

# Do a join...
new = children.join(parents)

This gives you:

           name   age can_read nationality parental_situation category             name            age        can_read        nationality            parental_situation                   
          child child    child       child              child    child           father  mother father mother   father mother      father     mother             father             mother
group                                                                                                                                                                                     
1.0       Atena    15      yes   Brazilian   filha_dependente      NaN  Stephen Dedalus  Helena     30     32      yes     no       Irish      Greek              chefe  dependente-esposa
1.0     Odisseu    16      yes       Irish   filho_dependente      NaN  Stephen Dedalus  Helena     30     32      yes     no       Irish      Greek              chefe  dependente-esposa
2.0    Antígona    12      yes       Irish   filha_dependente      NaN       Filoctetes  Medeia     45     43       no    yes       Greek  Brazilian              chefe  dependente-esposa
2.0       Tebas    13       no       Greek   filha_dependente      NaN       Filoctetes  Medeia     45     43       no    yes       Greek  Brazilian              chefe  dependente-esposa
2.0     Aqulies     6      yes   Brazilian   filho_dependente      NaN       Filoctetes  Medeia     45     43       no    yes       Greek  Brazilian              chefe  dependente-esposa

You might want to do a swap level or drop some columns before joining but this should give you a general approach that gets you close.

Further thought is to restrict the values in the pivot and be explicit with the columns from the children frame, eg:

# Pivot accordingly
parents = df[is_parent].pivot('group', 'category', ['can_read', 'nationality'])
# Add extra level to child columns
children = df.loc[~is_parent, ['group', 'name', 'age', 'can_read', 'nationality']].set_index('group')
children.columns = pd.MultiIndex.from_product([children.columns, ['child']])

Will then give you:

           name   age can_read nationality can_read        nationality           
          child child    child       child   father mother      father     mother
group                                                                            
1.0       Atena    15      yes   Brazilian      yes     no       Irish      Greek
1.0     Odisseu    16      yes       Irish      yes     no       Irish      Greek
2.0    Antígona    12      yes       Irish       no    yes       Greek  Brazilian
2.0       Tebas    13       no       Greek       no    yes       Greek  Brazilian
2.0     Aqulies     6      yes   Brazilian       no    yes       Greek  Brazilian

... which is much closer to your desired result.

How to groupby when there is a pattern by group but not an ID?

Answers (2)

`cumsum`

Related Questions