Thordis

Reputation: 87

Grouping rows based on two columns to create a third column; find groups with fewer than x members and merge them into other groups

I have a large dataframe (100,000 rows) with many columns. These are the relevant columns for my question:

id   herd        birth     H_BY  HYcount      death       H_DY   HYcount2
1    1345   2005-01-09    134505       1  2010-01-09    134510       1
2    1345   2010-03-05    134510       2  2015-01-09    134515       2
3    1345   2010-05-10    134510       2  2015-01-09    134515       2
4    1345   2011-06-01    134511       1  2016-01-09    134516       1
5    1345   2012-09-01    134512       1  2017-01-09    134517       2
6    1345   2015-09-13    134515       1  2017-01-09    134517       2
7    1346   2015-10-01    134615       3  2019-01-09    134619       1
8    1346   2015-10-27    134615       3  2020-01-09    134620       2
9    1346   2015-11-10    134615       3  2020-01-09    134620       2
10   1346   2016-12-10    134616       1  2021-01-09    134621       1

I am creating herd-year fixed effects. I have already combined the herd and birth/death columns into herd+birth year and herd+death year columns and counted how many times each fixed effect appears in the dataframe, as can be seen above.
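For context, a minimal sketch of how such a herd-year column and its count can be built (column names taken from the table above; the two-digit-year encoding is my assumption based on the sample values):

```python
import pandas as pd

df = pd.DataFrame({
    'herd': [1345, 1345, 1346],
    'birth': ['2005-01-09', '2010-03-05', '2015-10-01'],
})

# herd + two-digit birth year, e.g. 1345 + 05 -> 134505
birth_year = pd.to_datetime(df['birth']).dt.year
df['H_BY'] = (df['herd'].astype(str) + birth_year.astype(str).str[-2:]).astype(int)

# how many rows share each herd-year fixed effect
df['HYcount'] = df.groupby('H_BY')['H_BY'].transform('count')
```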

However, now I want to check my whole dataframe for HYcount and HYcount2 values that are less than 3, so that no HY group ends up with a count of 1 or 2.

I would like to run through the dataframe and merge the HY groups that have only 1 or 2 members into other groups, either below or above.

EDIT

I also want to combine HY groups WITHIN EACH HERD only!

So the herd-year variable must never move an animal from one herd into another herd.

Here is what I've tried with the birth year fixed effect.

#Sort the df by the relevant value
df= df.sort_values(by=['H_BY'])


df.loc[
    (df['HYcount'] < 3),
    'H_BY'] = df['H_BY'].shift(-1)

#Count the values again 
df['HC1_c'] = df.groupby('H_BY')['H_BY'].transform('count')

But this is a very feeble attempt. I have to run through this many, many times to rid my dataframe of all values that are less than 3, it does not work for record number 1, and I want to repeat this process over at least 4 other columns.

EDIT

And of course this code does not do anything about combining within one herd.

Any tips, tricks, or ideas on how I can do this more efficiently?

EDIT

Expected outcome with the data above.

id   herd        birth     H_BY  HYcount      death       H_DY   HYcount2
1    1345   2005-01-09    134510       3  2010-01-09    134515       3
2    1345   2010-03-05    134510       3  2015-01-09    134515       3
3    1345   2010-05-10    134510       3  2015-01-09    134515       3
4    1345   2011-06-01    134515       3  2016-01-09    134517       3
5    1345   2012-09-01    134515       3  2017-01-09    134517       3
6    1345   2015-09-13    134515       3  2017-01-09    134517       3

7    1346   2015-10-01    134615       4  2019-01-09    134620       4
8    1346   2015-10-27    134615       4  2020-01-09    134620       4
9    1346   2015-11-10    134615       4  2020-01-09    134620       4
10   1346   2016-12-10    134615       4  2021-01-09    134620       4

Upvotes: 0

Views: 194

Answers (1)

bicarlsen

Reputation: 1371

To set up this problem I removed the H_BY and H_DY columns in order to allow dynamic counting of the groups. One problem with keeping the counts in the DataFrame is that, as mentioned, you need to recalculate them every time you change the groupings, and it is easy to double count.

I then convert birth and death to datetimes in order to create the new columns for birth year and death year, by and dy.

ff = df[['herd', 'birth', 'death']].copy()

ff['birth'] = pd.to_datetime(ff['birth'])
ff['death'] = pd.to_datetime(ff['death'])
ff = ff.assign(
    by=lambda x: x.birth.dt.year,  # birth year
    dy=lambda x: x.death.dt.year,  # death year
)
herd birth death by dy
0 1345 2005-01-09 2010-01-09 2005 2010
1 1345 2010-03-05 2015-01-09 2010 2015
2 1345 2010-05-10 2015-01-09 2010 2015
3 1345 2011-06-01 2016-01-09 2011 2016
4 1345 2012-09-01 2017-01-09 2012 2017
5 1345 2015-09-13 2017-01-09 2015 2017
6 1346 2015-10-01 2019-01-09 2015 2019
7 1346 2015-10-27 2020-01-09 2015 2020
8 1346 2015-11-10 2020-01-09 2015 2020
9 1346 2016-12-10 2021-01-09 2016 2021

For the actual processing we begin by grouping by herd so that we never mix animals between herds. Within each herd we repeatedly merge an undersized group forward if possible, otherwise backward, until no group has fewer than three members. Finally we assign the groups back to the original data.

tdf = []
for herd, data in ff.groupby('herd'):
    data = data.copy()  # avoid writing into a view of ff

    # get counts per birth year and assign initial groups
    # (rename to 'n' so this also works on pandas >= 2.0, where
    # value_counts() names its result 'count' instead of 'by')
    counts = data['by'].value_counts().sort_index().rename('n').to_frame()
    counts['group'] = range(counts.shape[0])

    while True:
        gcounts = counts.groupby('group')['n'].sum()  # group counts
        change = gcounts[gcounts < 3]  # groups with too few members

        if change.empty:
            # no changes, exit
            break

        # check how to merge groups
        cgroup = change.index.min()
        groups = gcounts.index.values
        g_ind = list(groups).index(cgroup)
        if (g_ind + 1) < groups.shape[0]:
            # merge forward
            ngroup = groups[g_ind + 1]

        elif g_ind > 0:
            # merge backward
            ngroup = groups[g_ind - 1]

        else:
            # no groups to merge with
            print(f'Cannot merge herd {herd}')
            break

        counts.loc[counts['group'] == cgroup, 'group'] = ngroup

    # assign groups back to the rows
    for ind, gdata in counts.iterrows():
        data.loc[data['by'] == ind, 'group'] = gdata['group']

    tdf.append(data)

tdf = pd.concat(tdf)
herd birth death by dy group
0 1345 2005-01-09 2010-01-09 2005 2010 1.0
1 1345 2010-03-05 2015-01-09 2010 2015 1.0
2 1345 2010-05-10 2015-01-09 2010 2015 1.0
3 1345 2011-06-01 2016-01-09 2011 2016 4.0
4 1345 2012-09-01 2017-01-09 2012 2017 4.0
5 1345 2015-09-13 2017-01-09 2015 2017 4.0
6 1346 2015-10-01 2019-01-09 2015 2019 0.0
7 1346 2015-10-27 2020-01-09 2015 2020 0.0
8 1346 2015-11-10 2020-01-09 2015 2020 0.0
9 1346 2016-12-10 2021-01-09 2016 2021 0.0
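After the loop, a quick sanity check (shown here on a stand-in frame with the same herd/group shape as the result above) is that every herd/group pair now has at least three members:

```python
import pandas as pd

# stand-in for the merged result: herd plus final group label
tdf = pd.DataFrame({
    'herd':  [1345] * 6 + [1346] * 4,
    'group': [1, 1, 1, 4, 4, 4, 0, 0, 0, 0],
})

sizes = tdf.groupby(['herd', 'group']).size()
print((sizes >= 3).all())  # every merged group has >= 3 members
```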

Finally, if you still want the H_BY identifier for grouping, you can use

tdf[ 'H_BY' ] = tdf[ 'herd' ].astype( 'str' ) + tdf[ 'group' ].astype( int ).astype( str )

Upvotes: 1
