Reputation: 87
I have a large dataframe (100,000 rows) with many columns. These are the relevant columns for my question:
id  herd  birth       H_BY    HYcount  death       H_DY    HYcount2
1   1345  2005-01-09  134505  1        2010-01-09  134510  1
2   1345  2010-03-05  134510  2        2015-01-09  134515  2
3   1345  2010-05-10  134510  2        2015-01-09  134515  2
4   1345  2011-06-01  134511  1        2016-01-09  134516  1
5   1345  2012-09-01  134512  1        2017-01-09  134517  2
6   1345  2015-09-13  134515  1        2017-01-09  134517  2
7   1346  2015-10-01  134615  3        2019-01-09  134619  1
8   1346  2015-10-27  134615  3        2020-01-09  134620  2
9   1346  2015-11-10  134615  3        2020-01-09  134620  2
10  1346  2016-12-10  134616  1        2021-01-09  134621  1
I am creating herd-year fixed effects. I have already combined the herd and birth/death columns into herd+birth-year and herd+death-year columns and counted how many times each fixed effect appears in the dataframe, as can be seen above.
However, I now want to check my whole dataframe for HYcount and HYcount2 values that are less than 3, i.e. I don't want any HY group with only 1 or 2 members.
I would like to run through the dataframe and merge those HY groups that have only 1 or 2 members into other groups, either below or above.
EDIT
I also want to combine HY groups WITHIN EACH HERD only!
That is, I don't want to move a member of one herd into another herd via the herd-year variable.
Here is what I've tried with the birth year fixed effect.
# Sort the df by the relevant value
df = df.sort_values(by=['H_BY'])

df.loc[df['HYcount'] < 3, 'H_BY'] = df['H_BY'].shift(-1)

# Count the values again
df['HC1_c'] = df.groupby('H_BY')['H_BY'].transform('count')
But this is a very feeble attempt. I have to run through this many, many times to rid my dataframe of all counts below 3, it does not work for record number 1, and I want to repeat the process over at least 4 other columns.
EDIT
And of course this code does nothing about combining within a single herd.
Any tips, tricks or ideas on how I can do this more efficiently?
EDIT
Expected outcome with the data above:
id  herd  birth       H_BY    HYcount  death       H_DY    HYcount2
1   1345  2005-01-09  134510  3        2010-01-09  134515  3
2   1345  2010-03-05  134510  3        2015-01-09  134515  3
3   1345  2010-05-10  134510  3        2015-01-09  134515  3
4   1345  2011-06-01  134515  3        2016-01-09  134517  3
5   1345  2012-09-01  134515  3        2017-01-09  134517  3
6   1345  2015-09-13  134515  3        2017-01-09  134517  3
7   1346  2015-10-01  134615  4        2019-01-09  134620  4
8   1346  2015-10-27  134615  4        2020-01-09  134620  4
9   1346  2015-11-10  134615  4        2020-01-09  134620  4
10  1346  2016-12-10  134615  4        2021-01-09  134620  4
Upvotes: 0
Views: 194
Reputation: 1371
To set up this problem I removed the H_BY and H_DY columns in order to allow dynamic counting of the groups. One of the problems with having the counts in the DataFrame is that, as mentioned, you need to recalculate them every time you change the groupings, and it is also easy to double count.
I then convert birth and death to datetimes in order to create the new columns for birth year and death year, by and dy.
import pandas as pd

ff = df[['herd', 'birth', 'death']].copy()
ff['birth'] = pd.to_datetime(ff['birth'])
ff['death'] = pd.to_datetime(ff['death'])
ff = ff.assign(
    by=lambda x: x.birth.apply(lambda y: y.year),
    dy=lambda x: x.death.apply(lambda y: y.year)
)
| | herd | birth | death | by | dy |
|---|---|---|---|---|---|
| 0 | 1345 | 2005-01-09 | 2010-01-09 | 2005 | 2010 |
| 1 | 1345 | 2010-03-05 | 2015-01-09 | 2010 | 2015 |
| 2 | 1345 | 2010-05-10 | 2015-01-09 | 2010 | 2015 |
| 3 | 1345 | 2011-06-01 | 2016-01-09 | 2011 | 2016 |
| 4 | 1345 | 2012-09-01 | 2017-01-09 | 2012 | 2017 |
| 5 | 1345 | 2015-09-13 | 2017-01-09 | 2015 | 2017 |
| 6 | 1346 | 2015-10-01 | 2019-01-09 | 2015 | 2019 |
| 7 | 1346 | 2015-10-27 | 2020-01-09 | 2015 | 2020 |
| 8 | 1346 | 2015-11-10 | 2020-01-09 | 2015 | 2020 |
| 9 | 1346 | 2016-12-10 | 2021-01-09 | 2016 | 2021 |
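As an aside (not part of the approach above), the same year columns can be computed with the vectorised .dt accessor rather than a row-wise apply:

# equivalent, assuming birth/death have already been converted to datetimes
ff = ff.assign(by=ff['birth'].dt.year, dy=ff['death'].dt.year)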
For the actual processing we begin by grouping by herd so we don't mix between herds. We then merge each undersized group forward if possible, otherwise backwards, until no more merges are needed. Finally we assign the groups back to the original data.
tdf = []
for herd, data in ff.groupby('herd'):
    # get counts and assign initial groups
    counts = data['by'].value_counts().sort_index().to_frame()
    counts['group'] = range(counts.shape[0])

    while True:
        gcounts = counts.groupby('group').sum()['by']  # group counts
        change = gcounts[gcounts.values < 3]           # groups with too few members
        if change.shape[0] == 0:
            # no changes needed, exit
            break

        # check how to merge groups
        cgroup = change.index.min()
        groups = gcounts.index.values
        g_ind = list(groups).index(cgroup)
        if (g_ind + 1) < groups.shape[0]:
            # merge forward
            ngroup = groups[g_ind + 1]
        elif g_ind > 0:
            # merge backward
            ngroup = groups[g_ind - 1]
        else:
            # no groups to merge with
            print(f'Can not merge herd {herd}')
            break

        counts.loc[counts['group'] == cgroup, 'group'] = ngroup

    # assign groups back to the rows
    for ind, gdata in counts.iterrows():
        data.loc[data['by'] == ind, 'group'] = gdata['group']
    tdf.append(data)

tdf = pd.concat(tdf)
| | herd | birth | death | by | dy | group |
|---|---|---|---|---|---|---|
| 0 | 1345 | 2005-01-09 | 2010-01-09 | 2005 | 2010 | 1.0 |
| 1 | 1345 | 2010-03-05 | 2015-01-09 | 2010 | 2015 | 1.0 |
| 2 | 1345 | 2010-05-10 | 2015-01-09 | 2010 | 2015 | 1.0 |
| 3 | 1345 | 2011-06-01 | 2016-01-09 | 2011 | 2016 | 4.0 |
| 4 | 1345 | 2012-09-01 | 2017-01-09 | 2012 | 2017 | 4.0 |
| 5 | 1345 | 2015-09-13 | 2017-01-09 | 2015 | 2017 | 4.0 |
| 6 | 1346 | 2015-10-01 | 2019-01-09 | 2015 | 2019 | 0.0 |
| 7 | 1346 | 2015-10-27 | 2020-01-09 | 2015 | 2020 | 0.0 |
| 8 | 1346 | 2015-11-10 | 2020-01-09 | 2015 | 2020 | 0.0 |
| 9 | 1346 | 2016-12-10 | 2021-01-09 | 2016 | 2021 | 0.0 |
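At this point every group within a herd should have at least three members (the group labels are floats because the loc assignment created a new float column). A quick sanity check:

# every (herd, group) pair should now have size >= 3
print(tdf.groupby(['herd', 'group']).size())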
Finally, if you still want the H_BY identifier for grouping, you can use

tdf['H_BY'] = tdf['herd'].astype(str) + tdf['group'].astype(int).astype(str)
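Since the question needs the same treatment for the death year as well, here is a sketch that wraps the loop above in a reusable helper so it can be run for both by and dy. The name merge_small_groups is hypothetical, and the counts column is renamed to 'n' so the code also works on pandas >= 2.0, where value_counts names its output 'count':

def merge_small_groups(frame, col, min_count=3):
    """Per herd, merge adjacent year groups in `col` until each group
    has at least `min_count` rows; adds a `{col}_group` column.
    Hypothetical helper generalising the loop above."""
    out = []
    for herd, data in frame.groupby('herd'):
        data = data.copy()
        # counts per year; rename to 'n' to be robust across pandas versions
        counts = data[col].value_counts().sort_index().rename('n').to_frame()
        counts['group'] = range(len(counts))
        while True:
            gcounts = counts.groupby('group')['n'].sum()
            change = gcounts[gcounts < min_count]
            if change.empty:
                break
            cgroup = change.index.min()
            groups = list(gcounts.index)
            g_ind = groups.index(cgroup)
            if g_ind + 1 < len(groups):
                ngroup = groups[g_ind + 1]   # merge forward
            elif g_ind > 0:
                ngroup = groups[g_ind - 1]   # merge backward
            else:
                print(f'Can not merge herd {herd}')
                break
            counts.loc[counts['group'] == cgroup, 'group'] = ngroup
        # map each year to its final group
        for year, row in counts.iterrows():
            data.loc[data[col] == year, f'{col}_group'] = row['group']
        out.append(data)
    return pd.concat(out)

# apply to both year columns, then build both identifiers
tdf = merge_small_groups(ff, 'by')
tdf = merge_small_groups(tdf, 'dy')
tdf['H_BY'] = tdf['herd'].astype(str) + tdf['by_group'].astype(int).astype(str)
tdf['H_DY'] = tdf['herd'].astype(str) + tdf['dy_group'].astype(int).astype(str)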
Upvotes: 1