Reputation: 8009
In the dataframe below, I would like to eliminate the duplicate cid values so that the output from df.groupby('date').cid.size() matches the output from df.groupby('date').cid.nunique().
I have looked at this post but it does not seem to have a solid solution to the problem.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df')
df.groupby('date')['cid'].agg(['size', 'nunique'])
size nunique
date
2005 7 3
2006 237 10
2007 3610 227
2008 1318 52
2009 2664 142
2010 997 57
2011 6390 219
2012 2904 99
2013 7875 238
2014 3979 146
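(For readers without access to the CSV, a minimal sketch with hypothetical toy data reproduces the same mismatch: a cid repeated within a date inflates size but not nunique.)
import pandas as pd

# hypothetical toy data: cid 101 appears twice in 2005
toy = pd.DataFrame({'date': [2005, 2005, 2005, 2006],
                    'cid':  [101, 101, 102, 103]})
toy.groupby('date')['cid'].agg(['size', 'nunique'])
#       size  nunique
# date
# 2005     3        2
# 2006     1        1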
Things I tried:
df.groupby([df['date']]).drop_duplicates(cols='cid')
gives this error: AttributeError: Cannot access callable attribute 'drop_duplicates' of 'DataFrameGroupBy' objects, try using the 'apply' method
df.groupby(('date').drop_duplicates('cid'))
gives this error: AttributeError: 'str' object has no attribute 'drop_duplicates'
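(For reference, the first error message points at apply: routing drop_duplicates through apply per group does work, as in this sketch, which is equivalent to dropping duplicates on both columns at once.)
df.groupby('date', group_keys=False).apply(lambda g: g.drop_duplicates('cid'))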
Upvotes: 29
Views: 58812
Reputation: 23371
groupby.head(1)
The relevant groupby method to drop duplicates in each group is groupby.head(1). Note that it is important to pass 1 to select the first row of each date-cid pair (the default would keep the first five rows of each group).
df1 = df.groupby(['date', 'cid']).head(1)
duplicated() is more flexible
Another method is to use duplicated() to create a boolean mask and filter.
df3 = df[~df.duplicated(['date', 'cid'])]
An advantage of this method over drop_duplicates() is that it can be chained with other boolean masks to filter the dataframe more flexibly. For example, to select the unique cids in Nevada for each date, use:
df_nv = df[df['state'].eq('NV') & ~df.duplicated(['date', 'cid'])]
groupby.sample(1)
Another method to select a unique row from each group is to use groupby.sample(). Unlike the previous methods, it selects a row from each group randomly, whereas the others keep only the first row.
df4 = df.groupby(['date', 'cid']).sample(n=1)
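(If you need the random draw to be reproducible, sample accepts a random_state argument; a minimal sketch, where the seed value is arbitrary:)
df4 = df.groupby(['date', 'cid']).sample(n=1, random_state=42)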
You can verify that df1, df2 (ayhan's output below) and df3 all produce the same output, and that df4 produces an output where size and nunique of cid match for each date (as required in the OP). In short, the following returns True.
w, x, y, z = [d.groupby('date')['cid'].agg(['size', 'nunique']) for d in (df1, df2, df3, df4)]
w.equals(x) and w.equals(y) and w.equals(z) # True
and w, x, y, z all look like the following:
size nunique
date
2005 7 3
2006 237 10
2007 3610 227
2008 1318 52
2009 2664 142
2010 997 57
2011 6390 219
2012 2904 99
2013 7875 238
2014 3979 146
Upvotes: 5
Reputation:
You don't need groupby to drop duplicates based on a few columns; you can specify a subset instead:
df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]:
date
2005 3
2006 10
2007 227
2008 52
2009 142
2010 57
2011 219
2012 99
2013 238
2014 146
dtype: int64
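As a side note, drop_duplicates also takes a keep argument ('first' by default); a quick sketch of keeping the last occurrence of each date-cid pair instead:
df2_last = df.drop_duplicates(['date', 'cid'], keep='last')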
Upvotes: 54