The Great

Reputation: 7733

How to drop records based on number of unique days using pandas?

I have a dataframe as shown below:

import pandas as pd

df = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,2,2,2,2,2],
    'time_1': ['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-03 13:39:00','2173-07-04 11:30:00','2173-04-04 16:00:00','2173-04-09 22:00:00','2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
    'val': [5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month

What I would like to do is drop all records for subjects that do not have 4 or more unique days.

In my sample dataframe, subject_id = 1 has only 3 unique days (3, 4 and 5), so I would like to drop subject_id = 1 completely. subject_id = 2, however, has 5 unique days (4, 9, 11, 13 and 14), so it should be kept. Please note that the date values include a timestamp, so I extract the day from each datetime field and check for unique values.

This is what I tried

df.groupby(['subject_id','day']).transform('size')>4 # doesn't work
df[df.groupby(['subject_id','day'])['subject_id'].transform('size')>=4] # doesn't produce expected output

I expect my output to contain only the rows for subject_id = 2.

Upvotes: 2

Views: 37

Answers (1)

jezrael

Reputation: 863166

Change your function from size to DataFrameGroupBy.nunique, grouping only by the subject_id column:

df = df[df.groupby('subject_id')['day'].transform('nunique')>=4] 
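
To see why this works where size does not, it can help to look at the intermediate values. The following is a minimal sketch on the sample data from the question (only the columns that matter); the names rows_per_pair, days_per_subject and kept are introduced here purely for illustration.

```python
import pandas as pd

# Sample data from the question, reduced to the relevant columns
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00',
               '2173-04-05 12:59:00', '2173-05-04 13:14:00',
               '2173-05-05 13:37:00', '2173-07-03 13:39:00',
               '2173-07-04 11:30:00', '2173-04-04 16:00:00',
               '2173-04-09 22:00:00', '2173-04-11 04:00:00',
               '2173-04-13 04:30:00', '2173-04-14 08:00:00'],
})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day

# size counts ROWS per (subject_id, day) pair, so it never says
# how many different days a subject has
rows_per_pair = df.groupby(['subject_id', 'day'])['subject_id'].transform('size')

# nunique counts DISTINCT days per subject and transform broadcasts
# that count back onto every row of the subject
days_per_subject = df.groupby('subject_id')['day'].transform('nunique')

# Boolean mask keeps only rows of subjects with at least 4 distinct days
kept = df[days_per_subject >= 4]

print(df.assign(rows_per_pair=rows_per_pair, days_per_subject=days_per_subject))
print(kept)
```

In this sketch every row of subject_id = 1 gets a days_per_subject value of 3 and every row of subject_id = 2 gets 5, so the mask selects only the second subject. The rows_per_pair values never reach 4 on this data, which is why the size-based attempt returns nothing useful.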

Alternatively, you can use GroupBy.filter, but this is likely to be slower on a large dataframe or one with many unique groups:

df = df.groupby('subject_id').filter(lambda x: x['day'].nunique()>=4)

print (df)
    subject_id              time_1  val  day  month
7            2 2173-04-04 16:00:00    5    4      4
8            2 2173-04-09 22:00:00    8    9      4
9            2 2173-04-11 04:00:00    3   11      4
10           2 2173-04-13 04:30:00    4   13      4
11           2 2173-04-14 08:00:00    6   14      4
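
As a quick sanity check, the sketch below runs both approaches on the question's sample data and compares the results. The variable names via_transform and via_filter are illustrative only, and the comparison says nothing about relative speed.

```python
import pandas as pd

# Sample data from the question, reduced to the relevant columns
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00',
               '2173-04-05 12:59:00', '2173-05-04 13:14:00',
               '2173-05-05 13:37:00', '2173-07-03 13:39:00',
               '2173-07-04 11:30:00', '2173-04-04 16:00:00',
               '2173-04-09 22:00:00', '2173-04-11 04:00:00',
               '2173-04-13 04:30:00', '2173-04-14 08:00:00'],
})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day

# Approach 1: boolean mask built from a per-row distinct-day count
via_transform = df[df.groupby('subject_id')['day'].transform('nunique') >= 4]

# Approach 2: keep or drop each group as a whole
via_filter = df.groupby('subject_id').filter(lambda x: x['day'].nunique() >= 4)

# Both should keep the same rows, in the original order
print(via_transform.index.tolist())
print(via_filter.index.tolist())
```

Both approaches keep the original index labels, so the result can still be traced back to rows of the input dataframe.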

Upvotes: 3
