Thanh Nguyen

Reputation: 912

Efficient way to apply conditional function to data grouped by day in Pandas

I want to apply a conditional function to data grouped by day: for each column in which more than half of the values on a given day are equal to 0, set all of that column's values for that day to np.nan.

date,value1,value2
2016-01-01 09:00:00,14,14
2016-01-01 10:00:00,12,13
2016-01-01 11:00:00,11,13
2016-01-01 12:00:00,11,9
2016-01-01 13:00:00,17,21
2016-01-01 14:00:00,9,22
2016-01-01 15:00:00,10,9
2016-01-01 16:00:00,11,9
2016-01-01 17:00:00,8,8
2016-01-01 18:00:00,4,2
2016-01-01 19:00:00,5,7
2016-01-01 20:00:00,5,5
2016-01-01 21:00:00,3,4
2016-01-01 22:00:00,2,4
2016-01-01 23:00:00,2,4
2016-01-02 09:00:00,0,0
2016-01-02 10:00:00,0,0
2016-01-02 11:00:00,0,0
2016-01-02 12:00:00,0,0
2016-01-02 13:00:00,1,0
2016-01-02 14:00:00,0,0
2016-01-02 15:00:00,0,0
2016-01-02 16:00:00,0,0
2016-01-02 17:00:00,0,0
2016-01-02 18:00:00,0,0
2016-01-02 19:00:00,0,0
2016-01-02 20:00:00,1,0
2016-01-02 21:00:00,0,0
2016-01-02 22:00:00,0,0
2016-01-02 23:00:00,0,0

Desired output:

date,value1,value2
2016-01-01 09:00:00,14,14
2016-01-01 10:00:00,12,13
2016-01-01 11:00:00,11,13
2016-01-01 12:00:00,11,9
2016-01-01 13:00:00,17,21
2016-01-01 14:00:00,9,22
2016-01-01 15:00:00,10,9
2016-01-01 16:00:00,11,9
2016-01-01 17:00:00,8,8
2016-01-01 18:00:00,4,2
2016-01-01 19:00:00,5,7
2016-01-01 20:00:00,5,5
2016-01-01 21:00:00,3,4
2016-01-01 22:00:00,2,4
2016-01-01 23:00:00,2,4
2016-01-02 09:00:00,null,null
2016-01-02 10:00:00,null,null
2016-01-02 11:00:00,null,null
2016-01-02 12:00:00,null,null
2016-01-02 13:00:00,null,null
2016-01-02 14:00:00,null,null
2016-01-02 15:00:00,null,null
2016-01-02 16:00:00,null,null
2016-01-02 17:00:00,null,null
2016-01-02 18:00:00,null,null
2016-01-02 19:00:00,null,null
2016-01-02 20:00:00,null,null
2016-01-02 21:00:00,null,null
2016-01-02 22:00:00,null,null
2016-01-02 23:00:00,null,null

I have read the question "pandas apply function to data grouped by day" and tried to follow it:

df_mode = df.groupby(df.index.date).apply(lambda x: mode(x)[0])

This gives me the most frequent value for each day in each column. However, I don't know how to do the next step (setting all values in the column for that day to np.nan).

Also, is there a more efficient way than using apply in this case?

Thank you

Upvotes: 2

Views: 52

Answers (1)

jezrael

Reputation: 863156

Use GroupBy.transform: compare the values with 0, take the per-day mean to get the fraction of zeros in each column, and then use DataFrame.mask to set missing values wherever that fraction is greater than 0.5:

df = df.mask(df.eq(0).groupby(df.index.date).transform('mean').gt(.5))
print(df)
                     value1  value2
date                               
2016-01-01 09:00:00    14.0    14.0
2016-01-01 10:00:00    12.0    13.0
2016-01-01 11:00:00    11.0    13.0
2016-01-01 12:00:00    11.0     9.0
2016-01-01 13:00:00    17.0    21.0
2016-01-01 14:00:00     9.0    22.0
2016-01-01 15:00:00    10.0     9.0
2016-01-01 16:00:00    11.0     9.0
2016-01-01 17:00:00     8.0     8.0
2016-01-01 18:00:00     4.0     2.0
2016-01-01 19:00:00     5.0     7.0
2016-01-01 20:00:00     5.0     5.0
2016-01-01 21:00:00     3.0     4.0
2016-01-01 22:00:00     2.0     4.0
2016-01-01 23:00:00     2.0     4.0
2016-01-02 09:00:00     NaN     NaN
2016-01-02 10:00:00     NaN     NaN
2016-01-02 11:00:00     NaN     NaN
2016-01-02 12:00:00     NaN     NaN
2016-01-02 13:00:00     NaN     NaN
2016-01-02 14:00:00     NaN     NaN
2016-01-02 15:00:00     NaN     NaN
2016-01-02 16:00:00     NaN     NaN
2016-01-02 17:00:00     NaN     NaN
2016-01-02 18:00:00     NaN     NaN
2016-01-02 19:00:00     NaN     NaN
2016-01-02 20:00:00     NaN     NaN
2016-01-02 21:00:00     NaN     NaN
2016-01-02 22:00:00     NaN     NaN
2016-01-02 23:00:00     NaN     NaN
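
To make the one-liner easier to follow, here is a minimal sketch that splits it into named steps. The six-row frame and the names is_zero, zero_share, to_blank and result are made up for illustration and are not taken from the question's data; the frame only mirrors its layout (a DatetimeIndex named date with columns value1 and value2).

```python
import pandas as pd

# Hypothetical sample: three rows per day, same layout as the question's data.
index = pd.DatetimeIndex(
    [
        "2016-01-01 09:00:00", "2016-01-01 10:00:00", "2016-01-01 11:00:00",
        "2016-01-02 09:00:00", "2016-01-02 10:00:00", "2016-01-02 11:00:00",
    ],
    name="date",
)
df = pd.DataFrame(
    {"value1": [14, 12, 0, 0, 1, 0], "value2": [14, 13, 13, 0, 0, 0]},
    index=index,
)

# Step 1: boolean frame, True where a value equals 0.
is_zero = df.eq(0)

# Step 2: per-day mean of the booleans, i.e. the fraction of zeros in each
# column on each day, broadcast back to the original row shape by transform.
zero_share = is_zero.groupby(df.index.date).transform("mean")

# Step 3: True for every row of a (day, column) pair where zeros are
# strictly more than half of the values.
to_blank = zero_share.gt(0.5)

# Step 4: replace the flagged cells with NaN; other cells are kept.
result = df.mask(to_blank)
print(result)
```

Because the comparison is gt(.5), a column in which exactly half of a day's values are 0 is left unchanged, which matches the "more than half" wording in the question. In this sample, the single 0 in value1 on 2016-01-01 is kept, since only one of that day's three values is 0.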

Upvotes: 4
