bbartling
bbartling

Reputation: 3494

pandas filter out entire days of dataset based on value

If I have some made up data:

import pandas as pd 
import numpy as np 
from numpy.random import randint


np.random.seed(10)  # added for reproductibility                                                                                                                                                                 

import numpy as np
import pandas as pd
np.random.seed(11)

rows,cols = 50000,2
data = np.random.rand(rows,cols) 
tidx = pd.date_range('2019-01-01', periods=rows, freq='T') 
df = pd.DataFrame(data, columns=['Temperature','Value'], index=tidx)

mediany= df.Value.median()

How could I filter the df based on filtering out entire days of data? For example for each day of data, if the average of Value is less than the entire average Value for entire dataset (mediany), discard this day..

I think I can filter out all data that is less than the average but this doesn't preserve complete days of data which I need.

df = df[(df[['Value']] >= mediany).all(axis=1)]

df

Any tips greatly appreciated hopefully this makes sense!

Upvotes: 0

Views: 51

Answers (2)

Quang Hoang
Quang Hoang

Reputation: 150735

You can use groupby().transform:

s = (df['Value'].ge(mediany)            # compare to mediany
        .groupby(df.index.normalize())  # groupby day 
        .transform('any')               # any time with value larger than median
    )

df[s]

P/S: the entire average Value for entire dataset (mediany), median is not average :-)

Upvotes: 1

tuhinsharma121
tuhinsharma121

Reputation: 196

You can try the following code. I have added comments:-

import pandas as pd 
import numpy as np 
from numpy.random import randint


np.random.seed(10)  # added for reproductibility                                                                                                                                                                 

import numpy as np
import pandas as pd
np.random.seed(11)

rows,cols = 50000,2
data = np.random.rand(rows,cols) 
tidx = pd.date_range('2019-01-01', periods=rows, freq='T') 
df = pd.DataFrame(data, columns=['Temperature','Value'], index=tidx)
df["Date"] = df.index

#calculate day wise mean

def calculate_mean(x):
    return np.mean(x)

df_day_mean = df.groupby(df.index).agg({
        'Value': [('Value', calculate_mean)]
    })
df_day_mean.columns = df_day_mean.columns.droplevel()


#calculate whole mean

mean = df.Value.mean()

#get the days where average value is less than whole mean

df_to_discard = df_day_mean[df_day_mean["Value"]<mean]
index_to_discard = df_to_discard.index

# drop these indices from the original df

filtered_df = df.drop(index_to_discard)

Upvotes: 0

Related Questions