Reputation: 3494
If I have some made up data:
import pandas as pd
import numpy as np
from numpy.random import randint
np.random.seed(10) # added for reproductibility
import numpy as np
import pandas as pd
np.random.seed(11)
rows,cols = 50000,2
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='T')
df = pd.DataFrame(data, columns=['Temperature','Value'], index=tidx)
mediany= df.Value.median()
How could I filter the df
based on filtering out entire days of data? For example for each day of data, if the average of Value
is less than the entire average Value
for entire dataset (mediany
), discard this day..
I think I can filter out all data that is less than the average but this doesn't preserve complete days of data which I need.
df = df[(df[['Value']] >= mediany).all(axis=1)]
df
Any tips greatly appreciated hopefully this makes sense!
Upvotes: 0
Views: 51
Reputation: 150735
You can use groupby().transform
:
s = (df['Value'].ge(mediany) # compare to mediany
.groupby(df.index.normalize()) # groupby day
.transform('any') # any time with value larger than median
)
df[s]
P/S: the entire average Value for entire dataset (mediany), median is not average :-)
Upvotes: 1
Reputation: 196
You can try the following code. I have added comments:-
import pandas as pd
import numpy as np
from numpy.random import randint
np.random.seed(10) # added for reproductibility
import numpy as np
import pandas as pd
np.random.seed(11)
rows,cols = 50000,2
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='T')
df = pd.DataFrame(data, columns=['Temperature','Value'], index=tidx)
df["Date"] = df.index
#calculate day wise mean
def calculate_mean(x):
return np.mean(x)
df_day_mean = df.groupby(df.index).agg({
'Value': [('Value', calculate_mean)]
})
df_day_mean.columns = df_day_mean.columns.droplevel()
#calculate whole mean
mean = df.Value.mean()
#get the days where average value is less than whole mean
df_to_discard = df_day_mean[df_day_mean["Value"]<mean]
index_to_discard = df_to_discard.index
# drop these indices from the original df
filtered_df = df.drop(index_to_discard)
Upvotes: 0