Reputation: 57
I have data about three variables where I want to find the largest X values of one variable on a per day basis. Previously I wrote some code to find the hour where the max value of the day occurred, but now I want to add some options to find more max hours per day.
I've been able to find the Top X values per day for all the days, but I've gotten stuck on narrowing it down to the Top X Values from the Top X Days. I've included pictures detailing what the end result would hopefully look like.
Code
df = pd.DataFrame(
{'ID':['ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1'],
'Year':[2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018],
'Month':[6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6],
'Day':[12,12,12,12,13,13,13,13,14,14,14,14,15,15,15,15,16,16,16,16,17,17,17,17],
'Hour':[19,20,21,22,11,12,13,19,19,20,21,22,18,19,20,21,19,20,21,23,19,20,21,22],
'var_1': [0.83,0.97,0.69,0.73,0.66,0.68,0.78,0.82,1.05,1.05,1.08,0.88,0.96,0.81,0.71,0.88,1.08,1.02,0.88,0.79,0.91,0.91,0.80,0.96],
'var_2': [47.90,42.85,67.37,57.18,66.13,59.96,52.63,54.75,32.54,36.58,36.99,37.23,46.94,52.80,68.79,50.84,37.79,43.54,48.04,38.01,42.22,47.13,50.96,44.19],
'var_3': [99.02,98.10,98.99,99.12,98.78,98.90,99.09,99.20,99.22,99.11,99.18,99.24,99.00,98.90,98.87,99.07,99.06,98.86,98.92,99.32,98.93,98.97,98.99,99.21],})
# Get the top 2 var2 values each day
top_two_var2_each_day = df.groupby(['ID', 'Year', 'Month', 'Day'])['var_2'].nlargest(2)
top_two_var2_each_day = top_two_var2_each_day.reset_index()
# set level_4 index to the current index
top_two_var2_each_day = top_two_var2_each_day.set_index('level_4')
# use the index from the top_two_var2 to get the rows from df to get values of the other variables when top 2 values occured
top_2_all_vars = df[df.index.isin(top_two_var2_each_day.index)]
End Goal Result
I figure the best way would be to average the two hours to identify what days have the largest average, then go back into top_2_all_vars dataframe and grab the rows where the Days occur. I am unsure how to proceed.
mean_day = top_2_all_vars.groupby(['ID', 'Year', 'Month', 'Day'],as_index=False)['var_2'].mean()
top_2_day = mean_day.nlargest(2, 'var_2')
This is the result I am trying to find. A dataframe consisting of the Top 2 values of var_2 from each of the Top 2 days.
Code I previously used to find the single largest value of each day, but I don't know how I would make it work for more than a single max per day
# For each ID and Day, Find the Hour where the Max Amount of var_2 occurred and save the index location
df_idx = df.groupby(['ID', 'Year', 'Month', 'Day',])['var_2'].transform(max) == df['var_2']
# Now the hour has been found, store the rows in a new dataframe based on the saved index location
top_var2_hour_of_each_day = df[df_idx]
Using Groupbys may be not the best way to go about it, but I am open to anything.
Upvotes: 0
Views: 291
Reputation: 2579
This is one approach:
If your data spans multiple months its a lot harder dealing with it when the month and day are in different columns. So First I made a new column called 'Date' which just combines the month and the day.
df['Date'] = df['Month'].astype('str')+"-"+df['Day'].astype('str')
Next we need the top two values of var_2 per day, and then average them. So we can create a really simple function to find exactly that.
def topTwoMean(series):
top = series.sort_values(ascending = False).iloc[0]
second = series.sort_values(ascending = False).iloc[1]
return (top+second)/2
We then use our function, sort by the average of var_2 to get the highest 2 days, then save the dates to a list.
maxDates = df.groupby('Date').agg({'var_2': [topTwoMean]})\
.sort_values(by = ('var_2', 'topTwoMean'), ascending = False)\
.reset_index()['Date']\
.head(2)\
.to_list()
Finally we filter by the dates chosen above, then find the highest two of var_2 on those days.
df[df['Date'].isin(maxDates)]\
.groupby('Date')\
.apply(lambda x: x.sort_values('var_2', ascending = False).head(2))\
.reset_index(drop = True)
ID Year Month Day Hour var_1 var_2 var_3 Date
0 ID_1 2018 6 12 21 0.69 67.37 98.99 6-12
1 ID_1 2018 6 12 22 0.73 57.18 99.12 6-12
2 ID_1 2018 6 13 11 0.66 66.13 98.78 6-13
3 ID_1 2018 6 13 12 0.68 59.96 98.90 6-13
Upvotes: 1