Reputation: 873
I have a dataset that contains columns of year, julian day, hour and temperature. I have grouped the data by year and day and now want to perform an operation on the temperature data IF each day contains 24 hours worth of data. Then, I want to create a Dataframe with year, julian day, max temperature and min temperature. However, I'm not sure of the syntax to make sure this condition is met. Any help would be appreciated. My code is below:
df = pd.read_table(data,skiprows=1,sep='\t',usecols=(0,3,4,6),names=['year','jday','hour','temp'],na_values=-999.9)
g = df.groupby(['year','jday'])
if #the grouped year and day has 24 hours worth of data
maxt = g.aggregate({'temp':np.max})
mint = g.aggregate({'temp':np.min})
else:
continue
And some sample data (goes from 1942-2015):
Year Month Day Julian Hour Wind TempC DewC Pressure RH
1942 9 24 267 9 2.1 18.5 15.2 1014.2 81.0
1942 9 24 267 10 2.1 23.5 14.6 1014.6 57.0
1942 9 24 267 11 3.6 25.2 12.4 1014.2 45.0
1942 9 24 267 12 3.6 26.8 11.9 1014.2 40.0
1942 9 24 267 13 2.6 27.4 11.9 1014.2 38.0
1942 9 24 267 14 2.1 28.0 11.3 1013.5 35.0
1942 9 24 267 15 4.1 29.1 9.1 1013.5 29.0
1942 9 24 267 16 4.1 29.1 10.7 1013.5 32.0
1942 9 24 267 17 4.6 29.1 13.0 1013.9 37.0
1942 9 24 267 18 3.6 25.7 12.4 1015.2 44.0
1942 9 24 267 19 0.0 23.0 16.3 1015.2 66.0
1942 9 24 267 20 2.6 22.4 15.7 1015.9 66.0
1942 9 24 267 21 2.1 20.2 16.3 1016.3 78.0
1942 9 24 267 22 3.1 20.2 14.6 1016.9 70.0
1942 9 24 267 23 2.6 19.6 15.2 1017.6 76.0
1942 9 25 268 0 3.1 18.5 13.5 1018.3 73.0
1942 9 25 268 1 2.6 16.9 13.0 1018.3 78.0
1942 9 25 268 2 4.1 15.7 5.2 1021.0 50.0
1942 9 25 268 3 4.1 15.2 4.1 1020.7 47.0
1942 9 25 268 4 3.1 14.1 5.8 1021.3 57.0
1942 9 25 268 5 3.1 13.0 5.8 1021.3 62.0
1942 9 25 268 6 2.1 13.0 5.2 1022.4 59.0
1942 9 25 268 7 2.1 12.4 1.9 1022.4 49.0
1942 9 25 268 8 3.6 13.5 5.8 1024.7 60.0
1942 9 25 268 9 4.6 15.7 3.5 1025.1 44.0
1942 9 25 268 10 4.1 17.4 1.3 1025.4 34.0
1942 9 25 268 11 2.6 18.5 3.0 1025.4 36.0
1942 9 25 268 12 2.1 19.1 0.8 1025.1 29.0
1942 9 25 268 13 2.6 19.6 2.4 1024.7 32.0
1942 9 25 268 14 4.1 20.7 4.6 1023.4 35.0
1942 9 25 268 15 3.6 21.3 4.1 1023.7 32.0
1942 9 25 268 16 1.5 21.3 4.6 1023.4 34.0
1942 9 25 268 17 5.1 20.7 7.4 1023.4 42.0
1942 9 25 268 18 5.1 19.1 8.5 1023.0 50.0
1942 9 25 268 19 3.6 18.0 9.6 1022.7 58.0
1942 9 25 268 20 3.1 16.3 9.6 1023.0 65.0
1942 9 25 268 21 1.5 15.2 11.3 1023.0 78.0
1942 9 25 268 22 1.5 14.6 11.3 1023.0 81.0
1942 9 25 268 23 2.1 14.1 10.7 1024.0 80.0
Upvotes: 4
Views: 5991
Reputation: 5372
I would start by grouping on day alone, and then iterate over the groups, checking the unique hours in each group. You can use set
to find the unique hours for each measurement day, and compare with a full days worth of hours {0,1,2,3,...23}
a_full_day = set(range(24))
#data_out = {}
gb = df.groupby(['jday']) # only group by day
for day, inds in gb.groups.iteritems():
if set(df.ix[inds, 'hour']) == a_full_day:
maxt = df.ix[inds, 'temp'].max()
#data_out[day] = {}
#data_out[day]['maxt'] = maxt
# etc
I added some commented lines suggesting how you might want to store the output
Upvotes: 0
Reputation: 227
I assume that there is no ['year', 'julian']
group which contains non-integer hours so we can just use the length of the group as the condition.
import pandas as pd
def get_min_max_by_date(df_group):
if len(df_group['hour'].unique()) < 24:
new_df = pd.DataFrame()
else:
year = df_group['year'].unique()[0]
j_day = df_group['jday'].unique()[0]
min_temp = df_group['temp'].min()
max_temp = df_group['temp'].max()
new_df = pd.DataFrame({'year': [year],
'julian_day': [j_day],
'min_temp': [min_temp],
'max_temp': [max_temp]}, index=[0])
return new_df
df = pd.read_table(data,
skiprows=1,
sep='\t',
usecols=(0, 3, 4, 6),
names=['year', 'jday', 'hour', 'temp'],
na_values=-999.9)
final_df = df.groupby(['year', 'jday'],
as_index=False).apply(get_min_max_by_date)
final_df = final_df.reset_index()
I don't have time to test this right now, but this should get you started.
Upvotes: 2