pietro

Reputation: 23

Getting the daily maximum gives strange results

I have a data set of temperatures recorded every 15 minutes. The file looks like this (~50000 rows):

02/01/2016;05:15:00;10.800
02/01/2016;05:30:00;10.300
02/01/2016;05:45:00;9.200
02/01/2016;06:00:00;9.200
02/01/2016;06:15:00;8.900
02/01/2016;06:30:00;8.900
02/01/2016;06:45:00;9.400
02/01/2016;07:00:00;9.000
02/01/2016;07:15:00;9.200
02/01/2016;07:30:00;11.100
02/01/2016;07:45:00;13.000
02/01/2016;08:00:00;14.400
02/01/2016;08:15:00;15.600

My goal is to calculate the daily min/max, so here is my code to do it:

# load dataframe
with open(intraday_file_path, "r") as fl:
    df_intraday = pd.read_csv(fl,
                              **load_args
                              )

df_daily = df_intraday.groupby(df_intraday[0])
df_daily = df_daily.aggregate({0:np.max})

df_daily.index.names = [0]
df_daily.reset_index(level=[0], inplace=True)

df_daily.sort_values(by=[0], inplace=True)
df_daily.drop_duplicates(subset=0,
                         keep="first",
                         inplace=True)

daily_name = "daily_%s" %(intraday_file_name,)
daily_path = os.getcwd() + "\\" + daily_name

df_daily = df_daily[[0, 1]]

with open(daily_path, "w") as fl:
    df_daily.to_csv(fl,
                    **save_args
                    )

But the output is strange as soon as I have a temperature below 10°C. For example, for 02/01/2016 the code outputs 9.4°C?!

Any ideas?

Upvotes: 2

Views: 69

Answers (2)

su79eu7k

Reputation: 7316

FYI, there is also the resample option. It is a handy tool for time series.

import pandas as pd


df = pd.read_csv('sample.txt', header=None, sep=';')
df.columns = ['date', 'time', 'temp']

df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
df['temp'] = df['temp'].astype(float)  # dtypes should be float as jezrael mentioned.

df = df.set_index('datetime')[['temp']]
df = pd.concat([df.resample('1D').min(),
                df.resample('1D').max()], axis=1)

df.columns = ['temp_min', 'temp_max']
print(df)

result

            temp_min  temp_max
datetime                      
2016-02-01       8.9      15.6
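
Both aggregates can also come from a single resample call with agg; a minimal sketch with a small stand-in series (the names temps and daily are just for illustration):

import pandas as pd

# stand-in for the datetime-indexed 'temp' column built above
temps = pd.Series([10.8, 9.4, 15.6],
                  index=pd.to_datetime(['2016-02-01 05:15:00',
                                        '2016-02-01 06:45:00',
                                        '2016-02-01 08:15:00']))

# one resample call, both aggregates at once
daily = temps.resample('1D').agg(['min', 'max'])
daily.columns = ['temp_min', 'temp_max']
print(daily)

            temp_min  temp_max
2016-02-01       9.4      15.6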

Upvotes: 0

jezrael

Reputation: 862671

The problem is that the data in your last column are not numeric.

The solution is to use to_numeric to convert the bad values to NaN.

Also, for easier work with the DataFrame, you can pass the names parameter to read_csv to set the column names:

import pandas as pd
from io import StringIO
temp=u"""02/01/2016;05:15:00;10.800
02/01/2016;05:30:00;10.300
02/01/2016;05:45:00;9.200
02/01/2016;06:00:00;9.200
02/01/2016;06:15:00;8.900
02/01/2016;06:30:00;8.900
02/01/2016;06:45:00;9.400
03/01/2016;07:00:00;9.000
03/01/2016;07:15:00;9.200
03/01/2016;07:30:00;11.100
04/01/2016;07:45:00;13.000
04/01/2016;08:00:00;14.400
04/01/2016;08:15:00;a"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df_intraday = pd.read_csv(StringIO(temp), 
                          sep=";", 
                          names=['date','time','val'], 
                          parse_dates=[0])
print (df_intraday)
         date      time     val
0  2016-02-01  05:15:00  10.800
1  2016-02-01  05:30:00  10.300
2  2016-02-01  05:45:00   9.200
3  2016-02-01  06:00:00   9.200
4  2016-02-01  06:15:00   8.900
5  2016-02-01  06:30:00   8.900
6  2016-02-01  06:45:00   9.400
7  2016-03-01  07:00:00   9.000
8  2016-03-01  07:15:00   9.200
9  2016-03-01  07:30:00  11.100
10 2016-04-01  07:45:00  13.000
11 2016-04-01  08:00:00  14.400
12 2016-04-01  08:15:00       a

df_daily = df_intraday.groupby('date', as_index=False)['val'].max()
print (df_daily)
        date    val
0 2016-02-01  9.400
1 2016-03-01  9.200
2 2016-04-01      a
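
The 9.400 for 02/01/2016 is a string maximum. String comparison is lexicographic, character by character, so '9.400' sorts after '15.600'; a quick check:

#string max compares characters, not numbers, so '9...' beats '1...'
print (max(['10.800', '9.400', '15.600']))
9.400

print ('9.400' > '15.600')
True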

#check dtypes - object is obviously string
print (df_intraday['val'].dtypes)
object

df_intraday['val'] = pd.to_numeric(df_intraday['val'], errors='coerce')
print (df_intraday)
         date      time   val
0  2016-02-01  05:15:00  10.8
1  2016-02-01  05:30:00  10.3
2  2016-02-01  05:45:00   9.2
3  2016-02-01  06:00:00   9.2
4  2016-02-01  06:15:00   8.9
5  2016-02-01  06:30:00   8.9
6  2016-02-01  06:45:00   9.4
7  2016-03-01  07:00:00   9.0
8  2016-03-01  07:15:00   9.2
9  2016-03-01  07:30:00  11.1
10 2016-04-01  07:45:00  13.0
11 2016-04-01  08:00:00  14.4
12 2016-04-01  08:15:00   NaN

print (df_intraday['val'].dtypes)
float64

#simpler way of aggregating the max
df_daily = df_intraday.groupby('date', as_index=False)['val'].max()
print (df_daily)
        date   val
0 2016-02-01  10.8
1 2016-03-01  11.1
2 2016-04-01  14.4
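
Since the goal was the daily min and max, both can be aggregated in one pass once val is numeric; a minimal sketch reusing df_intraday from above (df_minmax is just an illustrative name):

#daily minimum and maximum together
df_minmax = df_intraday.groupby('date')['val'].agg(['min', 'max']).reset_index()
df_minmax.columns = ['date', 'val_min', 'val_max']
print (df_minmax)
        date  val_min  val_max
0 2016-02-01      8.9     10.8
1 2016-03-01      9.0     11.1
2 2016-04-01     13.0     14.4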

Upvotes: 2
