Reputation: 23
I have a data set of temperatures recorded every 15 minutes. The file looks like this (~50000 rows):
02/01/2016;05:15:00;10.800
02/01/2016;05:30:00;10.300
02/01/2016;05:45:00;9.200
02/01/2016;06:00:00;9.200
02/01/2016;06:15:00;8.900
02/01/2016;06:30:00;8.900
02/01/2016;06:45:00;9.400
02/01/2016;07:00:00;9.000
02/01/2016;07:15:00;9.200
02/01/2016;07:30:00;11.100
02/01/2016;07:45:00;13.000
02/01/2016;08:00:00;14.400
02/01/2016;08:15:00;15.600
My goal is to calculate the daily min/max, so here is my code to do it:
# load dataframe
with open(intraday_file_path, "r") as fl:
    df_intraday = pd.read_csv(fl,
                              **load_args
                              )

df_daily = df_intraday.groupby(df_intraday[0])
df_daily = df_daily.aggregate({0: np.max})
df_daily.index.names = [0]
df_daily.reset_index(level=[0], inplace=True)
df_daily.sort_values(by=[0], inplace=True)
df_daily.drop_duplicates(subset=0,
                         keep="first",
                         inplace=True)

daily_name = "daily_%s" % (intraday_file_name,)
daily_path = os.getcwd() + "\\" + daily_name

df_daily = df_daily[[0, 1]]

with open(daily_path, "w") as fl:
    df_daily.to_csv(fl,
                    **save_args
                    )
But the output is strange as soon as I have a temperature below 10°C. For example, for 02/01/2016 the code outputs 9.4°C?!
Any ideas?
Upvotes: 2
Views: 69
Reputation: 7316
FYI, there is a resample option. It's a handy tool for time series.
import pandas as pd
df = pd.read_table('sample.txt', header=None, sep=';')
df.columns=['date', 'time', 'temp']
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
df['temp'] = df['temp'].astype(float) # dtypes should be float as jezrael mentioned.
df = df.set_index('datetime')[['temp']]
df = pd.concat([df.resample('1D').min(),
                df.resample('1D').max()], axis=1)
df.columns = ['temp_min', 'temp_max']
print(df)
Result:
temp_min temp_max
datetime
2016-02-01 8.9 15.6
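A more compact variant (just a sketch, assuming df as it is right after the set_index('datetime') step, i.e. before the concat) gets both aggregates from a single resample call:

daily = df['temp'].resample('1D').agg(['min', 'max'])  # one pass, two aggregations
daily.columns = ['temp_min', 'temp_max']
print(daily)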
Upvotes: 0
Reputation: 862671
The problem is that the data in your last column is not numeric.
The solution is to use to_numeric to convert the bad data to NaNs.
Also, for easier work with the DataFrame you can pass the names parameter to read_csv to set the column names.
import pandas as pd
from io import StringIO
temp=u"""02/01/2016;05:15:00;10.800
02/01/2016;05:30:00;10.300
02/01/2016;05:45:00;9.200
02/01/2016;06:00:00;9.200
02/01/2016;06:15:00;8.900
02/01/2016;06:30:00;8.900
02/01/2016;06:45:00;9.400
03/01/2016;07:00:00;9.000
03/01/2016;07:15:00;9.200
03/01/2016;07:30:00;11.100
04/01/2016;07:45:00;13.000
04/01/2016;08:00:00;14.400
04/01/2016;08:15:00;a"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df_intraday = pd.read_csv(StringIO(temp),
sep=";",
names=['date','time','val'],
parse_dates=[0])
print (df_intraday)
date time val
0 2016-02-01 05:15:00 10.800
1 2016-02-01 05:30:00 10.300
2 2016-02-01 05:45:00 9.200
3 2016-02-01 06:00:00 9.200
4 2016-02-01 06:15:00 8.900
5 2016-02-01 06:30:00 8.900
6 2016-02-01 06:45:00 9.400
7 2016-03-01 07:00:00 9.000
8 2016-03-01 07:15:00 9.200
9 2016-03-01 07:30:00 11.100
10 2016-04-01 07:45:00 13.000
11 2016-04-01 08:00:00 14.400
12 2016-04-01 08:15:00 a
df_daily = df_intraday.groupby('date', as_index=False)['val'].max()
print (df_daily)
date val
0 2016-02-01 9.400
1 2016-03-01 9.200
2 2016-04-01 a
# check dtypes - object is obviously string
print (df_intraday['val'].dtypes)
object
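That object dtype also explains the 9.4 in the question: comparing strings is lexicographic, so '9.400' sorts above '15.600' because '9' > '1'. A quick illustration in plain Python:

print(max(['10.800', '15.600', '9.400']))
9.400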
df_intraday['val'] = pd.to_numeric(df_intraday['val'], errors='coerce')
print (df_intraday)
date time val
0 2016-02-01 05:15:00 10.8
1 2016-02-01 05:30:00 10.3
2 2016-02-01 05:45:00 9.2
3 2016-02-01 06:00:00 9.2
4 2016-02-01 06:15:00 8.9
5 2016-02-01 06:30:00 8.9
6 2016-02-01 06:45:00 9.4
7 2016-03-01 07:00:00 9.0
8 2016-03-01 07:15:00 9.2
9 2016-03-01 07:30:00 11.1
10 2016-04-01 07:45:00 13.0
11 2016-04-01 08:00:00 14.4
12 2016-04-01 08:15:00 NaN
print (df_intraday['val'].dtypes)
float64
# simpler way of aggregating the max
df_daily = df_intraday.groupby('date', as_index=False)['val'].max()
print (df_daily)
date val
0 2016-02-01 10.8
1 2016-03-01 11.1
2 2016-04-01 14.4
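Since the question asks for both the daily min and max, here is a sketch (assuming the numeric df_intraday from above) that returns both in one groupby:

df_daily = df_intraday.groupby('date')['val'].agg(['min', 'max']).reset_index()
print (df_daily)

which should print something like:

        date   min   max
0 2016-02-01   8.9  10.8
1 2016-03-01   9.0  11.1
2 2016-04-01  13.0  14.4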
Upvotes: 2