Reputation: 653
I have the next DataFrame:
data=pd.read_csv('anual.csv', parse_dates='Fecha', index_col=0)
data
DatetimeIndex: 290 entries, 2011-01-01 00:00:00 to 2011-12-31 00:00:00
Data columns (total 12 columns):
HR 290 non-null values
PreciAcu 290 non-null values
RadSolar 290 non-null values
T 290 non-null values
Presion 290 non-null values
Tmax 290 non-null values
HRmax 290 non-null values
Presionmax 290 non-null values
RadSolarmax 290 non-null values
Tmin 290 non-null values
HRmin 290 non-null values
Presionmin 290 non-null values
dtypes: float64(4), int64(8)
where:
data['HR']
Fecha
2011-01-01 37
2011-02-01 70
2011-03-01 62
2011-04-01 69
2011-05-01 72
2011-06-01 71
2011-07-01 71
2011-08-01 70
2011-09-01 40
...
2011-12-17 92
2011-12-18 78
2011-12-19 79
2011-12-20 76
2011-12-21 78
2011-12-22 80
2011-12-23 72
2011-12-24 70
In addition, some months are not always complete. My goal is to calculate the average of each month from daily data. This is achieved as follows:
monthly=data.resample('M', how='mean')
HR PreciAcu RadSolar T Presion Tmax
Fecha
2011-01-31 68.586207 3.744828 163.379310 17.496552 0 25.875862
2011-02-28 68.666667 1.966667 208.000000 18.854167 0 28.879167
2011-03-31 69.136364 3.495455 218.090909 20.986364 0 30.359091
2011-04-30 68.956522 1.913043 221.130435 22.165217 0 31.708696
2011-05-31 72.700000 0.550000 201.100000 18.900000 0 27.460000
2011-06-30 70.821429 6.050000 214.000000 23.032143 0 30.621429
2011-07-31 78.034483 5.810345 188.206897 21.503448 0 27.951724
2011-08-31 71.750000 1.028571 214.750000 22.439286 0 30.657143
2011-09-30 72.481481 0.185185 196.962963 21.714815 0 29.596296
2011-10-31 68.083333 1.770833 224.958333 18.683333 0 27.075000
2011-11-30 71.750000 0.812500 169.625000 18.925000 0 26.237500
2011-12-31 71.833333 0.160000 159.533333 17.260000 0 25.403333
The first error I find is in the column of precipitation, since all observations are 0 in January and an average of 3.74 is obtained for this particular month.
When averages in Excel and compare them with the results above, there is significant variation. For Example, the mean of HR for Febrero is
mean HR using pandas=68.66
mean HR using excel=67
Another detail I found:
data['PreciAcu']['2011-01'].count()
29 and should be 31
Am I doing something wrong? How I can fix this error?
Annex csv file:
[link] https://www.dropbox.com/s/p5hl137bqm82j41/anual.csv
Upvotes: 5
Views: 16189
Reputation: 353389
Your date column is being misinterpreted, because it's in DD/MM/YYYY format. Set dayfirst=True
instead:
>>> df = pd.read_csv('anual.csv', parse_dates='Fecha', dayfirst=True, index_col=0, sep="\s+")
>>> df['PreciAcu']['2011-01'].count()
31
>>> df.resample("M").mean()
HR PreciAcu RadSolar T Presion Tmax \
Fecha
2011-01-31 68.774194 0.000000 162.354839 16.535484 0 25.393548
2011-02-28 67.000000 0.000000 193.481481 15.418519 0 25.696296
2011-03-31 59.083333 0.850000 254.541667 21.295833 0 32.325000
2011-04-30 61.200000 1.312000 260.640000 24.676000 0 34.760000
2011-05-31 NaN NaN NaN NaN NaN NaN
2011-06-30 68.428571 8.576190 236.619048 25.009524 0 32.028571
2011-07-31 81.518519 11.488889 185.407407 22.429630 0 27.681481
2011-08-31 76.451613 0.677419 219.645161 23.677419 0 30.719355
2011-09-30 77.533333 2.883333 196.100000 21.573333 0 28.723333
2011-10-31 73.120000 1.260000 196.280000 19.552000 0 27.636000
2011-11-30 71.277778 -79.333333 148.555556 18.250000 0 26.511111
2011-12-31 73.741935 0.067742 134.677419 15.687097 0 24.019355
HRmax Presionmax Tmin
Fecha
2011-01-31 92.709677 0 10.909677
2011-02-28 92.111111 0 8.325926
2011-03-31 89.291667 0 13.037500
2011-04-30 89.400000 0 17.328000
2011-05-31 NaN NaN NaN
2011-06-30 92.095238 0 19.761905
2011-07-31 97.185185 0 18.774074
2011-08-31 96.903226 0 18.670968
2011-09-30 97.200000 0 16.373333
2011-10-31 97.000000 0 13.412000
2011-11-30 94.555556 0 11.877778
2011-12-31 94.161290 0 10.070968
[12 rows x 9 columns]
(Note, though - I'd forgotten this -- that dayfirst=True
isn't strict, see here. Maybe using date_parser
would be safer.)
Upvotes: 7