Reputation: 133
So, I have a dataset with temperatures from 1952 to 2017. I need to calculate the average monthly temperature for each year separately.
Dataset: https://drive.google.com/file/d/1_RZPLaXoKydjjgm4ghkwtbOGWKC4-Ssc/view?usp=sharing
import numpy as np
fp = 'data/1091402.txt'
data = np.genfromtxt(fp, skip_header=2, usecols=(4, 5, 6, 7, 8))
data_mask = (data<-9998)
data[data_mask] = np.nan
date = data[:, 0]
precip = data[:, 1]
tavg = data[:, 2]
tmax = data[:, 3]
tmin = data[:, 4]
Printing the first five rows of the data gives the following (first is the date, then precipitation, tavg (average temperature), tmax, and tmin):
[[1.9520101e+07 3.1000000e-01 3.7000000e+01 3.9000000e+01 3.4000000e+01]
[1.9520102e+07 nan 3.5000000e+01 3.7000000e+01 3.4000000e+01]
[1.9520103e+07 1.4000000e-01 3.3000000e+01 3.6000000e+01 nan]
[1.9520104e+07 5.0000000e-02 2.9000000e+01 3.0000000e+01 2.5000000e+01]
[1.9520105e+07 6.0000000e-02 2.7000000e+01 3.0000000e+01 2.5000000e+01]]
Here I remove the NaN values (the missing data) from tmax:
missing_tmax_mask = ~np.isfinite(tmax)
np.count_nonzero(missing_tmax_mask)
tmax_mask = np.isfinite(tmax)
tmax_clean = tmax[tmax_mask]
date_clean = date[tmax_mask]
print (tmax_clean)
[39. 37. 36. ... 48. 49. 56.]
Converting the dates to int and then to string strips the trailing '.0', turning 'YYYYMMDD.0' into 'YYYYMMDD':
date_clean_int = date_clean.astype(int)
date_clean_str = date_clean_int.astype(str)
Printing date_clean_str gives the following:
['19520101' '19520102' '19520103' ... '20171001' '20171002' '20171004']
Creating year, month, and day arrays in the formats 'YYYY', 'MM', and 'DD':
year = [datenow[0:4] for datenow in date_clean_str]
year = np.array(year)
month = [d[4:6] for d in date_clean_str]
month = np.array(month)
day = [datenow[6:8] for datenow in date_clean_str]
day = np.array(day)
Printing year, month, and day gives the following:
['1952' '1952' '1952' ... '2017' '2017' '2017']
['01' '01' '01' ... '10' '10' '10']
['01' '02' '03' ... '01' '02' '04']
Here I calculate the mean for each month across all years (excluding 2017):
means_months = np.zeros(12)
index = 0
for month_now in np.unique(month):
    means_months[index] = tmax_clean[(month == month_now) & (year < '2017')].mean()
    index = index + 1
Here I calculate the mean for each year:
means_years = np.zeros(65)
index = 0
# Iterate only over the years before 2017 so that exactly 65 values are written
for year_now in np.unique(year[year < '2017']):
    means_years[index] = tmax_clean[year == year_now].mean()
    index = index + 1
But I want to know how to calculate the mean for every month of every year, i.e. grouped by both year and month, using numpy and the above code. The total count of values would be 780 = 65 years x 12 months. I would prefer an answer in the same form as above if possible. Something like:
means_year_month = np.zeros(780)
index = 0
for ….
This is where I get lost. Maybe using a dictionary with {YYYY: MM...}???
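One way the loop sketched above could be completed, staying with plain numpy and the same masking style. This is only a minimal sketch under stated assumptions: the short arrays below are made-up stand-ins for date_clean_int and tmax_clean, and year_int, month_int, and means_year_month are illustrative names.

```python
import numpy as np

# Made-up stand-ins for date_clean_int and tmax_clean from the question
date_clean_int = np.array([19520101, 19520102, 19520201,
                           19530101, 19530103, 20170101])
tmax_clean = np.array([39.0, 37.0, 30.0, 41.0, 43.0, 50.0])

# Integer arithmetic splits YYYYMMDD without going through strings
year_int = date_clean_int // 10000
month_int = (date_clean_int // 100) % 100

years = np.unique(year_int[year_int < 2017])  # skip the incomplete year 2017
months = np.arange(1, 13)

# One row per year, one column per month; NaN marks months with no data
means_year_month = np.full((len(years), len(months)), np.nan)
for i, year_now in enumerate(years):
    for j, month_now in enumerate(months):
        mask = (year_int == year_now) & (month_int == month_now)
        if mask.any():
            means_year_month[i, j] = tmax_clean[mask].mean()

print(means_year_month.shape)   # (2, 12)
print(means_year_month[0, :2])  # [38. 30.]
```

On the full dataset, and assuming every year from 1952 to 2016 is present, this would give a 65 x 12 array; means_year_month.ravel() would flatten it into the 780 values mentioned above.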
Upvotes: 3
Views: 6089
Reputation: 16505
Maybe using pandas.read_fwf() works better.
import pandas as pd
df = pd.read_fwf('1091402.txt')
df.index = pd.to_datetime(df['DATE'], format='%Y%m%d')
df = df[['TMIN', 'TMAX']]
# Keep only rows where both TMIN and TMAX are present (-9999 marks missing data)
df = df[(df['TMIN'] != -9999) & (df['TMAX'] != -9999)]
print(df.shape)
# print(df)
print()
print('{:7s} | {:12s} | {:12s} | {:12s}'.format(
    'year', 'num_records', 'avg TMIN', 'avg TMAX'))
for key, sub_df in df.groupby(df.index.year):
    print('{:7d} | {:12d} | {:12.1f} | {:12.1f}'.format(
        key,
        sub_df.shape[0],
        sub_df['TMIN'].mean(),
        sub_df['TMAX'].mean()))
print()
print('{:7s} | {:12s} | {:12s} | {:12s}'.format(
    'period', 'num_records', 'avg TMIN', 'avg TMAX'))
for key, sub_df in df.groupby([df.index.year, df.index.month]):
    print('{:4d}-{:02d} | {:12d} | {:12.1f} | {:12.1f}'.format(
        key[0],
        key[1],
        sub_df.shape[0],
        sub_df['TMIN'].mean(),
        sub_df['TMAX'].mean()))
The output is:
year | num_records | avg TMIN | avg TMAX
1952 | 240 | 32.5 | 48.0
1953 | 255 | 35.9 | 50.9
1954 | 246 | 36.4 | 49.7
1955 | 265 | 31.2 | 46.4
1956 | 260 | 31.0 | 47.1
...
period | num_records | avg TMIN | avg TMAX
1952-01 | 10 | 27.5 | 35.1
1952-02 | 18 | 17.2 | 28.8
1952-03 | 20 | -1.1 | 25.6
1952-04 | 23 | 30.1 | 49.7
1952-05 | 21 | 33.6 | 52.9
...
Upvotes: 0
Reputation: 1289
I'm not sure I would use numpy for grouping, but it seems like you are ok with pandas. This is how that can be done:
import pandas as pd
import datetime as dt
# The '!' prefix runs this command in the shell (IPython/Jupyter).
# It collapses each run of spaces into a single space.
!cat 1091402.txt | sed 's/ \{1,\}/ /g' > 1091402_trimmed.txt
df = pd.read_csv('1091402_trimmed.txt', sep=' ')
# Omit line with hyphens
df = df[1:]
# Parse datetime
df['date'] = pd.to_datetime(df['DATE'])
# Extract year and month
df['year'] = df['date'].apply(lambda x: x.year)
df['month'] = df['date'].apply(lambda x: x.month)
for column in ('TMAX', 'TMIN', 'TAVG'):
    # Set N/A for -9999 values
    df[column].replace('-9999', None, inplace=True)
    # Cast all columns to int
    df[column] = df[column].astype('int64')
# Grouping
df.groupby(['year', 'month']).agg({
    'TAVG': ['mean', 'median'],
    'TMAX': ['mean', 'median'],
    'TMIN': ['mean', 'median'],
}).head()
Outputs:
Out[1]:
TAVG TMAX TMIN
mean median mean median mean median
year month
1952 1 29.478261 29.0 32.608696 30.0 28.434783 28.0
2 24.800000 26.0 29.000000 28.0 18.400000 19.0
3 13.807692 10.5 26.423077 25.0 1.230769 -4.0
4 39.607143 38.0 49.035714 48.0 30.285714 30.0
5 44.666667 44.0 52.555556 54.0 33.629630 34.0
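On recent pandas versions the replace(..., None) and astype('int64') steps may not behave as the comments describe, since an int64 column cannot hold missing values. A hedged variant of the cleaning and grouping steps is sketched below. The tiny DataFrame is made-up stand-in data with string columns (as they would be after reading a file that contains the hyphen row), so the numbers it produces are not comparable to the output above.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the parsed file; values are strings, -9999 means missing
df = pd.DataFrame({
    'DATE': ['19520101', '19520102', '19520103', '19520201'],
    'TMAX': ['39', '37', '-9999', '30'],
    'TMIN': ['34', '-9999', '32', '20'],
})

df['date'] = pd.to_datetime(df['DATE'], format='%Y%m%d')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

for column in ('TMAX', 'TMIN'):
    # Convert to numbers, then turn the -9999 sentinel into NaN (float column)
    df[column] = pd.to_numeric(df[column]).replace(-9999, np.nan)

# mean() skips NaN by default, so missing days do not distort the averages
monthly = df.groupby(['year', 'month'])[['TMAX', 'TMIN']].mean()
print(monthly)
```

Keeping the columns as floats with NaN, rather than casting back to int64, is what lets the missing days drop out of the averages.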
Upvotes: 0
Reputation: 31
import pandas as pd
b = pd.read_csv('b.dat')
b['date'] = pd.to_datetime(b['date'], format='%m/%d/%y %I:%M%p')
b.index = b['date']
b.index.month  # gives the month of each row (the index is now a DatetimeIndex)
b.groupby(by=[b.index.month])
You can group by year or day in the same way and then simply calculate the average.
Did you try this? It is a fast and efficient way.
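To make the idea concrete, here is a minimal sketch that groups by both the year and the month of a DatetimeIndex and takes the mean. The frame b and its temp column are made-up stand-ins, since the contents of b.dat are not shown, and the names years, months, and monthly_mean are illustrative.

```python
import pandas as pd

# Made-up stand-in data; the real b.dat is not shown in the answer
b = pd.DataFrame({
    'date': ['01/01/17 12:00PM', '01/02/17 12:00PM', '02/01/17 12:00PM'],
    'temp': [39.0, 37.0, 30.0],
})
b.index = pd.to_datetime(b['date'], format='%m/%d/%y %I:%M%p')

# Rename the two grouping keys so the result has distinct level names
years = b.index.year.rename('year')
months = b.index.month.rename('month')

# One mean per (year, month) pair
monthly_mean = b.groupby([years, months])['temp'].mean()
print(monthly_mean)
```

Selecting the temp column before calling mean() avoids trying to average the string date column.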
Upvotes: 2