mkw

Reputation: 133

How do I calculate average monthly temperatures per each year separately using numpy?

I have a dataset with temperatures from 1952 to 2017. I need to calculate the average monthly temperature for each year separately.

Dataset: https://drive.google.com/file/d/1_RZPLaXoKydjjgm4ghkwtbOGWKC4-Ssc/view?usp=sharing

import numpy as np
fp = 'data/1091402.txt'
data = np.genfromtxt(fp, skip_header=2, usecols=(4, 5, 6, 7, 8))
data_mask = (data<-9998)
data[data_mask] = np.nan
date = data[:, 0]
precip = data[:, 1]
tavg = data[:, 2]
tmax = data[:, 3]
tmin = data[:, 4]

Printing the first five rows of the data gives the following (the columns are date, then precipitation, tavg (average temperature), tmax, and tmin):

[[1.9520101e+07 3.1000000e-01 3.7000000e+01 3.9000000e+01 3.4000000e+01]
 [1.9520102e+07           nan 3.5000000e+01 3.7000000e+01 3.4000000e+01]
 [1.9520103e+07 1.4000000e-01 3.3000000e+01 3.6000000e+01           nan]
 [1.9520104e+07 5.0000000e-02 2.9000000e+01 3.0000000e+01 2.5000000e+01]
 [1.9520105e+07 6.0000000e-02 2.7000000e+01 3.0000000e+01 2.5000000e+01]]

Here I remove the NaN values (the missing data) from tmax:

missing_tmax_mask = ~np.isfinite(tmax)
np.count_nonzero(missing_tmax_mask)
tmax_mask = np.isfinite(tmax)
tmax_clean = tmax[tmax_mask]
date_clean = date[tmax_mask]
print(tmax_clean)
[39. 37. 36. ... 48. 49. 56.]

Converting the dates to int and then to string, to turn 'YYYYMMDD.0' into 'YYYYMMDD':

date_clean_int = date_clean.astype(int)
date_clean_str = date_clean_int.astype(str)

Printing date_clean_str gives the following:

['19520101' '19520102' '19520103' ... '20171001' '20171002' '20171004']

Creating a year, a month and a day array in the format 'YYYY' and 'MM' and 'DD':

year = [datenow[0:4] for datenow in date_clean_str]
year = np.array(year)
month = [d[4:6] for d in date_clean_str]
month = np.array(month)
day = [datenow[6:8] for datenow in date_clean_str]
day = np.array(day)

Printing year, month, and day gives the following:

['1952' '1952' '1952' ... '2017' '2017' '2017']
['01' '01' '01' ... '10' '10' '10']
['01' '02' '03' ... '01' '02' '04']
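The same split can also be sketched with integer arithmetic instead of string slicing. Here `date_sample` is a made-up stand-in for `date_clean`, and the `*_int` names are illustrative, not part of the code above:

```python
import numpy as np

# Hypothetical sample in the same YYYYMMDD float layout as the date column
date_sample = np.array([19520101.0, 19520102.0, 20171004.0])

date_int = date_sample.astype(int)
year_int = date_int // 10000          # 19520101 -> 1952
month_int = (date_int // 100) % 100   # 19520101 -> 1
day_int = date_int % 100              # 19520101 -> 1

print(year_int, month_int, day_int)
```

Note that this yields integers rather than strings, so a comparison such as `year < '2017'` would become `year_int < 2017`.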

Here I calculate the mean value for each month across all years before 2017:

means_months = np.zeros(12)
index = 0
for month_now in np.unique(month):
    means_months[index] = tmax_clean[(month == month_now) & (year < '2017')].mean()
    index = index + 1

Here I calculate the mean for each year:

means_years = np.zeros(65)
index = 0
# Loop only over the 65 years before 2017 so the index stays within means_years
for year_now in np.unique(year[year < '2017']):
    means_years[index] = tmax_clean[year == year_now].mean()
    index = index + 1

But I want to know how to calculate the mean for every month of every year, that is, grouped by both month and year, using numpy and the code above. The total count of values would be 780 = 65 years x 12 months. I would prefer an answer in the same form as above if possible. Something like:

means_year_month = np.zeros(780)
index = 0
for ….

This is where I get lost. Maybe using a dictionary with {YYYY: MM...}???
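For reference, a nested loop in the same style as the loops above might look like the sketch below. It runs on small made-up arrays rather than the real file, and the 2-D layout (one row per year, one column per month) is an assumption about what is wanted, not something taken from the dataset:

```python
import numpy as np

# Synthetic stand-ins for the year, month and tmax_clean arrays built above
year = np.array(['1952', '1952', '1952', '1953', '1953', '2017'])
month = np.array(['01', '01', '02', '01', '01', '10'])
tmax_clean = np.array([39.0, 37.0, 30.0, 40.0, 42.0, 56.0])

years = np.unique(year[year < '2017'])            # skip 2017, as in the loops above
months = [f'{m:02d}' for m in range(1, 13)]       # '01' ... '12'

# One row per year, one column per month; NaN marks a month with no readings
means_year_month = np.full((len(years), len(months)), np.nan)
for i, year_now in enumerate(years):
    for j, month_now in enumerate(months):
        selected = tmax_clean[(year == year_now) & (month == month_now)]
        if selected.size > 0:
            means_year_month[i, j] = selected.mean()

print(means_year_month)
```

With 65 years of data, `means_year_month.ravel()` would give the flat array of 65 x 12 = 780 values, ordered by year and then by month.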

Upvotes: 3

Views: 6089

Answers (3)

Ralf

Reputation: 16505

Maybe using pandas.read_fwf() works better.

import pandas as pd

df = pd.read_fwf('1091402.txt')
df.index = pd.to_datetime(df['DATE'], format='%Y%m%d')
df = df[['TMIN', 'TMAX']]
# Keep only rows where both TMIN and TMAX are present (-9999 marks missing data)
df = df[(df['TMIN'] != -9999) & (df['TMAX'] != -9999)]
print(df.shape)
# print(df)

print()
print('{:7s} | {:12s} | {:12s} | {:12s}'.format(
    'year', 'num_records', 'avg TMIN', 'avg TMAX'))
for key, sub_df in df.groupby(df.index.year):
    print('{:7d} | {:12d} | {:12.1f} | {:12.1f}'.format(
        key,
        sub_df.shape[0],
        sub_df['TMIN'].mean(),
        sub_df['TMAX'].mean()))

print()
print('{:7s} | {:12s} | {:12s} | {:12s}'.format(
    'period', 'num_records', 'avg TMIN', 'avg TMAX'))
for key, sub_df in df.groupby([df.index.year, df.index.month]):
    print('{:4d}-{:02d} | {:12d} | {:12.1f} | {:12.1f}'.format(
        key[0],
        key[1],
        sub_df.shape[0],
        sub_df['TMIN'].mean(),
        sub_df['TMAX'].mean()))

The output is:

year    | num_records  | avg TMIN     | avg TMAX    
  1952  |          240 |         32.5 |         48.0
  1953  |          255 |         35.9 |         50.9
  1954  |          246 |         36.4 |         49.7
  1955  |          265 |         31.2 |         46.4
  1956  |          260 |         31.0 |         47.1
...

period  | num_records  | avg TMIN     | avg TMAX    
1952-01 |           10 |         27.5 |         35.1
1952-02 |           18 |         17.2 |         28.8
1952-03 |           20 |         -1.1 |         25.6
1952-04 |           23 |         30.1 |         49.7
1952-05 |           21 |         33.6 |         52.9
...
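If a year-by-month table is more convenient than a printed list, the grouped means can be reshaped with `unstack()`. This is only a sketch on a tiny synthetic frame; `df_demo`, `monthly` and `table` are illustrative names, not part of the code above:

```python
import pandas as pd

# Synthetic daily readings standing in for the TMAX column of the real file
idx = pd.to_datetime(['1952-01-01', '1952-01-02', '1952-02-01', '1953-01-01'])
df_demo = pd.DataFrame({'TMAX': [39.0, 37.0, 30.0, 40.0]}, index=idx)

# Mean per (year, month), then pivot the month level into columns
monthly = df_demo['TMAX'].groupby([df_demo.index.year, df_demo.index.month]).mean()
table = monthly.unstack()   # rows: years, columns: months, NaN where no data

print(table)
```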

Upvotes: 0

beyondfloatingpoint

Reputation: 1289

I'm not sure I would use numpy for grouping, but it seems like you are ok with pandas. This is how that can be done:

import pandas as pd
import datetime as dt

# The leading '!' runs this command in the shell (IPython/Jupyter syntax).
# It collapses each run of spaces into a single space.
!cat 1091402.txt | sed 's/ \{1,\}/ /g' > 1091402_trimmed.txt
df = pd.read_csv('1091402_trimmed.txt', sep=' ')

# Omit line with hyphens
df = df[1:]
# Parse datetime
df['date'] = pd.to_datetime(df['DATE'])
# Extract year and month
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
for column in ('TMAX', 'TMIN', 'TAVG'):
    # Set N/A for -9999 values
    df[column].replace('-9999', None, inplace=True) 
    # Cast all columns to int
    df[column] = df[column].astype('int64')
# Grouping
df.groupby(['year', 'month']).agg({
    'TAVG': ['mean', 'median'],
    'TMAX': ['mean', 'median'],
    'TMIN': ['mean', 'median'],
}).head()

Outputs:

Out[1]:
                 TAVG              TMAX              TMIN
                 mean median       mean median       mean median
year month
1952 1      29.478261   29.0  32.608696   30.0  28.434783   28.0
     2      24.800000   26.0  29.000000   28.0  18.400000   19.0
     3      13.807692   10.5  26.423077   25.0   1.230769   -4.0
     4      39.607143   38.0  49.035714   48.0  30.285714   30.0
     5      44.666667   44.0  52.555556   54.0  33.629630   34.0
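One thing to watch with the `-9999` sentinel: an `int64` column cannot hold NaN, so if the missing readings should be excluded from the averages, the column has to become a float column. A minimal sketch of that idea on a made-up Series (`tmin_demo` and `tmin_numeric` are hypothetical names, and this is an alternative to the replacement step above, not a description of it):

```python
import pandas as pd

# Synthetic column using the same -9999 sentinel, read in as strings
tmin_demo = pd.Series(['28', '-9999', '30'])

tmin_numeric = pd.to_numeric(tmin_demo, errors='coerce')   # strings -> numbers
tmin_numeric = tmin_numeric.mask(tmin_numeric == -9999)    # sentinel -> NaN, dtype becomes float

# mean() skips NaN by default, so only the two real readings are averaged
print(tmin_numeric.mean())
```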

Upvotes: 0

Junaid_Ghauri

Reputation: 31

b=pd.read_csv('b.dat')

b['date']=pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')

b.index=b['date']

b.index.month # will give you indexes of months (can access the month like this)

b.groupby(by=[b.index.month])

You can group by year or day in the same way and then simply calculate the average.

Did you try this? It is a fast and efficient way.

Upvotes: 2
