Mar

Reputation: 411

Group data by season according to the exact dates

I have a CSV file containing 4 years of data, and I am trying to group the data by season across all 4 years. In other words, I need to summarize and plot the whole dataset as just 4 seasons. Here is a sample of my data file:

timestamp,heure,lat,lon,impact,type
2006-01-01 00:00:00,13:58:43,33.837,-9.205,10.3,1
2006-01-02 00:00:00,00:07:28,34.5293,-10.2384,17.7,1
2007-02-01 00:00:00,23:01:03,35.0617,-1.435,-17.1,2
2007-02-02 00:00:00,01:14:29,36.5685,0.9043,36.8,1
2008-01-01 00:00:00,05:03:51,34.1919,-12.5061,-48.9,1
2008-01-02 00:00:00,05:03:51,34.1919,-12.5061,-48.9,1
....
2011-12-31 00:00:00,05:03:51,34.1919,-12.5061,-48.9,1

And here is my desired output:

winter     (the mean value of impacts)
summer     (the mean value of impacts)
autumn      ....
spring      .....

So far I have tried this code:

names =["timestamp","heure","lat","lon","impact","type"]
data = pd.read_csv('flash.txt',names=names, parse_dates=['timestamp'],index_col=['timestamp'], dayfirst=True)

spring = range(80, 172)
summer = range(172, 264)
fall = range(264, 355)

def season(x):
    if x in spring:
        return 'Spring'
    if x in summer:
        return 'Summer'
    if x in fall:
        return 'Fall'
    else:
        return 'Winter'

data['SEASON'] = data.index.to_series().dt.month.map(lambda x : season(x))
data['impact'] = data['impact'].abs()
seasonly = data.groupby('SEASON')['impact'].mean()

And I got this horrible result:

[image: plot of the result]

Where am I mistaken?

Upvotes: 5

Views: 4724

Answers (3)

piRSquared

Reputation: 294358

pandas.cut
In order to properly handle 'Winter' being both at the beginning and end of the year, I shifted the dayofyear by 11 and took the results modulo 366. The reason I don't use the same technique as in the numpy solution below is that pd.cut returns a categorical type and I would end up with 5 categories in which two categories had the same label. I could then cast the result as string, but that felt sloppy.

data['SEASON'] = pd.cut(
    (data.index.dayofyear + 11) % 366,
    [0, 91, 183, 275, 366],
    labels=['Winter', 'Spring', 'Summer', 'Fall'],
    right=False  # left-closed bins, so a shifted value of 0 (day 355) is not left as NaN
)
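
A quick way to sanity-check this kind of binning is to run it over every day of a leap year and confirm that no date is left unlabelled. The sketch below wraps the shift-and-cut idea in a hypothetical helper, `label_seasons` (not part of pandas), and assumes left-closed bins (`right=False`) so that a shifted value of exactly 0, which day 355 produces, still falls into the first bin.

```python
import pandas as pd

def label_seasons(index):
    """Hypothetical helper: label each timestamp of a DatetimeIndex with a season."""
    # Shift so that winter starts at 0 instead of wrapping around the year end.
    shifted = (index.dayofyear + 11) % 366
    return pd.cut(
        shifted,
        bins=[0, 91, 183, 275, 366],
        labels=['Winter', 'Spring', 'Summer', 'Fall'],
        right=False,  # bins are [0, 91), [91, 183), [183, 275), [275, 366)
    )

# Every day of a leap year, so day-of-year runs from 1 to 366.
days = pd.date_range('2008-01-01', '2008-12-31', freq='D')
seasons = pd.Series(label_seasons(days), index=days)

print(seasons.isna().sum())     # number of unlabelled days
print(seasons.value_counts())   # days per season
```

With these bins, day 80 is the first day of spring and day 355 the first day of winter, which matches the `range(80, 172)` style boundaries in the question.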

numpy.searchsorted
In order to properly handle 'Winter' being both at the beginning and end of the year, I allowed two bins for 'Winter', one on each side of the year:

seasons = np.array(['Winter', 'Spring', 'Summer', 'Fall', 'Winter'])
f = np.searchsorted([80, 172, 264, 355], data.index.dayofyear)
data['SEASON'] = seasons[f]
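
To see how `np.searchsorted` assigns the boundary days, it may help to run it on a handful of day-of-year values. The sketch below uses a made-up `sample_days` array rather than real data, and also shows the `side='right'` variant, which moves each boundary day (80, 172, 264, 355) into the later season.

```python
import numpy as np

seasons = np.array(['Winter', 'Spring', 'Summer', 'Fall', 'Winter'])
edges = [80, 172, 264, 355]

# Hypothetical day-of-year values around the season boundaries.
sample_days = np.array([1, 80, 81, 172, 264, 355, 356, 366])

# Default side='left': a day equal to an edge stays in the earlier season.
left = seasons[np.searchsorted(edges, sample_days)]
# side='right': a day equal to an edge moves to the later season.
right = seasons[np.searchsorted(edges, sample_days, side='right')]

for day, a, b in zip(sample_days, left, right):
    print(day, a, b)
```

Which variant is appropriate depends on whether the boundary day should count as the last day of one season or the first day of the next.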

plot

data.groupby('SEASON')['impact'].mean().plot.bar()

[image: bar chart of mean impact per season]

Upvotes: 4

jezrael

Reputation: 862921

You need DatetimeIndex.dayofyear:

data['SEASON'] = data.index.dayofyear.map(season)
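
For reference, here is a self-contained sketch of that fix applied end to end. The `season` function and the ranges are the ones from the question; the small frame is only illustrative: the first three rows are taken from the question's sample, while the last two (April 2010 and July 2009) are invented so that more than one season appears.

```python
import pandas as pd

spring = range(80, 172)
summer = range(172, 264)
fall = range(264, 355)

def season(day_of_year):
    """Map a day of the year (1-366) to a season name."""
    if day_of_year in spring:
        return 'Spring'
    if day_of_year in summer:
        return 'Summer'
    if day_of_year in fall:
        return 'Fall'
    return 'Winter'

# First three rows come from the question; the last two are invented for illustration.
data = pd.DataFrame(
    {'impact': [10.3, 17.7, -17.1, -12.0, 20.0]},
    index=pd.to_datetime(
        ['2006-01-01', '2006-01-02', '2007-02-01', '2010-04-10', '2009-07-15']
    ),
)

# Use the day of the year, not the month, as input to season().
data['SEASON'] = data.index.dayofyear.map(season)
data['impact'] = data['impact'].abs()
seasonly = data.groupby('SEASON')['impact'].mean()
print(seasonly)
```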

Another solution with pandas.cut:

bins = [0, 91, 183, 275, 366]
labels=['Winter', 'Spring', 'Summer', 'Fall']
doy = data.index.dayofyear
data['SEASON1'] = pd.cut(doy + 11 - 366*(doy > 355), bins=bins, labels=labels)
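
The expression `doy + 11 - 366*(doy > 355)` shifts every day forward by 11 and wraps only the days after day 355 back to the start, so the shifted values run from 1 to 366 and never hit 0. A minimal sketch on a few hand-picked days, assuming a plain NumPy array in place of `data.index.dayofyear`:

```python
import numpy as np
import pandas as pd

bins = [0, 91, 183, 275, 366]
labels = ['Winter', 'Spring', 'Summer', 'Fall']

# Hypothetical day-of-year values around the winter boundaries.
doy = np.array([1, 80, 81, 355, 356, 366])

# Days 356..366 wrap to 1..11; all other days are shifted up by 11.
shifted = doy + 11 - 366 * (doy > 355)
result = pd.cut(shifted, bins=bins, labels=labels)

for day, s, name in zip(doy, shifted, result):
    print(day, s, name)
```

Because `pd.cut` uses right-closed bins by default, day 80 is the last day of winter and day 355 the last day of fall under this variant.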

Upvotes: 5

It looks like this line:

data['SEASON'] = data.index.to_series().dt.month.map(lambda x : season(x))

passes the month (1-12) to season(). Every one of those values is below 80, so every row is labelled "Winter". You need to use the day of the year instead.

But you could probably have seen this more easily and made it possible to print to check it yourself if you hadn't locked the extraction of the day away inside a one-liner. Just saying.
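
As a sketch of that kind of check, the snippet below pulls the month and the day of the year out into separate columns and prints them side by side. The first and last dates appear in the question's sample; the middle one is invented for illustration.

```python
import pandas as pd

# Two dates from the question's sample plus one invented mid-year date.
idx = pd.to_datetime(['2006-01-01', '2007-06-15', '2011-12-31'])

# Keep the intermediate values in named columns so they can be inspected.
check = pd.DataFrame(
    {'month': idx.month, 'dayofyear': idx.dayofyear},
    index=idx,
)
print(check)

# Every month number is below 80, the first day of the spring range.
print((check['month'] < 80).all())
```

Printing the intermediate column makes it obvious that values in the range 1-12 can never land in any of the spring, summer, or fall ranges.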

Upvotes: 2
