eduardo_praun
eduardo_praun

Reputation: 51

Resample dataset

I am working on the Metro Interstate Traffic Volume data set (available here: http://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume) but I can't resample the dataset to show the average traffic volume per day instead of showing it per hour.

metro = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
metro['date_time'] = pd.to_datetime(metro['date_time'], format='%Y-%m-%d %H:%M:%S')
metro.set_index('date_time', inplace=True, drop=True)
metro.resample('1Y').mean()

This is what I get:

                    holiday    temp  ...     weather_description  traffic_volume
date_time                            ...                                        
2012-10-02 09:00:00    None  288.28  ...        scattered clouds            5545
2012-10-02 10:00:00    None  289.36  ...           broken clouds            4516
2012-10-02 11:00:00    None  289.58  ...         overcast clouds            4767
2012-10-02 12:00:00    None  290.13  ...         overcast clouds            5026
2012-10-02 13:00:00    None  291.14  ...           broken clouds            4918
...                     ...     ...  ...                     ...             ...
2018-09-30 19:00:00    None  283.45  ...           broken clouds            3543
2018-09-30 20:00:00    None  282.76  ...         overcast clouds            2781
2018-09-30 21:00:00    None  282.73  ...  proximity thunderstorm            2159
2018-09-30 22:00:00    None  282.09  ...         overcast clouds            1450
2018-09-30 23:00:00    None  282.12  ...         overcast clouds             954

[48204 rows x 8 columns]

Do you have any idea on how to solve it?


Edit:

Also, I checked the pandas reference for resampling (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html) and executed the following example code:

d = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
          'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
df = pd.DataFrame(d)
df['week_starting'] = pd.date_range('01/01/2018',
                                    periods=8,
                                    freq='W')
df
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
df.resample('M', on='week_starting').mean()
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0

However, the result for me is the same before and after the resampling.

Upvotes: 2

Views: 186

Answers (2)

Renaud
Renaud

Reputation: 2819

You can try by creating a year column:

metro['year'] =metro['date_time'].dt.year
metro.groupby['year'].mean()

Upvotes: 1

Quang Hoang
Quang Hoang

Reputation: 150735

One year is not a fix time period: some years have 365 days, some have 366. You can use groupby:

metro = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
metro['date_time'] = pd.to_datetime(metro['date_time'], format='%Y-%m-%d %H:%M:%S')

# extract the yer
metro.groupby(metro['date_time'].dt.year).mean()

Output:

                 temp   rain_1h   snow_1h  clouds_all  traffic_volume
date_time                                                            
2012       274.991782  0.000000  0.000000   65.295819     3207.802657
2013       278.976352  0.161284  0.000000   52.560947     3286.762160
2014       276.786438  0.243251  0.000000   49.070469     3250.938004
2015       287.689574  0.339218  0.001795   40.988338     3242.900983
2016       282.520790  1.192969  0.000308   48.628842     3169.441328
2017       281.463309  0.000000  0.000000   50.005281     3340.703065
2018       282.851502  0.121765  0.000000   45.567996     3260.112341

Upvotes: 2

Related Questions