Ken Lawlor
Ken Lawlor

Reputation: 99

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer and learning python (+pandas) and hope I can explain this well enough. I have a large time series pd dataframe of over 3 million rows and initially 12 columns spanning a number of years. This covers people taking a ticket from different locations denoted by Id numbers(350 of them). Each row is one instance (one ticket taken). I have searched many questions like counting records per hour per day and getting average per hour over several years. However, I run into the trouble of including the 'Id' variable. I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:

    Id          Start_date  Count  Day_name_no
    149 2011-12-31 21:30:00      1            5  
    150 2011-12-31 20:51:00      1            0  
    259 2011-12-31 20:48:00      1            1  
    3015 2011-12-31 19:38:00     1            4  
    28 2011-12-31 19:37:00       1            4  

Using groupby and Start_date.index.hour, I cant seem to include the 'Id'.

My alternative approach is to split the hour out of the date and have the following:

    Id  Count  Day_name_no  Trip_hour
    149      1            2         5
    150      1            4         10
    153      1            2         15
    1867     1            4         11
    2387     1            2         7

I then get the count first with:

Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()

     Id Day_name_no Trip_hour   Count
     1  0           7          24
     1  0           8          48
     1  0           9          31
     1  0           10         28
     1  0           11         26
     1  0           12         25

Then use groupby and mean:

Mean_Count = Count_Item.groupby(Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']).mean().reset_index()

However, this does not give the desired result as the mean values are incorrect. I hope I have explained this issue in a clear way. I looking for the mean per hour per day per Id as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.

Any help would be grateful and if possible an explanation of what I am doing wrong either code wise or my approach.

Thanks in advance.

I have edited this to try make it a little clearer. Writing a question with a lack of sleep is probably not advisable. A toy dataset that i start with:

    Date        Id     Dow Hour Count
    12/12/2014  1234    0   9   1
    12/12/2014  1234    0   9   1
    12/12/2014  1234    0   9   1
    12/12/2014  1234    0   9   1
    12/12/2014  1234    0   9   1
    19/12/2014  1234    0   9   1
    19/12/2014  1234    0   9   1
    19/12/2014  1234    0   9   1
    26/12/2014  1234    0   10  1
    27/12/2014  1234    1   11  1
    27/12/2014  1234    1   11  1
    27/12/2014  1234    1   11  1
    27/12/2014  1234    1   11  1
    04/01/2015  1234    1   11  1

I now realise I would have to use the date first and get something like:

    Date         Id    Dow Hour Count
    12/12/2014  1234    0   9   5
    19/12/2014  1234    0   9   3
    26/12/2014  1234    0   10  1
    27/12/2014  1234    1   11  4
    04/01/2015  1234    1   11  1

And then calculate the mean per Id, per Dow, per hour. And want to get this:

    Id  Dow Hour    Mean
    1234    0   9   4
    1234    0   10  1
    1234    1   11  2.5

I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows, contains 350 Id numbers.

Upvotes: 3

Views: 3260

Answers (2)

CaMaDuPe85
CaMaDuPe85

Reputation: 89

You can use the groupby function using the 'Id' column and then use the resample function with how='sum'.

Upvotes: 0

Def_Os
Def_Os

Reputation: 5467

Your question is not very clear, but I hope this helps:

df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()

Upvotes: 4

Related Questions