jakun
jakun

Reputation: 674

Resampling timestamps with pandas: Why are Mondays counted for a wrong week?

I'm having a hard time trying to take my first steps with pandas. I would like to create a bar diagram showing how often something has happened a week. I want to identify the week by the first day of the week, which is a Monday in my case. Also, I want to ensure that the last week displayed is always the current week, even if nothing has happened this week.

import datetime
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.ticker import MultipleLocator

# create some example data frame
timestamps = [
    "2021-01-08 11:21:14",
    "2021-02-15 08:04:46",
    "2021-02-18 16:49:39",
    "2021-02-24 11:59:39",
    "2021-03-03 08:29:39",
]
df = pd.DataFrame(dict(timestamp=timestamps))
df.timestamp = df.timestamp.astype('datetime64')

# ensure that this week is contained
df = df.append(dict(timestamp=datetime.datetime.now()), ignore_index=True)

# process data to histogram
# TODO: Mondays are counted as the week before
histogram_df = df.resample('W-MON', label='left', on='timestamp').count()

# remove fake entry which I added to ensure that the current week appears
histogram_df['timestamp'][-1] -= 1

# plot the data
ax = histogram_df.plot(y='timestamp', legend=False, kind='bar', rot=0, title='number')
ax.set_xlabel('')
ax.set_xticklabels(map(lambda t: t.strftime('KW %V\n%d.%m.%Y'), histogram_df.index))
ax.yaxis.set_major_locator(MultipleLocator(1))

plt.show()

This does almost what I want but Mondays (see February 15th) are counted for the week before. Why is that? How can I get Mondays to be counted for the week they are in?

The documentation of resample does not say what str values it's first argument called rule accepts or what they mean. 'W-MON' is mentioned here but without much explanation.

My initial understanding was 'W-MON' means "weekly with weeks starting on Mondays" and label='left' means "take the first day of the week instead of the last day". But that has proven to be wrong. So what do 'W-MON' and label='left' really mean?

Bonus question: Appending a row to ensure that the current week appears in the diagram and then decrementing the last count is not exactly safe in case a value from the future were to appear in the data. Is there a better way to do this?

Upvotes: 4

Views: 1774

Answers (2)

jakun
jakun

Reputation: 674

@SeaBean has correctly pointed out that I need to add closed='left' in order to achieve what I wanted but without really saying what was going on and what this does. I think I am starting to understand what's going on here so let me give it a try at explaining it.

We have a timeline of events:

Timeline

With df.resample(...).count() I am splitting the timeline into several intervals and count the number of events in each interval. Several questions arise on how to do the resampling and their answers lead to the arguments we need to pass to the function call.

The first question is: How big are the time intervals and where do they start/end?
rule='W-MON' means "weekly, on every Monday".

Timeline with intervals

The second question is: How do I label these time intervals?
label='left' means "label them by the left border of the interval", in this case the start of the week.

The default (for weekly intervals) label='right' would mean "label the intervals by their right border, i.e. the next Monday". This makes sense if you don't explicitly specify a weekday because rule='W' is equivalent to rule='W-SUN'. So if you use rule='W' without other arguments that means "label the intervals by the end of the week (Sunday)".

The third question is: Which interval does an event belong to which is on the border between two intervals?
The time of the day does not seem to matter for weekly intervals, so let me "normalize" the time stamps first (i.e. set the time to 00:00) to make it clearer that an event is on the border between two intervals:

Normalized timeline

The answer to this question is the closed parameter and I am getting the feeling it makes sense to always pass the same value to it like to label. closed='left' means "the left border of the interval belongs to the interval and the right border of the interval belongs to the next interval". I have tried to visualize that with parentheses and square brackets:

Timeline with half open intervals

For more about open and closed intervals see Wikipedia.

I have created the graphics with the following code:

#!/usr/bin/env python3

import datetime
import pandas as pd
from matplotlib import pyplot as plt

FIGURE_TIMELINE = 1
FIGURE_TIMELINE_WEEKS = 2
FIGURE_TIMELINE_NORMALIZED = 3
FIGURE_TIMELINE_INTERVALS = 4

FIGURE = FIGURE_TIMELINE_INTERVALS

# create some example data frame
timestamps = [
    "2021-01-08 11:21:14",
    "2021-02-15 08:04:46",
    "2021-02-18 16:49:39",
    "2021-02-24 11:59:39",
    "2021-03-03 08:29:39",
]
df = pd.DataFrame(dict(timestamp=timestamps))
df.timestamp = df.timestamp.astype('datetime64')

if FIGURE >= FIGURE_TIMELINE_NORMALIZED:
    df.timestamp = df.timestamp.dt.normalize()

# draw time line
x0 = df.timestamp.min().normalize() - pd.offsets.Week(weekday=0)
x1 = datetime.datetime.now()
df.insert(0, 'zero', 0)
ax = df.plot(x='timestamp', y='zero', style='D', markersize=10, xlabel='', legend=False, xlim=(x0, x1), rot=0)
ax.spines['left'].set_position('zero')
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
if FIGURE == FIGURE_TIMELINE:
    ax.grid(which='major', axis='y')
    ax.set_yticks([0])
    ax.set_xticks([])
else:
    ax.grid(which='major', axis='both')
    ax.set_yticks([0])

    xticks = pd.date_range(x0, x1, freq='W-MON')
    ax.set_xticks(xticks)
    label_fmt = '%A\n%d.%m.%Y'
    if FIGURE >= FIGURE_TIMELINE_INTERVALS:
        label_fmt += '\n\n)['
    ax.set_xticklabels(xticks.strftime(label_fmt))
    for label in ax.xaxis.get_majorticklabels():
        label.set_horizontalalignment('center')

    week_labels = pd.date_range(x0+pd.offsets.Hour(12), x1, freq='W-THU')
    ax.set_xticks(week_labels, minor=True)
    ax.set_xticklabels(week_labels.strftime('CW%V'), minor=True)
    ax.tick_params(axis='x', which='minor',length=0, pad=-20)

border = .03
border_bottom = .3 if FIGURE >= FIGURE_TIMELINE_INTERVALS else .2 if FIGURE >= FIGURE_TIMELINE_WEEKS else border
plt.subplots_adjust(left=border, right=1-border, top=1-border, bottom=border_bottom)
plt.show()

Upvotes: 3

SeaBean
SeaBean

Reputation: 23217

Try using also closed='left' in the df.resample() call, like below:

histogram_df = df.resample('W-MON', label='left', closed='left', on='timestamp').count()

From the doc on the parm closed (extracted below), the default is right for frequency 'W':

closed{‘right’, ‘left’}, default None

Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

Upvotes: 3

Related Questions