mike921

Reputation: 241

Create a Pandas daily aggregate time series from a DataFrame with date ranges

I have a Pandas DataFrame of subscriptions, each with a start datetime (timestamp) and an optional end datetime (if they were canceled).

For simplicity, I have created string columns for the date (e.g. "20170901") based on start and end datetimes (timestamps). It looks like this:

df = pd.DataFrame([('20170511', None), ('20170514', '20170613'), ('20170901', None), ...], columns=["sd", "ed"])

The end result should be a time series of how many subscriptions were active on any given date in a range.

To that end, I created an Index for all the days within a range:

days = df.groupby(["sd"])["sd"].count()

I am able to compute what I am interested in with a loop, where each iteration runs a filter over the entire DataFrame df:

count_by_day = pd.DataFrame(
    [len(df.loc[(df.sd <= i) & (df.ed.isnull() | (df.ed > i))]) for i in days.index],
    index=days.index)

Note that I have values for each day in the original dataset, so there are no gaps. I'm sure getting the date range can be improved.
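For instance, here is a minimal sketch of the improvement I have in mind, building the range directly from the min/max start dates (day_index is just an illustrative name; note it holds Timestamps while sd holds strings, so the comparisons in the loop above would need the dates parsed first):

# Sketch: derive the full daily range directly instead of via groupby
day_index = pd.date_range(start=df.sd.min(), end=df.sd.max(), freq='D')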

The actual question is: is there an efficient way to compute this for a large initial dataset df, with many thousands of rows? The method I used seems quadratic in complexity. I've also tried df.query(), but it's 66% slower than the Pythonic filter and does not change the complexity.
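The query() variant I timed was along these lines (a sketch; engine='python' is my addition here, since .isnull() isn't supported by the default numexpr engine):

# Sketch of the df.query() variant: same loop, different filter syntax
count_by_day = pd.DataFrame(
    [len(df.query('sd <= @i and (ed.isnull() or ed > @i)', engine='python'))
     for i in days.index],
    index=days.index)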

I tried to search the Pandas docs for examples but I seem to be using the wrong keywords. Any ideas?

Upvotes: 2

Views: 2737

Answers (2)

mike921

Reputation: 241

OK, I'm answering my own question after quite a bit of research, fiddling, and trying things out. I may still be missing an obvious solution, but maybe this helps.

The fastest solution I could find to date is (thanks Alex for some nice code patterns):

# Start with test data from question
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'),
                   ('20170901', None), ...], columns=['sd', 'ed'])

# Convert to datetime columns
df['sd'] = pd.to_datetime(df['sd'])
df['ed'] = pd.to_datetime(df['ed'])
df['ed'] = df['ed'].fillna(df['sd'].max())

# Note: In my real data I have timestamps - I convert them like this:
#df['sd'] = pd.to_datetime(df['start_date'], unit='s').apply(lambda x: x.date())

# Set and sort multi-index to enable slices
df = df.set_index(['sd', 'ed'], drop=False)
df.sort_index(inplace=True)

# Compute the active counts by day in range
di = pd.date_range(start=df.sd.min(), end=df.sd.max(), freq='D')
count_by_day = di.to_series().apply(lambda i: len(df.loc[
           (slice(None, i.date()), slice(i.date(), None)), :]))
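The way I understand it, the sorted (sd, ed) multi-index is what makes this fast: for each day i, the first slice selects rows with sd <= i and the second rows with ed >= i, and because the index is lexsorted, pandas can locate those slices by binary search instead of filtering the whole frame for every day.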

In my real dataset (>10K rows in df and a date range of about a year), this was twice as fast as the code in the question, running in about 1.5 s.

Here are some lessons I learned:

  • Creating a Series with counters for the date range and iterating through the dataset df with df.apply or df.itertuples, incrementing the counters, was much slower (see the sketch after this list). Curiously, apply was slower than itertuples. Don't even think of iterrows.
  • My dataset had a product_id in each row, so filtering the dataset per product and running the calculation on each filtered result was twice as fast as adding product_id to the multi-index and slicing on that level as well.
  • Building an intermediate Series of active days (from iterating through each row in df and adding each date in the active range to the Series) and then grouping by date was much slower.
  • Running the code in the question on a df with a multi-index did not change the performance.
  • Running the code in the question on a df with a limited set of columns (my real dataset has 22 columns) did not change the performance.
  • I looked at pd.crosstab and pd.Period, but I was not able to get anything to work.
  • Pandas is pretty awesome and trying to outsmart it is really hard (especially non-vectorized, in pure Python).
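For reference, the counter-based variant from the first bullet looked roughly like this (a sketch from memory, reusing the di index and the prepared df from above):

# Sketch of the slower counter approach: one counter per day in the range,
# incremented by iterating over the subscriptions with itertuples.
counters = pd.Series(0, index=di)
for row in df.itertuples(index=False):
    # Label-based slicing is inclusive on both ends, so each subscription
    # bumps every day from its start through its (filled) end date.
    counters.loc[row.sd:row.ed] += 1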

Upvotes: 2

Alex

Reputation: 579

It's an interesting problem; here's how I would do it. I'm not sure about the performance.

EDIT: My first answer was incorrect; I didn't read the question fully.

# Initial data, columns as Timestamps
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'), ('20170901', None)], columns=["sd", "ed"])
df['sd'] = pd.to_datetime(df.sd)
df['ed'] = pd.to_datetime(df.ed)

# Range input and related index
beg = pd.Timestamp('2017-05-15')
end = pd.Timestamp('2017-09-15')
idx = pd.date_range(start=beg, end=end, freq='D')

# We filter out records entirely outside the range, then clip the
# subscription start/end dates to the range bounds.
fdf = df[(df.sd <= end) & ((df.ed >= beg) | (pd.isnull(df.ed)))].copy()
fdf['ed'] = fdf['ed'].fillna(end)
fdf['ps'] = fdf.sd.clip(lower=beg)
fdf['pe'] = fdf.ed.clip(upper=end)

# We run a conditional count
idx.to_series().apply(lambda x: len(fdf[(fdf.ps <= x) & (fdf.pe >= x)]))
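On the sample data this yields 2 active subscriptions per day from 2017-05-15 through 2017-06-13, 1 per day through 2017-08-31, and 2 again from 2017-09-01 through 2017-09-15.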

Upvotes: 2
