Reputation: 241
I have a Pandas DataFrame of subscriptions, each with a start datetime (timestamp) and an optional end datetime (if they were canceled).
For simplicity, I have created string columns for the date (e.g. "20170901") based on start and end datetimes (timestamps). It looks like this:
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'), ('20170901', None), ...], columns=["sd", "ed"])
The end result should be a time series of how many subscriptions were active on any given date in a range.
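For instance, with just the three rows shown in the sample above, the end result I'm after would look roughly like this (one value per calendar day; a subscription counts as active from its start date up to, but not including, its end date, which is what my filter below does):
2017-05-11    1
2017-05-12    1
2017-05-13    1
2017-05-14    2
...
2017-06-12    2
2017-06-13    1
...
2017-09-01    2
...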
To that end, I created an Index for all the days within a range:
days = df.groupby(["sd"])["sd"].count()
I am able to compute what I am interested in with a loop, each iteration executing a query over the entire DataFrame df:
count_by_day = pd.DataFrame(
    [len(df.loc[(df.sd <= i) & (df.ed.isnull() | (df.ed > i))])
     for i in days.index],
    index=days.index)
Note that I have values for each day in the original dataset, so there are no gaps. I'm sure getting the date range can be improved.
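For example, I imagine something like this would build the full day range more directly (just a sketch; it yields a DatetimeIndex, so the comparisons above would then need real datetime columns instead of date strings):
# Sketch: build the full range of days directly instead of via groupby
days_idx = pd.date_range(df.sd.min(), df.sd.max(), freq='D')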
The actual question is: is there an efficient way to compute this for a large initial dataset df with several thousand rows? The method I used appears to be quadratic in complexity. I also tried df.query(), but it was 66% slower than the boolean-mask filter above and does not change the complexity.
I tried searching the Pandas docs for examples, but I seem to be using the wrong keywords. Any ideas?
Upvotes: 2
Views: 2737
Reputation: 241
OK, I'm answering my own question after quite a bit of research, fiddling, and trying things out. I may still be missing an obvious solution, but maybe this helps.
The fastest solution I could find to date is (thanks Alex for some nice code patterns):
# Start with test data from question
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'),
                   ('20170901', None), ...], columns=['sd', 'ed'])
# Convert to datetime columns
df['sd'] = pd.DatetimeIndex(df['sd'])
df['ed'] = pd.DatetimeIndex(df['ed'])
df.ed.fillna(df.sd.max(), inplace=True)
# Note: In my real data I have timestamps - I convert them like this:
#df['sd'] = pd.to_datetime(df['start_date'], unit='s').apply(lambda x: x.date())
# Set and sort multi-index to enable slices
df = df.set_index(['sd', 'ed'], drop=False)
df.sort_index(inplace=True)
# Compute the active counts by day in range
di = pd.date_range(start=df.sd.min(), end=df.sd.max(), freq='D')
count_by_day = di.to_series().apply(
    lambda i: len(df.loc[(slice(None, i.date()), slice(i.date(), None)), :]))
In my real dataset (with >10K rows in df and a date range of about a year), this was about twice as fast as the code in the question, at roughly 1.5 s.
Here are some lessons I learned:
- Iterating over df with df.apply or df.itertuples and incrementing counters was much slower. Curiously, apply was slower than itertuples. Don't even think of iterrows. (A sketch of this counter pattern follows the list.)
- Iterating over df and adding each date in the active range to a Series, then grouping by date, was much slower.
- Querying df with a multi-index did not change the performance.
- Querying df with a limited set of columns (my real dataset has 22 columns) did not change the performance.
- I also looked at pd.crosstab and pd.Period, but I was not able to get anything to work.
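Roughly the counter-increment pattern from the first point above (a sketch, not the exact code I benchmarked):
# Sketch of the much slower itertuples + counter approach, for reference only
from collections import Counter
counts = Counter()
fill_end = df.sd.max()  # same fill-in for open-ended subscriptions as above
for row in df.itertuples(index=False):
    sub_end = row.ed if pd.notnull(row.ed) else fill_end
    for day in pd.date_range(row.sd, sub_end, freq='D'):
        counts[day] += 1
count_by_day_slow = pd.Series(counts).sort_index()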
Upvotes: 2
Reputation: 579
It's an interesting problem; here's how I would do it. Not sure about performance.
EDIT: My first answer was incorrect; I hadn't fully read the question.
# Initial data, columns as Timestamps
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'), ('20170901', None)], columns=["sd", "ed"])
df['sd'] = pd.DatetimeIndex(df.sd)
df['ed'] = pd.DatetimeIndex(df.ed)
# Range input and related index
beg = pd.Timestamp('2017-05-15')
end = pd.Timestamp('2017-09-15')
idx = pd.date_range(start=beg, end=end, freq='D')
# We drop records that fall entirely outside the range and then clip the
# subscription start/end dates to the range bounds.
fdf = df[(df.sd <= end) & ((df.ed >= beg) | (pd.isnull(df.ed)))].copy()
fdf['ed'] = fdf['ed'].fillna(end)
fdf['ps'] = fdf.sd.apply(lambda x: max(x, beg))
fdf['pe'] = fdf.ed.apply(lambda x: min(x, end))
# We run a conditional count
idx.to_series().apply(lambda x: len(fdf[(fdf.ps<=x) & (fdf.pe >=x)]))
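If you want to keep the result around (e.g. to plot it), you could assign it to a named Series; a small sketch follows (the name is arbitrary). With the three sample rows this should give 2 active subscriptions from 2017-05-15 through 2017-06-13, then 1 until 2017-08-31, then 2 again from 2017-09-01, since the end date counts as active here.
# Sketch: keep the per-day counts as a named Series for further use
active = idx.to_series().apply(lambda x: len(fdf[(fdf.ps <= x) & (fdf.pe >= x)]))
active.name = 'active_subscriptions'
print(active.head())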
Upvotes: 2