Reputation: 767
I have a dataframe with millions of rows of ad ids scraped from a website on consecutive days, in the format:
ad-id            Date scraped
SSE-AD-3469148   15-Apr
OAG-AD-12583686  15-Apr
OAG-AD-10476872  15-Apr
SSE-AD-3037758   15-Apr
OAG-AD-12583686  16-Apr
OAG-AD-10476872  16-Apr
SSE-AD-3037758   16-Apr
OAG-AD-10476872  17-Apr
SSE-AD-3037758   17-Apr
I want to output unique ad ids only, however I need to know the start and end scrape date for each ad id, like so:
ad-id            First scrape  Last Scrape
SSE-AD-3469148   15-Apr        15-Apr
OAG-AD-12583686  15-Apr        16-Apr
OAG-AD-10476872  15-Apr        17-Apr
SSE-AD-3037758   15-Apr        17-Apr
What is the most time-efficient way of producing this using Pandas? I can only think of ways involving loops, which for a few million rows will be very slow.
Upvotes: 2
Views: 160
Reputation: 109626
I can only hope you have real dates as opposed to "17-Apr" as a text value.
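If the column did come through as text, it can be converted up front with pd.to_datetime. A minimal sketch, assuming the strings look like "15-Apr" and all fall in a single known year (2016 is an assumption here, since the year is missing from the data):

```python
import pandas as pd

# Hypothetical raw column in the question's "15-Apr" text form.
s = pd.Series(['15-Apr', '16-Apr', '17-Apr'])

# The year is absent, so append one explicitly before parsing;
# %d-%b-%Y matches e.g. "15-Apr-2016".
dates = pd.to_datetime(s + '-2016', format='%d-%b-%Y')
print(dates)
```

With real datetimes, sorting and min/max comparisons behave correctly across month boundaries, which string values like "15-Apr" would not.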
It would probably be most efficient to first sort by date, then perform a groupby on ad-id, taking the first and last values. You can slightly improve the speed by electing not to sort the groupby results, e.g. df.groupby('ad-id', sort=False).
df.sort_values('Date scraped', inplace=True)
>>> (df.groupby('ad-id')['Date scraped']
       .agg(['first', 'last'])
       .rename(columns={'first': 'First Scrape', 'last': 'Last Scrape'}))
                First Scrape Last Scrape
ad-id
OAG-AD-10476872       15-Apr      17-Apr
OAG-AD-12583686       15-Apr      16-Apr
SSE-AD-3037758        15-Apr      17-Apr
SSE-AD-3469148        15-Apr      15-Apr
Timings with 1 million records
np.random.seed(0)
ad_id = ['SSE-' + str(i) for i in np.random.randint(1, 501, 1000000)]
years = np.random.randint(2000, 2016, 1000000)
months = np.random.randint(1, 13, 1000000)
days = np.random.randint(1, 29, 1000000)
ts = pd.to_datetime(['{0}-{1}-{2}'.format(year, month, day)
                     for year, month, day in zip(years, months, days)])
df = pd.DataFrame({'ad-id': ad_id, 'Date scraped': ts})
%%timeit -n 10
df.sort_values('Date scraped', inplace=True)
df.groupby('ad-id')['Date scraped'].agg(['first', 'last'])
10 loops, best of 3: 277 ms per loop
>>> (df.groupby('ad-id')['Date scraped']
       .agg(['first', 'last'])
       .rename(columns={'first': 'First Scrape', 'last': 'Last Scrape'})
       .head())
        First Scrape Last Scrape
ad-id
SSE-1     2000-01-02  2015-12-28
SSE-10    2000-01-01  2015-12-25
SSE-100   2000-01-01  2015-12-25
SSE-101   2000-01-05  2015-12-26
SSE-102   2000-01-01  2015-12-28
# Slightly faster if you don't sort the results.
%%timeit -n 10
df.sort_values('Date scraped', inplace=True)
df.groupby('ad-id', sort=False)['Date scraped'].agg(['first', 'last'])
10 loops, best of 3: 268 ms per loop
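An alternative worth noting: aggregating with min and max skips the up-front sort entirely, since neither depends on row order. A minimal sketch using the sample data from the question (dates parsed as real datetimes, with 2016 assumed as the year):

```python
import pandas as pd

# Sample data from the question, in scrape order.
df = pd.DataFrame({
    'ad-id': ['SSE-AD-3469148', 'OAG-AD-12583686', 'OAG-AD-10476872',
              'SSE-AD-3037758', 'OAG-AD-12583686', 'OAG-AD-10476872',
              'SSE-AD-3037758', 'OAG-AD-10476872', 'SSE-AD-3037758'],
    'Date scraped': pd.to_datetime(['2016-04-15'] * 4 + ['2016-04-16'] * 3
                                   + ['2016-04-17'] * 2),
})

# min/max per group needs no pre-sorting, unlike first/last.
out = (df.groupby('ad-id')['Date scraped']
         .agg(['min', 'max'])
         .rename(columns={'min': 'First Scrape', 'max': 'Last Scrape'}))
print(out)
```

Whether this beats sort + first/last at a few million rows is worth timing on your own data; it has the advantage of being correct even if the input arrives unsorted.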
Upvotes: 1