Pythonista anonymous

Reputation: 8950

Python pandas: can I speed up this apply statement?

I have a pandas dataframe with ca. 250,000 rows x 6 columns. One of the columns is a date, formatted as text. I need to do 3 things:

  1. Convert from text to date
  2. Create a date where the month and year are the same as in the converted date, but the day is always the 15th
  3. Calculate the date one month after the one calculated above

I do all of this with apply statements. They work, but seem slow to me: 7 seconds in total, whereas any SQL would take a fraction of a second, even without parallelisation, on the same machine. If this were a one-off I wouldn't invest time in speeding this up, but I must do it multiple times on multiple dataframes of similar size.

Is there any way I can speed up my code? Thanks a lot!

import datetime
import dateutil.relativedelta

# this takes 3.1 seconds
df['date_reformatted'] = df['date_raw'].apply(lambda r: datetime.datetime.strptime(r, "%d/%m/%Y") )

# this takes 0.8 seconds
df['date_15']= df['date_reformatted'].apply(lambda r: datetime.date( r.year, r.month,15 ) ) 

# this takes 3.3 seconds
df['date_next_month']= df['date_15'].apply(lambda x: x + dateutil.relativedelta.relativedelta(months=1) )

Upvotes: 1

Views: 604

Answers (3)

Jeff

Reputation: 128948

In [51]: df = pd.DataFrame({'date_raw': pd.to_datetime(['2000-12-31']*250000)}) 

In [66]: %timeit pd.DataFrame({'date_raw': pd.to_datetime(['2000-12-31']*250000)})
10 loops, best of 3: 47.4 ms per loop

In [52]: df       
Out[52]: 
         date_raw
0      2000-12-31
1      2000-12-31
2      2000-12-31
3      2000-12-31
4      2000-12-31
5      2000-12-31
...           ...
249994 2000-12-31
249995 2000-12-31
249996 2000-12-31
249997 2000-12-31
249998 2000-12-31
249999 2000-12-31

[250000 rows x 1 columns]

In [53]: df['date'] = pd.DatetimeIndex(df.date_raw).to_period('M').to_timestamp('D') + pd.Timedelta('14d')

In [54]: df
Out[54]: 
         date_raw       date
0      2000-12-31 2000-12-15
1      2000-12-31 2000-12-15
2      2000-12-31 2000-12-15
3      2000-12-31 2000-12-15
4      2000-12-31 2000-12-15
5      2000-12-31 2000-12-15
...           ...        ...
249994 2000-12-31 2000-12-15
249995 2000-12-31 2000-12-15
249996 2000-12-31 2000-12-15
249997 2000-12-31 2000-12-15
249998 2000-12-31 2000-12-15
249999 2000-12-31 2000-12-15

[250000 rows x 2 columns]

Timings

In [55]: %timeit pd.DatetimeIndex(df.date_raw).to_period('M').to_timestamp('D') + pd.Timedelta('14d')
10 loops, best of 3: 62.1 ms per loop

This will be a little more compact after this PR is merged; in other words, it could then be written as pd.DatetimeIndex(df.date_raw).to_period('M').to_timestamp('15D')

Your question 3 is easy if you again convert to periods: adding 1 to a period advances it by one unit of its own frequency, a month in this case. This is also vectorized.

In [80]: df['date_plus_1'] = (pd.DatetimeIndex(df.date).to_period('M') + 1).to_timestamp('D') + pd.Timedelta('14d')

In [81]: df
Out[81]: 
         date_raw       date date_plus_1
0      2000-12-31 2000-12-15  2001-01-15
1      2000-12-31 2000-12-15  2001-01-15
2      2000-12-31 2000-12-15  2001-01-15
3      2000-12-31 2000-12-15  2001-01-15
4      2000-12-31 2000-12-15  2001-01-15
5      2000-12-31 2000-12-15  2001-01-15
...           ...        ...         ...
249994 2000-12-31 2000-12-15  2001-01-15
249995 2000-12-31 2000-12-15  2001-01-15
249996 2000-12-31 2000-12-15  2001-01-15
249997 2000-12-31 2000-12-15  2001-01-15
249998 2000-12-31 2000-12-15  2001-01-15
249999 2000-12-31 2000-12-15  2001-01-15

[250000 rows x 3 columns]

In [82]: %timeit (pd.DatetimeIndex(df.date).to_period('M') + 1).to_timestamp('D') + pd.Timedelta('14d')
10 loops, best of 3: 56.7 ms per loop
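
The timings above start from dates that are already parsed. Putting the period approach together with the text parsing from the question might look like the sketch below. The sample rows are made up for illustration, and the column names are borrowed from the question; it uses the Series `.dt` accessor rather than `pd.DatetimeIndex`, which should be equivalent here.

```python
import pandas as pd

# Hypothetical sample data in the question's dd/mm/yyyy text format.
df = pd.DataFrame({'date_raw': ['31/12/2000', '05/02/2001', '15/07/1999']})

# Step 1: parse the whole text column in one vectorized call.
df['date_reformatted'] = pd.to_datetime(df['date_raw'], format='%d/%m/%Y')

# Step 2: truncate to a monthly period, go back to the first day
# of that month, then add 14 days to land on the 15th.
months = df['date_reformatted'].dt.to_period('M')
df['date_15'] = months.dt.to_timestamp() + pd.Timedelta(days=14)

# Step 3: adding 1 to a monthly period moves it one month forward,
# including the December -> January rollover.
df['date_next_month'] = (months + 1).dt.to_timestamp() + pd.Timedelta(days=14)

print(df)
```

Adding a fixed 14 days is safe because every month starts on the 1st, so the result is always the 15th regardless of month length.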

Upvotes: 2

Alexander

Reputation: 109546

Try just using integers and strings. Only convert to datetime objects if you really need them.

%%timeit -n10  df = pd.DataFrame({'date_raw': ['31/12/2000']*250000})
_, months, years = zip(*df.date_raw.str.split('/'))
months_years = [(1 if m == '12' else int(m) + 1, 
                 int(y) + 1 if m == '12' else int(y)) 
                for m, y in zip(months, years)]
# New dates in dd-mm-yyyy format:
df['new_date'] = ['15-{0}-{1}'.format(x[0], x[1]) for x in months_years]

10 loops, best of 3: 583 ms per loop

>>> df.tail()
          date_raw   new_date
249995  31/12/2000  15-1-2001
249996  31/12/2000  15-1-2001
249997  31/12/2000  15-1-2001
249998  31/12/2000  15-1-2001
249999  31/12/2000  15-1-2001

The new dates are in text form (which is why it was fast). Creating datetime objects is a bit time consuming, but if you really need them:

%%timeit
df['new_date'].apply(lambda r: datetime.datetime.strptime(r, "%d-%m-%Y") )

1 loops, best of 3: 2.72 s per loop
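
If datetime objects are needed, the row-by-row `strptime` call can probably be replaced by a single `pd.to_datetime` call with an explicit format, which parses the whole column at once. A minimal sketch, assuming text dates in the dd-mm-yyyy form produced above (the three sample values are invented for illustration, and no timing is claimed):

```python
import pandas as pd

# Hypothetical sample of the text dates built above
# (dd-mm-yyyy, month not zero-padded).
df = pd.DataFrame({'new_date': ['15-1-2001', '15-12-2000', '15-3-1999']})

# Parse the whole column in one call instead of calling strptime per row.
df['new_date_parsed'] = pd.to_datetime(df['new_date'], format='%d-%m-%Y')

print(df)
```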

Upvotes: 2

JoeCondron

Reputation: 8906

Yes, you can do

df['date_formatted'] = pd.to_datetime(df['date_raw'], format="%d/%m/%Y")

The second part is a bit unusual and I can't see how to vectorize it, but you could build both columns in a single loop:

import datetime
# d.month + 1 alone would raise ValueError for December, so roll the year over explicitly
pd.DataFrame([(datetime.date(d.year, d.month, 15), 
               datetime.date(d.year + d.month // 12, d.month % 12 + 1, 15)) for d in df.date_formatted], 
               columns=['date_15', 'date_next_month'])

This might be a bit faster.
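
For what it's worth, the second step may be vectorizable with NumPy's month-resolution datetimes. The following is only a sketch under that assumption: the two sample rows are made up, `month_start` and `fourteen_days` are illustrative names, and it has not been timed against the loop.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data in the question's dd/mm/yyyy text format.
df = pd.DataFrame({'date_raw': ['31/12/2000', '05/02/2001']})
df['date_formatted'] = pd.to_datetime(df['date_raw'], format='%d/%m/%Y')

# Truncate every timestamp to its month (NumPy month resolution).
month_start = df['date_formatted'].to_numpy().astype('datetime64[M]')

fourteen_days = np.timedelta64(14, 'D')

# Casting a month back to day resolution gives the 1st of that month;
# adding 14 days gives the 15th.
df['date_15'] = (
    month_start.astype('datetime64[D]') + fourteen_days
).astype('datetime64[ns]')

# Month arithmetic at month resolution handles the year rollover.
df['date_next_month'] = (
    (month_start + np.timedelta64(1, 'M')).astype('datetime64[D]') + fourteen_days
).astype('datetime64[ns]')

print(df)
```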


Upvotes: 2
