Reputation: 739
I have a pandas DataFrame with shape 12,000,000 x 2 (rows x columns). I need to apply a map function, but it is taking a very long time even though all it does is compare every date in column 1 to a given date, for example, today.
Example of the DataFrame
╔════════════╦══════════╗
║ Col1       ║ Col2     ║
╠════════════╬══════════╣
║ 2019-03-19 ║ 1        ║
║ 2019-03-20 ║ 2        ║
║ 2019-05-15 ║ 3        ║
║ 2019-07-15 ║ 4        ║
║ ...        ║          ║
║ 2019-10-20 ║ 12000000 ║
╚════════════╩══════════╝
Example of the code
import pandas as pd
from datetime import datetime
df = pd.read_csv('path_of_file.csv')
today = datetime.now()
df['output'] = df['Col1'].apply(lambda x: 1 if x > today else 0)
Am I missing something? Could it be improved? Thank you!
Upvotes: 1
Views: 881
Reputation: 3301
wwii's solution is the clear winner out of the OP's and mine.
It runs 2x faster than my own:
df['output'] = 1 * (df['Col1'] > today)
It's a pretty neat one too: the comparison yields a boolean Series, and multiplying it by 1 casts each True/False to 1/0, giving you the truth value of comparing the date column with today's date as an integer.
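A minimal, self-contained sketch of the trick (the dates and the `today` value below are made up for illustration):

```python
import pandas as pd

# Comparing a datetime column to a scalar yields a boolean Series;
# multiplying by 1 casts each True/False to 1/0.
df = pd.DataFrame({"Col1": pd.to_datetime(["2019-03-19", "2030-01-01"])})
today = pd.Timestamp("2020-02-12")

mask = df["Col1"] > today        # boolean Series: [False, True]
df["output"] = 1 * mask          # integer Series: [0, 1]

# An equivalent, arguably more explicit spelling:
df["output_alt"] = mask.astype(int)
```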
This was a really interesting question, so I ran some tests on my end.
I created a dataframe with roughly one million rows of dates.
import pandas as pd
from datetime import datetime, timedelta

starting_date = datetime(200, 1, 1)
end_date = datetime(3000, 1, 1)

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

date_values = [_date for _date in daterange(starting_date, end_date)]
date_col = {'Col1': date_values}
df = pd.DataFrame(date_col)
We're going into the future boys.
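For what it's worth, a similar test frame can be built without the Python generator loop via `pd.date_range`. Note that pandas' default nanosecond Timestamps only span roughly 1677 through 2262, so the window below is narrower than the year-200-to-3000 range above (which produces plain Python datetimes in an object-dtype column):

```python
import pandas as pd

# Vectorized construction of ~200k daily timestamps; the narrower date
# window keeps everything inside pandas' Timestamp bounds.
date_values = pd.date_range(start="1700-01-01", end="2262-01-01", freq="D")
df = pd.DataFrame({"Col1": date_values})
```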
Now, the two tests I ran compared the run time of the solution the OP provided against the solution I posted below.
import time
from datetime import datetime

today = datetime.now()

# Solution 1: row-by-row apply
start_time = time.time()
df['output'] = df['Col1'].apply(lambda x: 1 if x > today else 0)
print("--- %s seconds ---" % (time.time() - start_time))

# Solution 2: constant column plus boolean-mask assignment
start_time = time.time()
df['output'] = 1
df.loc[df['Col1'] < today, 'output'] = 0
print("--- %s seconds ---" % (time.time() - start_time))
After running each function 10 times, the second solution won every time. Why? `apply` has to invoke a Python lambda once for every row, while the boolean-mask assignment in the second solution hands the whole column to pandas in a couple of bulk operations, so there is no per-row Python function-call overhead.
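The contrast can be seen on a toy frame (dates below are invented): `apply` runs the lambda once per row, while the whole-column comparison avoids per-row Python calls and produces the same flags.

```python
import pandas as pd

today = pd.Timestamp.now()
df = pd.DataFrame({"Col1": pd.to_datetime(["2019-03-19", "2030-01-01", "2019-07-15"])})

# Row-by-row: a Python lambda is invoked once per row.
df["out_apply"] = df["Col1"].apply(lambda x: 1 if x > today else 0)

# Whole-column comparison: one bulk operation over the column.
df["out_vec"] = (df["Col1"] > today).astype(int)

# Both columns hold identical 1/0 flags.
```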
Solution 1
--- 0.36346006393432617 seconds ---
Solution 2
--- 0.13942289352416992 seconds ---
Solution 1
--- 0.4605379104614258 seconds ---
Solution 2
--- 0.12388873100280762 seconds ---
Solution 1
--- 0.34688305854797363 seconds ---
Solution 2
--- 0.0912778377532959 seconds ---
Solution 1
--- 0.2879600524902344 seconds ---
Solution 2
--- 0.08435988426208496 seconds ---
Solution 1
--- 0.3161609172821045 seconds ---
Solution 2
--- 0.0965569019317627 seconds ---
Solution 1
--- 0.31951212882995605 seconds ---
Solution 2
--- 0.08857107162475586 seconds ---
Solution 1
--- 0.2996959686279297 seconds ---
Solution 2
--- 0.16647815704345703 seconds ---
Solution 1
--- 0.5074219703674316 seconds ---
Solution 2
--- 0.13281011581420898 seconds ---
Solution 1
--- 0.3716299533843994 seconds ---
Solution 2
--- 0.0970299243927002 seconds ---
Solution 1
--- 0.29851794242858887 seconds ---
Solution 2
--- 0.08089780807495117 seconds ---
Something to consider: the dates in both tests are in order. What happens if you receive them in completely random order?
We first randomize the dataset:
df = df.sample(frac=1)
Then run the exact same tests.
Solution 1
--- 0.6548967361450195 seconds ---
Solution 2
--- 0.22769808769226074 seconds ---
Solution 1
--- 0.7096188068389893 seconds ---
Solution 2
--- 0.28220510482788086 seconds ---
Solution 1
--- 0.7588798999786377 seconds ---
Solution 2
--- 0.25870585441589355 seconds ---
Solution 1
--- 0.6285257339477539 seconds ---
Solution 2
--- 0.3373727798461914 seconds ---
Solution 1
--- 0.7623891830444336 seconds ---
Solution 2
--- 0.18880391120910645 seconds ---
Solution 1
--- 0.5125689506530762 seconds ---
Solution 2
--- 0.23384499549865723 seconds ---
Solution 1
--- 0.6188468933105469 seconds ---
Solution 2
--- 0.25000977516174316 seconds ---
Solution 1
--- 0.6692302227020264 seconds ---
Solution 2
--- 0.5207180976867676 seconds ---
Solution 1
--- 1.2534172534942627 seconds ---
Solution 2
--- 0.2665679454803467 seconds ---
Solution 1
--- 0.6374101638793945 seconds ---
Solution 2
--- 0.2108619213104248 seconds ---
Since all you're doing is checking whether each date is less than today's date, you can create a new column and fill it with a constant of either 1 or 0.
Let's first fill the column with the constant.
df['Output'] = 1
Now, all we have to do is find the point where the date is less than the current date.
First, though, we should convert Col1 to a datetime dtype so the comparisons behave properly.
df['Col1'] = pd.to_datetime(df['Col1'], format="%Y-%m-%d")
Then we select every row whose date is earlier than today and set its output to 0.
df.loc[df['Col1'] < today.date(), 'Output'] = 0
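Putting the three steps together, a runnable sketch (the sample dates are invented):

```python
import pandas as pd
from datetime import datetime

today = datetime.now()
df = pd.DataFrame({"Col1": ["2019-03-19", "2030-01-01", "2019-07-15"]})

# Parse the strings once, default everything to 1, then zero out past dates.
df["Col1"] = pd.to_datetime(df["Col1"], format="%Y-%m-%d")
df["Output"] = 1
df.loc[df["Col1"] < today, "Output"] = 0
```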
Upvotes: 2
Reputation: 2702
While we're still awaiting some more information on the problem, here is what I have so far:
import pandas as pd
df = pd.DataFrame(
    data={
        "col_1": ["2019-03-19", "2019-03-20", "2030-01-01", "2019-05-15", "2019-07-15"],
        "col_2": [1, 2, 3, 4, 5],
    }
)
df["col_1"] = pd.to_datetime(df["col_1"], infer_datetime_format=True, utc=True)
print(df, end='\n\n')
curr_time = pd.Timestamp.utcnow()
print(curr_time, end='\n\n')
df["col_3"] = df["col_1"] > curr_time
print(df)
Output:
                      col_1  col_2
0 2019-03-19 00:00:00+00:00      1
1 2019-03-20 00:00:00+00:00      2
2 2030-01-01 00:00:00+00:00      3
3 2019-05-15 00:00:00+00:00      4
4 2019-07-15 00:00:00+00:00      5

2020-02-12 02:11:37.212849+00:00

                      col_1  col_2  col_3
0 2019-03-19 00:00:00+00:00      1  False
1 2019-03-20 00:00:00+00:00      2  False
2 2030-01-01 00:00:00+00:00      3   True
3 2019-05-15 00:00:00+00:00      4  False
4 2019-07-15 00:00:00+00:00      5  False
Upvotes: 1