Reputation: 739
I have a pandas DataFrame with shape 12,000,000 x 2 (rows x columns). I need to apply a map function, but it is taking a very long time even though all it does is compare every date in column 1 to a given date, for example, today.
Example of the DataFrame
╔════════════╦══════════╗
║ Col1       ║ Col2     ║
╠════════════╬══════════╣
║ 2019-03-19 ║ 1        ║
║ 2019-03-20 ║ 2        ║
║ 2019-05-15 ║ 3        ║
║ 2019-07-15 ║ 4        ║
║ ...        ║          ║
║ 2019-10-20 ║ 12000000 ║
╚════════════╩══════════╝
Example of the code
import pandas as pd
from datetime import datetime
df = pd.read_csv('path_of_file.csv')
today = datetime.now()
df['output'] = df['Col1'].apply(lambda x: 1 if x > today else 0)
Am I missing something? Could it be improved? Thank you!
Upvotes: 1
Views: 881
Reputation: 3301
wwii's solution is the clear winner out of the OP's and mine.
It runs 2x faster than my own:
df['output'] = 1 * (df['Col1'] > today)
It's a pretty neat one too: the comparison yields a boolean Series, and multiplying it by 1 casts each True/False to 1/0, giving you the truth value of comparing the date column with today's date as an integer.
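A minimal, self-contained sketch of the trick (the dates and the `today` value below are made up for illustration):

```python
import pandas as pd

# Comparing a datetime column to a scalar yields a boolean Series;
# multiplying by 1 casts each True/False to 1/0.
df = pd.DataFrame({"Col1": pd.to_datetime(["2019-03-19", "2030-01-01"])})
today = pd.Timestamp("2020-02-12")

mask = df["Col1"] > today        # boolean Series: [False, True]
df["output"] = 1 * mask          # integer Series: [0, 1]

# An equivalent, arguably more explicit spelling:
df["output_alt"] = mask.astype(int)
```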
This was a really interesting question, so I ran some tests on my end.
I created a dataframe with roughly one million rows of dates.
import pandas as pd
from datetime import datetime, timedelta

starting_date = datetime(200, 1, 1)
end_date = datetime(3000, 1, 1)

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

date_values = [_date for _date in daterange(starting_date, end_date)]
date_col = {'Col1': date_values}
df = pd.DataFrame(date_col)
We're going into the future boys.
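For what it's worth, a similar test frame can be built without the Python generator loop via `pd.date_range`. Note that pandas' default nanosecond Timestamps only span roughly 1677 through 2262, so the window below is narrower than the year-200-to-3000 range above (which produces plain Python datetimes in an object-dtype column):

```python
import pandas as pd

# Vectorized construction of ~200k daily timestamps; the narrower date
# window keeps everything inside pandas' Timestamp bounds.
date_values = pd.date_range(start="1700-01-01", end="2262-01-01", freq="D")
df = pd.DataFrame({"Col1": date_values})
```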
Now, the two tests I ran compared the run time of the solution the OP provided against the solution I posted below.
import time
from datetime import datetime

today = datetime.now()

# Solution 1: row-by-row apply
start_time = time.time()
df['output'] = df['Col1'].apply(lambda x: 1 if x > today else 0)
print("--- %s seconds ---" % (time.time() - start_time))

# Solution 2: constant column plus boolean-mask assignment
start_time = time.time()
df['output'] = 1
df.loc[df['Col1'] < today, 'output'] = 0
print("--- %s seconds ---" % (time.time() - start_time))
After running each function 10 times, the second solution won every time. Why? `apply` has to invoke a Python lambda once for every row, while the boolean-mask assignment in the second solution hands the whole column to pandas in a couple of bulk operations, so there is no per-row Python function-call overhead.
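The contrast can be seen on a toy frame (dates below are invented): `apply` runs the lambda once per row, while the whole-column comparison avoids per-row Python calls and produces the same flags.

```python
import pandas as pd

today = pd.Timestamp.now()
df = pd.DataFrame({"Col1": pd.to_datetime(["2019-03-19", "2030-01-01", "2019-07-15"])})

# Row-by-row: a Python lambda is invoked once per row.
df["out_apply"] = df["Col1"].apply(lambda x: 1 if x > today else 0)

# Whole-column comparison: one bulk operation over the column.
df["out_vec"] = (df["Col1"] > today).astype(int)

# Both columns hold identical 1/0 flags.
```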
Solution 1
--- 0.36346006393432617 seconds ---
Solution 2
--- 0.13942289352416992 seconds ---
Solution 1
--- 0.4605379104614258 seconds ---
Solution 2
--- 0.12388873100280762 seconds ---
Solution 1
--- 0.34688305854797363 seconds ---
Solution 2
--- 0.0912778377532959 seconds ---
Solution 1
--- 0.2879600524902344 seconds ---
Solution 2
--- 0.08435988426208496 seconds ---
Solution 1
--- 0.3161609172821045 seconds ---
Solution 2
--- 0.0965569019317627 seconds ---
Solution 1
--- 0.31951212882995605 seconds ---
Solution 2
--- 0.08857107162475586 seconds ---
Solution 1
--- 0.2996959686279297 seconds ---
Solution 2
--- 0.16647815704345703 seconds ---
Solution 1
--- 0.5074219703674316 seconds ---
Solution 2
--- 0.13281011581420898 seconds ---
Solution 1
--- 0.3716299533843994 seconds ---
Solution 2
--- 0.0970299243927002 seconds ---
Solution 1
--- 0.29851794242858887 seconds ---
Solution 2
--- 0.08089780807495117 seconds ---
Something to consider: the dates in both tests are in order. What happens if you receive them in completely random order?
We first randomize the dataset:
df = df.sample(frac=1)
Then run the exact same tests.
Solution 1
--- 0.6548967361450195 seconds ---
Solution 2
--- 0.22769808769226074 seconds ---
Solution 1
--- 0.7096188068389893 seconds ---
Solution 2
--- 0.28220510482788086 seconds ---
Solution 1
--- 0.7588798999786377 seconds ---
Solution 2
--- 0.25870585441589355 seconds ---
Solution 1
--- 0.6285257339477539 seconds ---
Solution 2
--- 0.3373727798461914 seconds ---
Solution 1
--- 0.7623891830444336 seconds ---
Solution 2
--- 0.18880391120910645 seconds ---
Solution 1
--- 0.5125689506530762 seconds ---
Solution 2
--- 0.23384499549865723 seconds ---
Solution 1
--- 0.6188468933105469 seconds ---
Solution 2
--- 0.25000977516174316 seconds ---
Solution 1
--- 0.6692302227020264 seconds ---
Solution 2
--- 0.5207180976867676 seconds ---
Solution 1
--- 1.2534172534942627 seconds ---
Solution 2
--- 0.2665679454803467 seconds ---
Solution 1
--- 0.6374101638793945 seconds ---
Solution 2
--- 0.2108619213104248 seconds ---
Since all you're doing is checking whether each date is less than today's date, you can create a new column and fill it with a constant of either 1 or 0.
Let's first fill the column with the constant.
df['Output'] = 1
Now, all we have to do is find the point where the date is less than the current date.
First, though, we should convert Col1 to a datetime dtype so the comparisons behave properly.
df['Col1'] = pd.to_datetime(df['Col1'], format="%Y-%m-%d")
Then we select every row whose date is earlier than today and set its output to 0.
df.loc[df['Col1'] < today.date(), 'Output'] = 0
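Putting the three steps together, a runnable sketch (the sample dates are invented):

```python
import pandas as pd
from datetime import datetime

today = datetime.now()
df = pd.DataFrame({"Col1": ["2019-03-19", "2030-01-01", "2019-07-15"]})

# Parse the strings once, default everything to 1, then zero out past dates.
df["Col1"] = pd.to_datetime(df["Col1"], format="%Y-%m-%d")
df["Output"] = 1
df.loc[df["Col1"] < today, "Output"] = 0
```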
Upvotes: 2
Reputation: 2702
While we're still awaiting some more information on the problem, here is what I have so far:
import pandas as pd
df = pd.DataFrame(
    data={
        "col_1": ["2019-03-19", "2019-03-20", "2030-01-01", "2019-05-15", "2019-07-15"],
        "col_2": [1, 2, 3, 4, 5],
    }
)
df["col_1"] = pd.to_datetime(df["col_1"], infer_datetime_format=True, utc=True)
print(df, end='\n\n')
curr_time = pd.Timestamp.utcnow()
print(curr_time, end='\n\n')
df["col_3"] = df["col_1"] > curr_time
print(df)
Output:
                      col_1  col_2
0 2019-03-19 00:00:00+00:00      1
1 2019-03-20 00:00:00+00:00      2
2 2030-01-01 00:00:00+00:00      3
3 2019-05-15 00:00:00+00:00      4
4 2019-07-15 00:00:00+00:00      5

2020-02-12 02:11:37.212849+00:00

                      col_1  col_2  col_3
0 2019-03-19 00:00:00+00:00      1  False
1 2019-03-20 00:00:00+00:00      2  False
2 2030-01-01 00:00:00+00:00      3   True
3 2019-05-15 00:00:00+00:00      4  False
4 2019-07-15 00:00:00+00:00      5  False
Upvotes: 1