Code optimisation - comparing two datetime columns by month and creating a new column too slow

Question

I am trying to create a new column in Pandas dataframe. If the other two date columns in my dataframe share the same month, then this new column should have 1 as a value, otherwise 0. Also, I need to check that ids match my other list of ids that I have saved previously in another place and mark those only with 1. I have some code but it is useless since I am dealing with almost a billion of rows.

my_list_of_ids = df[df.bool_column == 1].id.values

def my_func(date1, date2):
    for id_ in df.id:
        if id_ in my_list_of_ids:
            if date1.month == date2.month:
                my_var = 1
            else:
                my_var = 0
        else:
            my_var = 0
    return my_var

df["new_column"] = df.progress_apply(lambda x: my_func(x['date1'], x['date2']), axis=1)

Been waiting for 30 minutes and still 0%. Any help is appreciated.

UPDATE (adding an example):

id   |    date1   |    date2     | bool_column |  new_column |
id1    2019-02-13    2019-04-11     1                  0
id1    2019-03-15    2019-04-11     0                  0
id1    2019-04-23    2019-04-11     0                  1 
id2    2019-08-22    2019-08-11     1                  1
id2      ....
id3    2019-09-01    2019-09-30     1                  1
.
.
.

What I need to do is save the ids that are 1 in my bool_column, then I am looping through all of the ids in my dataframe and checking if they are in the previously created list (= 1). Then I want to compare month and the year of date1 and date2 columns and if they are the same, create a new_column with a value 1 where they mach, otherwise, 0.

user7870824 · Accepted Answer

The pandas way to do this is

mask = ((df['date1'].month == df['date2'].month) & (df['id'].isin(my_list_of_ids)))
df['new_column'] = mask.replace({False: 0, True: 1})

Since you have a large data-set, this will take time, but should be faster than using apply

Code optimisation - comparing two datetime columns by month and creating a new column too slow

Answers (2)

Related Questions