Michael
Michael

Reputation: 125

Code optimisation - comparing two datetime columns by month and creating a new column too slow

I am trying to create a new column in Pandas dataframe. If the other two date columns in my dataframe share the same month, then this new column should have 1 as a value, otherwise 0. Also, I need to check that ids match my other list of ids that I have saved previously in another place and mark those only with 1. I have some code but it is useless since I am dealing with almost a billion of rows.

my_list_of_ids = df[df.bool_column == 1].id.values

def my_func(date1, date2):
    for id_ in df.id:
        if id_ in my_list_of_ids:
            if date1.month == date2.month:
                my_var = 1
            else:
                my_var = 0
        else:
            my_var = 0
    return my_var

df["new_column"] = df.progress_apply(lambda x: my_func(x['date1'], x['date2']), axis=1)

Been waiting for 30 minutes and still 0%. Any help is appreciated.

UPDATE (adding an example):

id   |    date1   |    date2     | bool_column |  new_column |
id1    2019-02-13    2019-04-11     1                  0
id1    2019-03-15    2019-04-11     0                  0
id1    2019-04-23    2019-04-11     0                  1 
id2    2019-08-22    2019-08-11     1                  1
id2      ....
id3    2019-09-01    2019-09-30     1                  1
.
.
.

What I need to do is save the ids that are 1 in my bool_column, then I am looping through all of the ids in my dataframe and checking if they are in the previously created list (= 1). Then I want to compare month and the year of date1 and date2 columns and if they are the same, create a new_column with a value 1 where they mach, otherwise, 0.

Upvotes: 1

Views: 62

Answers (2)

sparc_spread
sparc_spread

Reputation: 10833

The best way to deal with the month match is to use vectorization in pandas and do this:

new_column = (df.date1.dt.month == df.date2.dt.month).astype(int)

That is, avoid using apply() over the DataFrame (which will probably be iterative) and take advantage of the underlying numpy vectorization. The gateway to such functionality is almost always in families of Series functions and properties, like the dt family for dates, str family for strings, and so forth.

Luckily, you have pre-computed the id_list membership in your bool_column, so to add membership as a criterion, just do this:

new_column = ((df.date1.dt.month == df.date2.dt.month) & df.bool_column).astype(int)

Once again, the & of two Series takes advantage of vectorization. You stay inside boolean space till the end, then cast to int with astype(int). Reviewing your code, it occurs to me that the iterative checking of your id_list may be the real performance hit here, even more so than the DataFrame.apply(). Whatever you do, avoid at all costs iterating your id_list at each row, since you already have a vector denoting membership in your bool_column.

By the way I believe there's a tiny error in your example data, the new_column value for your third row should be 0, since your bool_column value there is 0.

Upvotes: 1

user7870824
user7870824

Reputation:

The pandas way to do this is

mask = ((df['date1'].month == df['date2'].month) & (df['id'].isin(my_list_of_ids)))
df['new_column'] = mask.replace({False: 0, True: 1})

Since you have a large data-set, this will take time, but should be faster than using apply

Upvotes: 1

Related Questions