Reputation: 125
I am trying to create a new column in Pandas dataframe. If the other two date columns in my dataframe share the same month, then this new column should have 1 as a value, otherwise 0. Also, I need to check that ids match my other list of ids that I have saved previously in another place and mark those only with 1. I have some code but it is useless since I am dealing with almost a billion of rows.
my_list_of_ids = df[df.bool_column == 1].id.values
def my_func(date1, date2):
for id_ in df.id:
if id_ in my_list_of_ids:
if date1.month == date2.month:
my_var = 1
else:
my_var = 0
else:
my_var = 0
return my_var
df["new_column"] = df.progress_apply(lambda x: my_func(x['date1'], x['date2']), axis=1)
Been waiting for 30 minutes and still 0%. Any help is appreciated.
UPDATE (adding an example):
id | date1 | date2 | bool_column | new_column |
id1 2019-02-13 2019-04-11 1 0
id1 2019-03-15 2019-04-11 0 0
id1 2019-04-23 2019-04-11 0 1
id2 2019-08-22 2019-08-11 1 1
id2 ....
id3 2019-09-01 2019-09-30 1 1
.
.
.
What I need to do is save the ids that are 1 in my bool_column, then I am looping through all of the ids in my dataframe and checking if they are in the previously created list (= 1). Then I want to compare month and the year of date1 and date2 columns and if they are the same, create a new_column with a value 1 where they mach, otherwise, 0.
Upvotes: 1
Views: 62
Reputation: 10833
The best way to deal with the month match is to use vectorization in pandas and do this:
new_column = (df.date1.dt.month == df.date2.dt.month).astype(int)
That is, avoid using apply()
over the DataFrame
(which will probably be iterative) and take advantage of the underlying numpy
vectorization. The gateway to such functionality is almost always in families of Series
functions and properties, like the dt
family for dates, str
family for strings, and so forth.
Luckily, you have pre-computed the id_list membership in your bool_column
, so to add membership as a criterion, just do this:
new_column = ((df.date1.dt.month == df.date2.dt.month) & df.bool_column).astype(int)
Once again, the &
of two Series
takes advantage of vectorization. You stay inside boolean space till the end, then cast to int
with astype(int)
. Reviewing your code, it occurs to me that the iterative checking of your id_list may be the real performance hit here, even more so than the DataFrame.apply()
. Whatever you do, avoid at all costs iterating your id_list at each row, since you already have a vector denoting membership in your bool_column
.
By the way I believe there's a tiny error in your example data, the new_column
value for your third row should be 0
, since your bool_column
value there is 0
.
Upvotes: 1
Reputation:
The pandas way to do this is
mask = ((df['date1'].month == df['date2'].month) & (df['id'].isin(my_list_of_ids)))
df['new_column'] = mask.replace({False: 0, True: 1})
Since you have a large data-set, this will take time, but should be faster than using apply
Upvotes: 1