Reputation: 173
I have a pretty large dataframe (20M rows) containing workers. Some of the workers appear multiple times, given they might have previously worked in other firms. For each worker, I have information about first, last name (columns: first_name, last_name) and the starting date of each job (started_working_date). I'd like to create a new column in the dataframe (past_experience) that flags workers that were previously employed (in any firm).
In order to do so, I have to check whether, for each worker in my dataframe, there is a row (in the same dataframe) with the same first and last name and a earlier starting date than the one considered.
The dataframe looks like this:
first_name last_name started_working_date
0 Bob J 1995-01-01
1 John S 2000-01-01
1 Mark L 2001-01-01
1 Bob J 1997-01-01
The ideal result should be:
first_name last_name started_working_date past_experience
0 Bob J 1995-01-01 0
1 John S 2000-01-01 0
1 Mark L 2001-01-01 0
1 Bob J 1997-01-01 1
I tried to write a very simple apply function that filters the dataframe according to my conditions and returns 0 or 1 depending on the length of the resulting dataframe.
def past_experience(row):
filtered_df = my_df[(my_df['first_name'] == row['first_name']) & (my_df['last_name'] == row['last_name']) & (my_df['started_working_date'] < row['started_working_date'])]
if filtered_df.shape[0]>0:
return 1
else:
return 0
my_df['past_experience'] = my_df.apply(past_experience, axis=1)
This works, but it is extremely inefficient. Could you suggest a better solution?
Upvotes: 1
Views: 295
Reputation: 150745
Try:
groups = df.groupby(['first_name','last_name'])
df['employed'] = groups.started_working_date.cumcount()
df['employed_shift'] = groups.employed.shift().fillna(0)
df['employed_changed'] = (df['employed'] != df['employed_shift'])
And output:
+---+------------+-----------+----------------------+----------+----------------+
| | first_name | last_name | started_working_date | employed | employ_changed |
+---+------------+-----------+----------------------+----------+----------------+
| 0 | Bob | J | 1995-01-01 | 0 | False |
| 1 | John | S | 2000-01-01 | 0 | False |
| 1 | Mark | L | 2001-01-01 | 0 | False |
| 1 | Bob | J | 1997-01-01 | 1 | True |
+---+------------+-----------+----------------------+----------+----------------+
Upvotes: 1
Reputation: 323236
I am using numpy
broadcast , notice this method still a o(n*n) check , which mean if you data frame is big, memory will exceed.
s1=my_df['first_name'].values
s2=my_df['last_name'].values
s3=my_df['started_working_date'].values
np.any((s1==s1[:,None])&(s2==s2[:,None])&(s3<s3[:,None]),1)
Upvotes: 2