sectechguy
sectechguy

Reputation: 2117

Join Dataframes on columns with similar but not exact values

I have two pandas dataframes that I want to join on employee_id. One is Employee_Logs the other is HR_Data.

Employee_Logs_df

employee_id  action
2325255b     login     
51666164     login
51666164v    login
r1211        logoff
r18552421    login

HR_Data_df

employee_id  name
2325255      Rob    
51666164     Tom
r1211        Tammy
r18552421    Ron

I want to join them so that the data looks like this:

New_df

employee_id  action  name
2325255b     login   Rob  
51666164     login   Tom
51666164v    login   Tom
r1211        logoff  Tammy
r18552421    login   Ron

I could do an easy join if the employee_id field matched up on both tables, but the same user can have a "b" or a "v" after their employee id to specify if the account is elevated like an admin account. Some user accounts have an "r" in front of the id but that is the case in both tables.

In SQL this gives me the desired results:

select el*,
    coalesce(h.name, hv.name, hb.name) as name
from employee_logs el left join
    hr_data h
    on el.employee_id = h.employee_id left join
    hr_data hv
    on el.employee_id = concat(h.employee_id, 'v') left join
    hr_data hb
    on el.employee_id = concat(h.employee_id, 'b');

Is there a good way in Python where I can do some where actions and create a new df?

Upvotes: 2

Views: 386

Answers (1)

sacuL
sacuL

Reputation: 51395

You can strip the trailing letter from employee_id in Employee_Logs_df using a regex, then map that to your names from HR_Data_df:

new_df = Employee_Logs_df.assign(name = Employee_Logs_df.employee_id
                                 .str.extract('(.*\d+)',expand=False)
                                 .map(HR_Data_df.set_index('employee_id')['name']))

>>> new_df
  employee_id  action   name
0    2325255b   login    Rob
1    51666164   login    Tom
2   51666164v   login    Tom
3       r1211  logoff  Tammy
4   r18552421   login    Ron

You can do something very similar with a merge if you wanted:

new_df = Employee_Logs_df.merge(HR_Data_df.set_index('employee_id'),
                                left_on=Employee_Logs_df.employee_id
                                .str.extract('(.*\d+)',expand=False),
                                right_index=True)

>>> new_df
  employee_id  action   name
0    2325255b   login    Rob
1    51666164   login    Tom
2   51666164v   login    Tom
3       r1211  logoff  Tammy
4   r18552421   login    Ron

Upvotes: 3

Related Questions