user01
user01

Reputation: 37

Update an existing column in one dataframe based on the value of a column in another dataframe

I have two csv files as my raw data to read into different dataframes. One is called 'employee' and another is called 'origin'. However, I cannot upload the files here so I hardcoded the data into the dataframes below. The task I'm trying to solve is to update the 'Eligible' column in employee_details with 'Yes' or 'No' based on the value of the 'Country' column in origin_details. If Country = UK, then put 'Yes' in the Eligible column for that Personal_ID. Else, put 'No'.

import pandas as pd
import numpy as np

employee = {
        'Personal_ID': ['1000123', '1100258', '1104682', '1020943'],
        'Name': ['Tom', 'Joseph', 'Krish', 'John'],
        'Age': ['40', '35', '43', '51'],
        'Eligible': ' '}

origin = {
        'Personal_ID': ['1000123', '1100258', '1104682', '1020943', '1573482', '1739526'],
        'Country': ['UK', 'USA', 'FRA', 'SWE', 'UK', 'AU']}


employee_details = pd.DataFrame(employee)

origin_details = pd.DataFrame(origin)

employee_details['Eligible'] = np.where((origin_details['Country']) == 'UK', 'Yes', 'No')

print(employee_details)
print(origin_details)

The output of above code shows the below error message:

ValueError: Length of values (6) does not match length of index (4)

However, I am expecting to see the below as my output.

  Personal_ID    Name Age Eligible
0     1000123     Tom  40    Yes     
1     1100258  Joseph  35    No   
2     1104682   Krish  43    No     
3     1020943    John  51    No

I also don't want to delete anything in my dataframes to match the size specified in the ValueError message because I may need the extra Personal_IDs in the origin_details later. Alternatively, I can keep all the existing Personal_ID's in the raw data (employee_details, origin_details) and create a new dataframe from those to extract the records which have the same Personal_ID's and determine the np.where() condition from there.

Please advise! Any helps are appreciated, thank you!!

Upvotes: 0

Views: 57

Answers (3)

gal peled
gal peled

Reputation: 482

Well, the first thing I want to answer about is the exception and how lucky you are that it didn't if your tables were the same length your code was going to work.

but there is an assumption in the code that I don't think you thought about and that is that the ids may not be in the same order or like in the example there are more ids in some table than the other if you had the same length of tables but not the same order you would have got incorrect eligible values for each row. the current way to do this is as follow

  1. first join the table to one using personal_id but use left join as you don't want to lose data if there is no origin info for that personal id.

    combine_df = pd.merge(employee_details, origin_details, on='Personal_ID', how='left')

  2. use the apply function to fill the new column

    combine_df['Eligible'] = combine_df['Country'].apply(lambda x:'Yes' if x=='UK' else 'No')

Upvotes: 1

Isuru Mahakumara
Isuru Mahakumara

Reputation: 11

You are using a Pandas Series object inside a Numpy method, np.where((origin_details['Country'])). I believe this is the problem.

try:

employee_details['Eligible'] = origin_details['Country'].apply(lambda x:"Yes" if x=='UK' else "No") 

It is always much easier and faster to use the pandas library to analyze dataframes instead of converting them back to numpy arrays

Upvotes: 1

Shubham Periwal
Shubham Periwal

Reputation: 2248

You can merge the 2 dataframes on Personal ID and then use np.where

Merge with how='outer' to keep all personal IDs

df_merge = pd.merge(employee_details, origin_details, on='Personal_ID', how='outer')

df_merge['Eligible'] = np.where(df_merge['Country']=='UK', 'Yes', 'No')


    Personal_ID Name    Age    Eligible Country
0   1000123     Tom     40     Yes      UK
1   1100258     Joseph  35     No       USA
2   1104682     Krish   43     No       FRA
3   1020943     John    51     No       SWE
4   1573482     NaN     NaN    Yes      UK
5   1739526     NaN     NaN    No       AU

If you dont want to keep all personal IDs then you can merge with how='inner' and you won't see the NANs

df_merge = pd.merge(employee_details, origin_details, on='Personal_ID', how='inner')

df_merge['Eligible'] = np.where(df_merge['Country']=='UK', 'Yes', 'No')

    Personal_ID Name    Age    Eligible Country
0   1000123     Tom     40     Yes      UK
1   1100258     Joseph  35     No       USA
2   1104682     Krish   43     No       FRA
3   1020943     John    51     No       SWE

Upvotes: 1

Related Questions