Reputation: 37
I have two csv files as my raw data to read into different dataframes. One is called 'employee' and another is called 'origin'. However, I cannot upload the files here so I hardcoded the data into the dataframes below. The task I'm trying to solve is to update the 'Eligible' column in employee_details with 'Yes' or 'No' based on the value of the 'Country' column in origin_details. If Country = UK, then put 'Yes' in the Eligible column for that Personal_ID. Else, put 'No'.
import pandas as pd
import numpy as np
employee = {
'Personal_ID': ['1000123', '1100258', '1104682', '1020943'],
'Name': ['Tom', 'Joseph', 'Krish', 'John'],
'Age': ['40', '35', '43', '51'],
'Eligible': ' '}
origin = {
'Personal_ID': ['1000123', '1100258', '1104682', '1020943', '1573482', '1739526'],
'Country': ['UK', 'USA', 'FRA', 'SWE', 'UK', 'AU']}
employee_details = pd.DataFrame(employee)
origin_details = pd.DataFrame(origin)
employee_details['Eligible'] = np.where((origin_details['Country']) == 'UK', 'Yes', 'No')
print(employee_details)
print(origin_details)
The output of above code shows the below error message:
ValueError: Length of values (6) does not match length of index (4)
However, I am expecting to see the below as my output.
Personal_ID Name Age Eligible
0 1000123 Tom 40 Yes
1 1100258 Joseph 35 No
2 1104682 Krish 43 No
3 1020943 John 51 No
I also don't want to delete anything in my dataframes to match the size specified in the ValueError message because I may need the extra Personal_IDs in the origin_details later. Alternatively, I can keep all the existing Personal_ID's in the raw data (employee_details, origin_details) and create a new dataframe from those to extract the records which have the same Personal_ID's and determine the np.where() condition from there.
Please advise! Any helps are appreciated, thank you!!
Upvotes: 0
Views: 57
Reputation: 482
Well, the first thing I want to answer about is the exception and how lucky you are that it didn't if your tables were the same length your code was going to work.
but there is an assumption in the code that I don't think you thought about and that is that the ids may not be in the same order or like in the example there are more ids in some table than the other if you had the same length of tables but not the same order you would have got incorrect eligible values for each row. the current way to do this is as follow
first join the table to one using personal_id but use left join as you don't want to lose data if there is no origin info for that personal id.
combine_df = pd.merge(employee_details, origin_details, on='Personal_ID', how='left')
use the apply function to fill the new column
combine_df['Eligible'] = combine_df['Country'].apply(lambda x:'Yes' if x=='UK' else 'No')
Upvotes: 1
Reputation: 11
You are using a Pandas Series object inside a Numpy method, np.where((origin_details['Country']))
. I believe this is the problem.
try:
employee_details['Eligible'] = origin_details['Country'].apply(lambda x:"Yes" if x=='UK' else "No")
It is always much easier and faster to use the pandas library to analyze dataframes instead of converting them back to numpy arrays
Upvotes: 1
Reputation: 2248
You can merge the 2 dataframes on Personal ID and then use np.where
Merge with how='outer'
to keep all personal IDs
df_merge = pd.merge(employee_details, origin_details, on='Personal_ID', how='outer')
df_merge['Eligible'] = np.where(df_merge['Country']=='UK', 'Yes', 'No')
Personal_ID Name Age Eligible Country
0 1000123 Tom 40 Yes UK
1 1100258 Joseph 35 No USA
2 1104682 Krish 43 No FRA
3 1020943 John 51 No SWE
4 1573482 NaN NaN Yes UK
5 1739526 NaN NaN No AU
If you dont want to keep all personal IDs then you can merge with how='inner'
and you won't see the NANs
df_merge = pd.merge(employee_details, origin_details, on='Personal_ID', how='inner')
df_merge['Eligible'] = np.where(df_merge['Country']=='UK', 'Yes', 'No')
Personal_ID Name Age Eligible Country
0 1000123 Tom 40 Yes UK
1 1100258 Joseph 35 No USA
2 1104682 Krish 43 No FRA
3 1020943 John 51 No SWE
Upvotes: 1