Reputation: 45
I have a master data frame called Master, which contains all the ids of questions. I have multiple datasets containing those questions as header I want to replace those headers with their id's.
The Master table looks like this:
Question ID
gender 1
sex 1
what is your gender 1
sexual orientation 1
marital status 2
occupation 3
whats you job 3
df1 looks like this:
gender marital status occupation
Male Single Doctor
Male Divorced Engineer
Desired output
1 2 3
Male Single Doctor
Male Divorced Engineer
also if any new variable comes in df1 which doesn't have id mentioned in the Master data table, a new ID should be given to it and variable name and id will be updated in master table
for eg.
df2 looks like this:
gender marital status country
Male Single India
Male Divorced UK
desired df2 :
1 2 4
Male Single India
Male Divorced UK
updated master table will be:
Question ID
gender 1
sex 1
what is your gender 1
sexual orientation 1
marital status 2
occupation 3
whats you job 3
country 4
Upvotes: 2
Views: 1492
Reputation: 862741
Use DataFrame.rename
by Series
for set new columns names by another data:
df2 = df1.rename(columns=df.set_index('Question')['ID'])
print (df2)
1 2 3
0 Male Single Doctor
1 Male Divorced Engineer
EDIT:
There are duplicates in Question
values in df
, so need create unique Question
values. One possible solution is remove duplicates by DataFrame.drop_duplicates
, here are sample data for see how it working:
print (df)
Question ID
0 gender 10 <-duplicates, change ID for test
1 gender 15 <-duplicates, change ID for test
2 what is your gender 1
3 sexual orientation 1
4 marital status 2
5 occupation 3
6 whats you job 3
You can test what are duplciates in real data:
print (df[df.duplicated('Question', keep=False)])
Question ID
0 gender 10
1 gender 15
Removed duplicates and keep first dupe row, here ID=10
:
print (df.drop_duplicates('Question').set_index('Question')['ID'])
Question
gender 10
what is your gender 1
sexual orientation 1
marital status 2
occupation 3
whats you job 3
Name: ID, dtype: int64
df21 = df1.rename(columns=df.drop_duplicates('Question').set_index('Question')['ID'])
print (df21)
10 2 3
0 Male Single Doctor
1 Male Divorced Engineer
Removed duplicates and keep first dupe row, here ID=15
:
print (df.drop_duplicates('Question', keep='last').set_index('Question')['ID'])
Question
gender 15
what is your gender 1
sexual orientation 1
marital status 2
occupation 3
whats you job 3
Name: ID, dtype: int64
df22 = df1.rename(columns=df.drop_duplicates('Question', keep='last').set_index('Question')['ID'])
print (df22)
15 2 3
0 Male Single Doctor
1 Male Divorced Engineer
print (df.set_index('Question')['ID'].to_dict())
{'gender': 15, 'what is your gender': 1, 'sexual orientation': 1, 'marital status': 2, 'occupation': 3, 'whats you job': 3}
df22 = df1.rename(columns=df.set_index('Question')['ID'].to_dict())
print (df22)
15 2 3
0 Male Single Doctor
1 Male Divorced Engineer
EDIT1: If values in master DataFrame not exist and is necessary first append them use:
print (df)
Question ID
0 gender 1
1 sex 1
2 what is your gender 1
3 sexual orientation 1
4 marital status 2
5 occupation 3
6 whats you job 3
print (df1)
gender marital status country code1 code2
0 Male Single India 4 7
1 Male Divorced UK 3 5
Get all columns which not exist in df['Question']
:
cols = df1.columns.difference(df['Question'].tolist(), sort=False)
print (cols)
Index(['country', 'code1', 'code2'], dtype='object')
Add ID
next by maximal value:
df3 = pd.DataFrame({'Question':cols,
'ID': np.arange(df['ID'].max() + 1, len(cols) + df['ID'].max() + 1)})
print (df3)
Question ID
0 country 4
1 code1 5
2 code2 6
Append to original master DataFrame
:
df = pd.concat([df, df3], ignore_index=True)
print (df)
Question ID
0 gender 1
1 sex 1
2 what is your gender 1
3 sexual orientation 1
4 marital status 2
5 occupation 3
6 whats you job 3
7 country 4
8 code1 5
9 code2 6
Last use original solution:
df2 = df1.rename(columns=df.set_index('Question')['ID'])
print (df2)
1 2 4 5 6
0 Male Single India 4 7
1 Male Divorced UK 3 5
Upvotes: 2
Reputation: 5720
You can do a rename using the ID of the matching question:
df1.columns = [int(master[master.Question==c]['ID'].values) for c in df1.columns]
This should work for the multiple possible names for a given column.
Upvotes: 0