Hemant Sain
Hemant Sain

Reputation: 45

Match Column name stored in another dataframe and replace by its ID

I have a master data frame called Master, which contains all the ids of questions. I have multiple datasets containing those questions as header I want to replace those headers with their id's.

The Master table looks like this:

Question               ID

gender                 1
sex                    1
what is your gender    1
sexual orientation     1
marital status         2
occupation             3
whats you job          3

df1 looks like this:

gender         marital status  occupation

Male           Single          Doctor
Male           Divorced        Engineer

Desired output

   1            2                 3                 

   Male        Single            Doctor
   Male        Divorced          Engineer

also if any new variable comes in df1 which doesn't have id mentioned in the Master data table, a new ID should be given to it and variable name and id will be updated in master table

for eg.

df2 looks like this:

gender         marital status  country

Male           Single          India
Male           Divorced        UK

desired df2 :

1                 2              4

Male           Single          India
Male           Divorced        UK

updated master table will be:

Question               ID

gender                 1
sex                    1
what is your gender    1
sexual orientation     1
marital status         2
occupation             3
whats you job          3
country                4

Upvotes: 2

Views: 1492

Answers (2)

jezrael
jezrael

Reputation: 862741

Use DataFrame.rename by Series for set new columns names by another data:

df2 = df1.rename(columns=df.set_index('Question')['ID'])
print (df2)
      1         2         3
0  Male    Single    Doctor
1  Male  Divorced  Engineer

EDIT:

There are duplicates in Question values in df, so need create unique Question values. One possible solution is remove duplicates by DataFrame.drop_duplicates, here are sample data for see how it working:

print (df)
              Question  ID
0               gender  10 <-duplicates, change ID for test
1               gender  15 <-duplicates, change ID for test
2  what is your gender   1
3   sexual orientation   1
4       marital status   2
5           occupation   3
6        whats you job   3

You can test what are duplciates in real data:

print (df[df.duplicated('Question', keep=False)])
  Question  ID
0   gender  10
1   gender  15

Removed duplicates and keep first dupe row, here ID=10:

print (df.drop_duplicates('Question').set_index('Question')['ID'])
Question
gender                 10
what is your gender     1
sexual orientation      1
marital status          2
occupation              3
whats you job           3
Name: ID, dtype: int64

df21 = df1.rename(columns=df.drop_duplicates('Question').set_index('Question')['ID'])
print (df21)
     10        2         3 
0  Male    Single    Doctor
1  Male  Divorced  Engineer

Removed duplicates and keep first dupe row, here ID=15:

print (df.drop_duplicates('Question', keep='last').set_index('Question')['ID'])
Question
gender                 15
what is your gender     1
sexual orientation      1
marital status          2
occupation              3
whats you job           3
Name: ID, dtype: int64

df22 = df1.rename(columns=df.drop_duplicates('Question', keep='last').set_index('Question')['ID'])
print (df22)
     15        2         3 
0  Male    Single    Doctor
1  Male  Divorced  Engineer


print (df.set_index('Question')['ID'].to_dict())
{'gender': 15, 'what is your gender': 1, 'sexual orientation': 1, 'marital status': 2, 'occupation': 3, 'whats you job': 3}



df22 = df1.rename(columns=df.set_index('Question')['ID'].to_dict())
print (df22)
     15        2         3 
0  Male    Single    Doctor
1  Male  Divorced  Engineer

EDIT1: If values in master DataFrame not exist and is necessary first append them use:

print (df)
              Question  ID
0               gender   1
1                  sex   1
2  what is your gender   1
3   sexual orientation   1
4       marital status   2
5           occupation   3
6        whats you job   3

print (df1) 
  gender marital status country  code1  code2
0   Male         Single   India      4      7
1   Male       Divorced      UK      3      5

Get all columns which not exist in df['Question']:

cols = df1.columns.difference(df['Question'].tolist(), sort=False)
print (cols)
Index(['country', 'code1', 'code2'], dtype='object')

Add ID next by maximal value:

df3 = pd.DataFrame({'Question':cols, 
                    'ID': np.arange(df['ID'].max() + 1, len(cols) + df['ID'].max() + 1)})
print (df3) 
  Question  ID
0  country   4
1    code1   5
2    code2   6

Append to original master DataFrame:

df = pd.concat([df, df3], ignore_index=True)
print (df)
              Question  ID
0               gender   1
1                  sex   1
2  what is your gender   1
3   sexual orientation   1
4       marital status   2
5           occupation   3
6        whats you job   3
7              country   4
8                code1   5
9                code2   6

Last use original solution:

df2 = df1.rename(columns=df.set_index('Question')['ID'])
print (df2)
      1         2      4  5  6
0  Male    Single  India  4  7
1  Male  Divorced     UK  3  5

Upvotes: 2

jeremy_rutman
jeremy_rutman

Reputation: 5720

You can do a rename using the ID of the matching question:

df1.columns = [int(master[master.Question==c]['ID'].values) for c in df1.columns]

This should work for the multiple possible names for a given column.

Upvotes: 0

Related Questions