Jane Borges
Jane Borges

Reputation: 592

How to merge two dataframes according to two columns and add a new column in the result?

I have two pandas dataframes:

import pandas as pd

df1 = pd.DataFrame({'id': [101, 102, 103, 104, 105, 106],
        'code': [1, 2, 3, 4, 5, 2],                   
        'address_name':['ESPLANADA DOM CARLOS',
                        'CHAMP DE MARS 5 AV ANATOLE',
                        'PARQUE NACIONAL DA TIJUCA',
                        'C D ARISTIDES MAILLOL',
                        'CHAMP DE MARS 5 AV ANATOLE',
                        'CHAMP DE MARS 5 AV ANATOLE']})
print(df1)
id   code   address_name
101   1     ESPLANADA DOM CARLOS
102   2     CHAMP DE MARS 5 AV ANATOLE
103   3     PARQUE NACIONAL DA TIJUCA
104   4     C D ARISTIDES MAILLOL
105   5     CHAMP DE MARS 5 AV ANATOLE
106   2     CHAMP DE MARS 5 AV ANATOLE
df2 = pd.DataFrame({'cod': [5, 2, 1],
        'population': [6748, 6748, 504],
        'address':['PARQUE NACIONAL DA TIJUCA',
                   'CHAMP DE MARS 5 AV ANATOLE',
                   'ESPLANADA DOM CARLOS']})

print(df2)
  cod   population     address
   5    6748          PARQUE NACIONAL DA TIJUCA
   2    6748          CHAMP DE MARS 5 AV ANATOLE
   1    504           ESPLANADA DOM CARLOS

I would like to make a join between the dataframes where the 'address_name' of df_1 is equal to the 'address' of df2 and the 'code' of df2 was the same as the 'cod' of df2. If both were true then I would like to save in a new column what are the 'id's that this happened.

I tried implementing the following code using merge():

  pd.merge(df2, df1[['address_name', 'code']], left_on="address", right_on="address_name", how='left')

The (wrong) output is being:

   cod   population     address                      address_name              code
    5       6748       PARQUE NACIONAL DA TIJUCA    PARQUE NACIONAL DA TIJUCA   3
    2       6748       CHAMP DE MARS 5 AV ANATOLE   CHAMP DE MARS 5 AV ANATOLE  2
    2       6748       CHAMP DE MARS 5 AV ANATOLE   CHAMP DE MARS 5 AV ANATOLE  5
    2       6748       CHAMP DE MARS 5 AV ANATOLE   CHAMP DE MARS 5 AV ANATOLE  2
    1        504       ESPLANADA DOM CARLOS         ESPLANADA DOM CARLOS        1

I would like the output to be:

   cod  population      address                      address_name          code    id
    2     6748     CHAMP DE MARS 5 AV ANATOLE   CHAMP DE MARS 5 AV ANATOLE  2    [102,106]
    1     504      ESPLANADA DOM CARLOS         ESPLANADA DOM CARLOS        1       [101]

Upvotes: 0

Views: 52

Answers (1)

user7864386
user7864386

Reputation:

merge on two columns and groupby all columns except "id" and create a list of "id"s:

merged_df = df1.merge(df2, left_on=['address_name','code'], right_on=['address','cod'])
out = merged_df.groupby(merged_df.drop('id', axis=1).columns.tolist())['id'].apply(list).reset_index()

Output:

    code  address_name                cod  population  address                   
0   1     ESPLANADA DOM CARLOS        1    504         ESPLANADA DOM CARLOS               [101]
1   2     CHAMP DE MARS 5 AV ANATOLE  2    6748        CHAMP DE MARS 5 AV ANATOLE    [102, 106]

Upvotes: 1

Related Questions