How to eliminate repeated values in pandas?

Question

I have two CSVs. One is about donations for politicians: each line is a donation and each line has the name of the donor company and the federal registration number of that company (CPF_CNPJ_doador - unique for each company in the country)

CPF_CNPJ_doador Nome_doador                               Valor
73668675000187  CWVGRAFICA EDITORA E BAZAR EIRELI LTDA ME 1000,00

83647909000163  CARBONIFERA CRICIUMA S/A 1750,00

10731057000114  HAROLDO AZEVEDO CONSTRUÇÕES LTDA 100,00
...

The other CSV has companies that have environmental irregularities. You have several data, including the company name and the federal registration number of that company (CPF_CNPJ - unique for each company in the country). But a company may appear more than once in this CSV - it may have multiple affiliates with irregularities for example

Nome_Razao_Social   CPF_CNPJ
Carajas Madeiras Industria e Comercio Ltda - Me 02579504000214
Carbo Gás Ltda  03828695000435
Carbomil Química S/A    07645062000108
Carbomil Química S/A    07645062000108
...

I made a pandas merge (Python 3) of these two CSVs:

ibama_doadores = pd.merge(eleitos_d_doadores, ibama, left_on = 'CPF_CNPJ_doador', right_on = 'CPF_CNPJ')

The problem is that the command looks for the result matches in both CSVs, but it repeats the lines whenever the second CSV the CPF_CNPJ_doador appears more than once:

CPF_CNPJ_doador Nome_doador_originario
7645062000108.0 CARBOMIL QUIMICA S A
7645062000108.0 CARBOMIL QUIMICA S A
7645062000108.0 CARBOMIL QUIMICA S A
...

Please, is there a merge type that eliminates repeated values? If the item searched in the second database is repeated

jezrael · Accepted Answer

I think one possible solution is remove duplicates in both DataFrames:

ibama_doadores = pd.merge(eleitos_d_doadores.drop_duplicates('CPF_CNPJ_doador'), 
                          ibama.drop_duplicates('CPF_CNPJ'), 
                          left_on = 'CPF_CNPJ_doador',
                          right_on = 'CPF_CNPJ')

Or create new columns for count duplicates and add this column for join:

eleitos_d_doadores['g'] = eleitos_d_doadores.groupby('CPF_CNPJ_doador').cumcount()
ibama['g'] = ibama.groupby('CPF_CNPJ').cumcount()

ibama_doadores = pd.merge(eleitos_d_doadores, 
                          ibama, 
                          left_on = ['CPF_CNPJ_doador','g'],
                          right_on = ['CPF_CNPJ','g']).drop('g', 1)

Samples:

eleitos_d_doadores = pd.DataFrame({
    'CPF_CNPJ_doador': ['a','b','c','c','a'],
    'B': list(range(1,6))
})

ibama = pd.DataFrame({
    'CPF_CNPJ': ['a','b','a','a','c'],
    'C': list(range(5))
})

ibama_doadores = pd.merge(eleitos_d_doadores.drop_duplicates('CPF_CNPJ_doador'), 
                          ibama.drop_duplicates('CPF_CNPJ'), 
                          left_on = 'CPF_CNPJ_doador',
                          right_on = 'CPF_CNPJ')
print (ibama_doadores)
   B CPF_CNPJ_doador  C CPF_CNPJ
0  1               a  0        a
1  2               b  1        b
2  3               c  4        c

eleitos_d_doadores['g'] = eleitos_d_doadores.groupby('CPF_CNPJ_doador').cumcount()
ibama['g'] = ibama.groupby('CPF_CNPJ').cumcount()

ibama_doadores = pd.merge(eleitos_d_doadores, 
                          ibama, 
                          left_on = ['CPF_CNPJ_doador','g'],
                          right_on = ['CPF_CNPJ','g']).drop('g', 1)

print (ibama_doadores)
   B CPF_CNPJ_doador  C CPF_CNPJ
0  1               a  0        a
1  2               b  1        b
2  3               c  4        c
3  5               a  2        a

How to eliminate repeated values in pandas?

Answers (1)

Related Questions