Reputation: 995
I have two CSVs. One is about donations for politicians: each line is a donation and each line has the name of the donor company and the federal registration number of that company (CPF_CNPJ_doador - unique for each company in the country)
CPF_CNPJ_doador Nome_doador Valor
73668675000187 CWVGRAFICA EDITORA E BAZAR EIRELI LTDA ME 1000,00
83647909000163 CARBONIFERA CRICIUMA S/A 1750,00
10731057000114 HAROLDO AZEVEDO CONSTRUÇÕES LTDA 100,00
...
The other CSV has companies that have environmental irregularities. You have several data, including the company name and the federal registration number of that company (CPF_CNPJ - unique for each company in the country). But a company may appear more than once in this CSV - it may have multiple affiliates with irregularities for example
Nome_Razao_Social CPF_CNPJ
Carajas Madeiras Industria e Comercio Ltda - Me 02579504000214
Carbo Gás Ltda 03828695000435
Carbomil Química S/A 07645062000108
Carbomil Química S/A 07645062000108
...
I made a pandas merge (Python 3) of these two CSVs:
ibama_doadores = pd.merge(eleitos_d_doadores, ibama, left_on = 'CPF_CNPJ_doador', right_on = 'CPF_CNPJ')
The problem is that the command looks for the result matches in both CSVs, but it repeats the lines whenever the second CSV the CPF_CNPJ_doador appears more than once:
CPF_CNPJ_doador Nome_doador_originario
7645062000108.0 CARBOMIL QUIMICA S A
7645062000108.0 CARBOMIL QUIMICA S A
7645062000108.0 CARBOMIL QUIMICA S A
...
Please, is there a merge type that eliminates repeated values? If the item searched in the second database is repeated
Upvotes: 1
Views: 58
Reputation: 862641
I think one possible solution is remove duplicates in both DataFrames:
ibama_doadores = pd.merge(eleitos_d_doadores.drop_duplicates('CPF_CNPJ_doador'),
ibama.drop_duplicates('CPF_CNPJ'),
left_on = 'CPF_CNPJ_doador',
right_on = 'CPF_CNPJ')
Or create new columns for count duplicates and add this column for join:
eleitos_d_doadores['g'] = eleitos_d_doadores.groupby('CPF_CNPJ_doador').cumcount()
ibama['g'] = ibama.groupby('CPF_CNPJ').cumcount()
ibama_doadores = pd.merge(eleitos_d_doadores,
ibama,
left_on = ['CPF_CNPJ_doador','g'],
right_on = ['CPF_CNPJ','g']).drop('g', 1)
Samples:
eleitos_d_doadores = pd.DataFrame({
'CPF_CNPJ_doador': ['a','b','c','c','a'],
'B': list(range(1,6))
})
ibama = pd.DataFrame({
'CPF_CNPJ': ['a','b','a','a','c'],
'C': list(range(5))
})
ibama_doadores = pd.merge(eleitos_d_doadores.drop_duplicates('CPF_CNPJ_doador'),
ibama.drop_duplicates('CPF_CNPJ'),
left_on = 'CPF_CNPJ_doador',
right_on = 'CPF_CNPJ')
print (ibama_doadores)
B CPF_CNPJ_doador C CPF_CNPJ
0 1 a 0 a
1 2 b 1 b
2 3 c 4 c
eleitos_d_doadores['g'] = eleitos_d_doadores.groupby('CPF_CNPJ_doador').cumcount()
ibama['g'] = ibama.groupby('CPF_CNPJ').cumcount()
ibama_doadores = pd.merge(eleitos_d_doadores,
ibama,
left_on = ['CPF_CNPJ_doador','g'],
right_on = ['CPF_CNPJ','g']).drop('g', 1)
print (ibama_doadores)
B CPF_CNPJ_doador C CPF_CNPJ
0 1 a 0 a
1 2 b 1 b
2 3 c 4 c
3 5 a 2 a
Upvotes: 1