How to deal with duplicate "unique identifiers" in pandas

Question

I have two tables that both contain an identifier called account code, but the first table can contain multiple occurences of that account code and the other table only has one occurence. My tables originally come from excel so they look like this after putting them into a pandas dataframe

base_data

+-------+----------------+----------+
| Name  | Account Number | $ Amount |
+-------+----------------+----------+
| Brett |           1234 | a        |
| Brett |           1234 | b        |
| Jill  |           2458 | c        |
| Peter |           1485 | d        |
+-------+----------------+----------+

licensee_fee

+----------------+--------------+
| Account Number | Licensee Fee |
+----------------+--------------+
|           1234 | x            |
|           1485 | y            |
+----------------+--------------+

So when I do

base_data = pd.read_excel(filename, sheet_name=0, dytpe={"Account Number": "str"})

licensee_fee = pd.read_excel(filename, sheet_name=1, dtype={"Account Number": "str"})

# the first 2 columns contain irrelevant data
result = pd.merge(base_date, licensee_fee.iloc[:,[2,3]], how="outer", on="Account Number")

as expected I get

+-------+----------------+----------+--------------+
| Name  | Account Number | $ Amount | Licensee Fee |
+-------+----------------+----------+--------------+
| Brett |           1234 | a        | x            |
| Brett |           1234 | b        | x            |
| Jill  |           2458 | c        | -            |
| Peter |           1485 | d        | y            |
+-------+----------------+----------+--------------+

but this is incorrect in terms of what I need. What I actually want is for it to look like this

+-------+----------------+----------+--------------+
| Name  | Account Number | $ Amount | Licensee Fee |
+-------+----------------+----------+--------------+
| Brett |           1234 | a        | x            |
| Brett |           1234 | b        | -            |
| Jill  |           2458 | c        | -            |
| Peter |           1485 | d        | y            |
+-------+----------------+----------+--------------+

where the licensee fee only comes across once.

I have some code to take care of the NULL values after so that's not a problem.

BENY · Accepted Answer

This is a good question , you may need using cumcount create the merge helpkey first , this will make sure once the fee item been used , it will not used again.

base['helpkey']=base.groupby('AccountNumber').cumcount()
fee['helpkey']=fee.groupby('AccountNumber').cumcount()
yourdf=base.merge(fee,on=['AccountNumber','helpkey'],how='left').drop('helpkey',1)
yourdf
    Name  AccountNumber $Amount LicenseeFee
0  Brett           1234       a           x
1  Brett           1234       b         NaN
2   Jill           2458       c         NaN
3  Peter           1485       d           y

How to deal with duplicate "unique identifiers" in pandas

Answers (2)

Related Questions

How to deal with duplicate &quot;unique identifiers&quot; in pandas

Answers (2)

Related Questions

How to deal with duplicate "unique identifiers" in pandas