Joshua Mercurio
Joshua Mercurio

Reputation: 33

How to deal with duplicate "unique identifiers" in pandas

I have two tables that both contain an identifier called account code, but the first table can contain multiple occurences of that account code and the other table only has one occurence. My tables originally come from excel so they look like this after putting them into a pandas dataframe

base_data

+-------+----------------+----------+
| Name  | Account Number | $ Amount |
+-------+----------------+----------+
| Brett |           1234 | a        |
| Brett |           1234 | b        |
| Jill  |           2458 | c        |
| Peter |           1485 | d        |
+-------+----------------+----------+

licensee_fee

+----------------+--------------+
| Account Number | Licensee Fee |
+----------------+--------------+
|           1234 | x            |
|           1485 | y            |
+----------------+--------------+

So when I do

base_data = pd.read_excel(filename, sheet_name=0, dytpe={"Account Number": "str"})

licensee_fee = pd.read_excel(filename, sheet_name=1, dtype={"Account Number": "str"})

# the first 2 columns contain irrelevant data
result = pd.merge(base_date, licensee_fee.iloc[:,[2,3]], how="outer", on="Account Number")

as expected I get

+-------+----------------+----------+--------------+
| Name  | Account Number | $ Amount | Licensee Fee |
+-------+----------------+----------+--------------+
| Brett |           1234 | a        | x            |
| Brett |           1234 | b        | x            |
| Jill  |           2458 | c        | -            |
| Peter |           1485 | d        | y            |
+-------+----------------+----------+--------------+

but this is incorrect in terms of what I need. What I actually want is for it to look like this

+-------+----------------+----------+--------------+
| Name  | Account Number | $ Amount | Licensee Fee |
+-------+----------------+----------+--------------+
| Brett |           1234 | a        | x            |
| Brett |           1234 | b        | -            |
| Jill  |           2458 | c        | -            |
| Peter |           1485 | d        | y            |
+-------+----------------+----------+--------------+

where the licensee fee only comes across once.

I have some code to take care of the NULL values after so that's not a problem.

Upvotes: 3

Views: 80

Answers (2)

Andy Hayden
Andy Hayden

Reputation: 375925

You can merge first, then take care of the NaN after:

In [11]: res = df.merge(df1, how='outer')

In [12]: res
Out[12]:
    Name  Account Number $Amount Licensee Fee
0  Brett            1234       a            x
1  Brett            1234       b            x
2   Jill            2458       c          NaN
3  Peter            1485       d            y

In [13]: res.loc[res.groupby("Account Number").cumcount() > 0, "Licensee Fee"] = np.nan

In [14]: res
Out[14]:
    Name  Account Number $Amount Licensee Fee
0  Brett            1234       a            x
1  Brett            1234       b          NaN
2   Jill            2458       c          NaN
3  Peter            1485       d            y

Upvotes: 1

BENY
BENY

Reputation: 323396

This is a good question , you may need using cumcount create the merge helpkey first , this will make sure once the fee item been used , it will not used again.

base['helpkey']=base.groupby('AccountNumber').cumcount()
fee['helpkey']=fee.groupby('AccountNumber').cumcount()
yourdf=base.merge(fee,on=['AccountNumber','helpkey'],how='left').drop('helpkey',1)
yourdf
    Name  AccountNumber $Amount LicenseeFee
0  Brett           1234       a           x
1  Brett           1234       b         NaN
2   Jill           2458       c         NaN
3  Peter           1485       d           y   

Upvotes: 3

Related Questions