Reputation: 33
I have two tables that both contain an identifier called account code, but the first table can contain multiple occurences of that account code and the other table only has one occurence. My tables originally come from excel so they look like this after putting them into a pandas dataframe
base_data
+-------+----------------+----------+
| Name | Account Number | $ Amount |
+-------+----------------+----------+
| Brett | 1234 | a |
| Brett | 1234 | b |
| Jill | 2458 | c |
| Peter | 1485 | d |
+-------+----------------+----------+
licensee_fee
+----------------+--------------+
| Account Number | Licensee Fee |
+----------------+--------------+
| 1234 | x |
| 1485 | y |
+----------------+--------------+
So when I do
base_data = pd.read_excel(filename, sheet_name=0, dytpe={"Account Number": "str"})
licensee_fee = pd.read_excel(filename, sheet_name=1, dtype={"Account Number": "str"})
# the first 2 columns contain irrelevant data
result = pd.merge(base_date, licensee_fee.iloc[:,[2,3]], how="outer", on="Account Number")
as expected I get
+-------+----------------+----------+--------------+
| Name | Account Number | $ Amount | Licensee Fee |
+-------+----------------+----------+--------------+
| Brett | 1234 | a | x |
| Brett | 1234 | b | x |
| Jill | 2458 | c | - |
| Peter | 1485 | d | y |
+-------+----------------+----------+--------------+
but this is incorrect in terms of what I need. What I actually want is for it to look like this
+-------+----------------+----------+--------------+
| Name | Account Number | $ Amount | Licensee Fee |
+-------+----------------+----------+--------------+
| Brett | 1234 | a | x |
| Brett | 1234 | b | - |
| Jill | 2458 | c | - |
| Peter | 1485 | d | y |
+-------+----------------+----------+--------------+
where the licensee fee only comes across once.
I have some code to take care of the NULL values after so that's not a problem.
Upvotes: 3
Views: 80
Reputation: 375925
You can merge first, then take care of the NaN after:
In [11]: res = df.merge(df1, how='outer')
In [12]: res
Out[12]:
Name Account Number $Amount Licensee Fee
0 Brett 1234 a x
1 Brett 1234 b x
2 Jill 2458 c NaN
3 Peter 1485 d y
In [13]: res.loc[res.groupby("Account Number").cumcount() > 0, "Licensee Fee"] = np.nan
In [14]: res
Out[14]:
Name Account Number $Amount Licensee Fee
0 Brett 1234 a x
1 Brett 1234 b NaN
2 Jill 2458 c NaN
3 Peter 1485 d y
Upvotes: 1
Reputation: 323396
This is a good question , you may need using cumcount
create the merge
helpkey first , this will make sure once the fee item been used , it will not used again.
base['helpkey']=base.groupby('AccountNumber').cumcount()
fee['helpkey']=fee.groupby('AccountNumber').cumcount()
yourdf=base.merge(fee,on=['AccountNumber','helpkey'],how='left').drop('helpkey',1)
yourdf
Name AccountNumber $Amount LicenseeFee
0 Brett 1234 a x
1 Brett 1234 b NaN
2 Jill 2458 c NaN
3 Peter 1485 d y
Upvotes: 3