Reputation: 97
I have 2 dataframes preds
and assets_to_remove
.
This is how the dataframe preds
looks:
asset_id asset_name
294771 493646671302244 queue_bar
294770 503848157271852 refactor_target
294769 786314528522899 submission_tray
294768 206472013793428 state_subscriber
294767 510707746509671 format_gk
... ... ...
4 688122571800214 v2_reads
3 323798285353466 products_v2_reads
2 395943214896870 update_protocol
1 680526449474908 fix_v153
0 349458202857963 adjustment_v159
[294772 rows x 2 columns]
This is how the dataframe assets_to_remove
looks:
asset_id
0 513454469563578
1 829695400866900
2 764696234441014
3 195100021778259
4 368797654574209
.. ...
20 237207674121701
21 135774837852816
22 2453638234940010
23 705229516884471
24 343619773239104
[1995 rows x 1 columns]
Neither of these 2 dataframes have a row with asset_id
equal to 57412518735315968.
Checking preds
:
print(preds[preds.asset_id.eq(57412518735315968)])
Empty DataFrame
Columns: [asset_id, asset_name]
Index: []
Checking assets_to_remove
:
print(assets_to_remove[assets_to_remove.asset_id.eq(57412518735315968)])
Empty DataFrame
Columns: [asset_id]
Index: []
Now I do an outer join on these 2 dataframes:
z = pd.merge(preds,assets_to_remove,on="asset_id",how="outer",indicator="source").astype({"asset_id": "int64"})
Gives a result dataframe like this:
asset_id ... source
0 493646671302244 ... left_only
1 503848157271852 ... left_only
2 786314528522899 ... left_only
3 206472013793428 ... left_only
4 510707746509671 ... left_only
... ... ... ...
296016 743251236547292 ... right_only
296017 890822734697339 ... right_only
296018 274927503757939 ... right_only
296019 943962539702954 ... right_only
296020 2453638234940010 ... right_only
[296021 rows x 3 columns]
This dataframe that has a row with asset id 57412518735315968!
print(z[z.asset_id.eq(57412518735315968)])
asset_id asset_name source
216128 57412518735315968 storefront_ig_new_menu_items_internal left_only
How is this possible? Neither of the 2 dataframes have this value. I also made sure there are no duplicate rows in both the dataframes. Can someone please shed some light on this?
Upvotes: 1
Views: 439
Reputation: 806
Check datatype on asset_id on dataframes before the merge, is it int64 in both cases?
This issue could be possible as before the merge you are comparing to numeric value 57412518735315968, if type in original dataframes is not int64, and instead is object, then your equality check will not be returning a matching row.
In the merge step you are explicitly changing the data type of asset_id to int64, equality check will pass in this case.
Upvotes: 1