Reputation: 193
I have more rows in the resultant left join than are in the left dataframe.
# Importing Pandas and changing it's call to pd
import numpy as np
import pandas as pd
SalesDF = pd.read_csv(r"C:\Users\USER\Documents\Reports\SalesForAnalysis.csv")
print("This is the Sales shape")
print(SalesDF.shape)
CustInfoDF = pd.read_csv(r"C:\Users\USER\Documents\Cust.csv")
# This reassigns the df so that the rows with a NaN in the Account Number it doesn't appear
CustInfoDF = CustInfoDF[CustInfoDF['Account Number'].notna()]
# Merges the two dataframes on SalesDF with "Cust Number" as the key
MergeDF = pd.merge(SalesDF, CustInfoDF, how="left", left_on="Cust Number", right_on="Account Number")
print("This is the Merge Shape ")
print(MergeDF.shape)
# Reduced the number of columns to the selected columns
CutDF = MergeDF[["Customer", "Invoice #", "E-mail Address", "Phone", "Clerk", "Total", "Date"]]
CutDF.drop_duplicates()
print("This is the Cut shape ")
print(CutDF.shape)
Here is the result after running the program
This is the Sales shape
(5347, 61)
This is the Merge Shape
(6428, 83)
This is the Cut shape
(6428, 7)
Process finished with exit code 0
The CutDF should only have a max of 5347 rows. I have a drop_duplicates method in there but I still get the same results.
I saw this pandas left join - why more results? and inner join/merge in pandas dataframe give more rows than left dataframe but I didn't really see a solution to this in these though.
Any help would be appreciated.
Upvotes: 4
Views: 3481
Reputation: 484
Before you execute:
MergeDF = pd.merge(SalesDF, CustInfoDF, how="left", left_on="Cust Number", right_on="Account Number")
Can you do:
CustInfoDF = CustInfoDF.drop_duplicates(subset=["Account Number"])
I have a suspicion that your CustInforDF
has multiple entries for each Account Number
.
If this doesn't work, can you please post the sample data frames? Feel free to add/replace with dummy values as long as the code is replicable.
Upvotes: 5