BirdBud
BirdBud

Reputation: 193

Pandas Left Join results in more rows than the Left Dataframe

I have more rows in the resultant left join than are in the left dataframe.

# Importing Pandas and changing it's call to pd
import numpy as np
import pandas as pd

SalesDF = pd.read_csv(r"C:\Users\USER\Documents\Reports\SalesForAnalysis.csv")
print("This is the Sales shape")
print(SalesDF.shape)


CustInfoDF = pd.read_csv(r"C:\Users\USER\Documents\Cust.csv")

# This reassigns the df so that the rows with a NaN in the Account Number it  doesn't appear
CustInfoDF = CustInfoDF[CustInfoDF['Account Number'].notna()]


# Merges the two dataframes on SalesDF with "Cust Number" as the key
MergeDF = pd.merge(SalesDF, CustInfoDF, how="left", left_on="Cust Number", right_on="Account Number")

print("This is the Merge Shape ")
print(MergeDF.shape)

# Reduced the number of columns to the selected columns
CutDF = MergeDF[["Customer", "Invoice #", "E-mail Address", "Phone", "Clerk", "Total", "Date"]]

CutDF.drop_duplicates()

print("This is the Cut shape ")
print(CutDF.shape)


Here is the result after running the program

This is the Sales shape
(5347, 61)
This is the Merge Shape 
(6428, 83)
This is the Cut shape 
(6428, 7)

Process finished with exit code 0

The CutDF should only have a max of 5347 rows. I have a drop_duplicates method in there but I still get the same results.

I saw this pandas left join - why more results? and inner join/merge in pandas dataframe give more rows than left dataframe but I didn't really see a solution to this in these though.

Any help would be appreciated.

Upvotes: 4

Views: 3481

Answers (1)

Parmandeep Chaddha
Parmandeep Chaddha

Reputation: 484

Before you execute:

MergeDF = pd.merge(SalesDF, CustInfoDF, how="left", left_on="Cust Number", right_on="Account Number")

Can you do:

CustInfoDF = CustInfoDF.drop_duplicates(subset=["Account Number"])

I have a suspicion that your CustInforDF has multiple entries for each Account Number.

If this doesn't work, can you please post the sample data frames? Feel free to add/replace with dummy values as long as the code is replicable.

Upvotes: 5

Related Questions