Jameson Welch

Reputation: 43

How do I remove duplicate rows in a Pandas DataFrame based on values in a specific column?

I have two dataframes that contain duplicates, but I need to remove only the rows that have duplicate VIN numbers, without looking at the other cells.

0  230  5UXCR6C50KTQ4xxxx  KLL34607  2019 BMW M3                        
1  116  5UXCR4C00LLW6xxxx  LLW63494  2020 BMW X5           Not Found      
2  109  5UXCR6C06LLL7xxxx  LLL76916  2020 BMW X5         Need Detail

38  229  5UXCR6C50KLL3xxxx  MLL23650  2019 BMW X5                    
43  115  5UXCR4C00LLW6xxxx  LLW63494  2020 BMW X5                    
37  108  5UXCR6C06LLL7xxxx  LLL76916  2020 BMW X5 

The last 2 rows look like different rows to pandas, but I need to merge the two dataframes and remove rows based only on those VIN numbers, ignoring the 'Not Found' and 'Need Detail' values.

I've tried .drop_duplicates(), .cumsum(), and a few other methods, but nothing seems to work.

Upvotes: 0

Views: 54

Answers (1)

DavidK

Reputation: 2564

It sounds like you need to concatenate the two dataframes and then remove duplicated rows based on only a subset of columns.

You can use pd.concat([df1, df2]).drop_duplicates(subset=['VIN'])

where subset is a list of the column names used to identify duplicated rows; all other columns are ignored. By default keep='first', so the first occurrence of each VIN is retained. (See the drop_duplicates documentation for further details.)
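A minimal sketch of the idea, with toy data modeled on the question (the column names 'VIN' and 'Note' are assumptions, since the original frames have no headers):

```python
import pandas as pd

# First frame: rows carrying status notes ('VIN' / 'Note' names are assumed).
df1 = pd.DataFrame({
    "VIN": ["5UXCR4C00LLW6xxxx", "5UXCR6C06LLL7xxxx"],
    "Note": ["Not Found", "Need Detail"],
})

# Second frame: the same VINs, but with different values in the other column.
df2 = pd.DataFrame({
    "VIN": ["5UXCR4C00LLW6xxxx", "5UXCR6C06LLL7xxxx"],
    "Note": ["", ""],
})

# Stack the frames, then keep only the first row seen for each VIN.
# Because subset=["VIN"], the differing 'Note' values are ignored when
# deciding what counts as a duplicate.
merged = pd.concat([df1, df2]).drop_duplicates(subset=["VIN"])
print(merged)  # two rows remain, one per unique VIN
```

Without subset=, drop_duplicates would compare entire rows, so the differing 'Note' values would make every row look unique, which is why plain .drop_duplicates() appeared not to work.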

Upvotes: 1
