abautista

Reputation: 2780

How to drop duplicate values based on specific columns using pandas?

Currently, I have imported the following data frame from Excel into pandas, and I want to delete duplicate rows based on the values of two columns.

# Python 3.5.2
# Pandas library version 0.22

import pandas as pd 

# Save the Excel workbook in a variable
current_workbook  = pd.ExcelFile('C:\\Users\\userX\\Desktop\\cost_values.xlsx')

# convert the workbook to a data frame
current_worksheet = pd.read_excel(current_workbook, index_col = 'vend_num') 

# current output
print(current_worksheet)


| vend_number | vend_name            | quantity  | source  |
| ----------- | -------------------- | --------- | ------- |
| CHARLS      | Charlie & Associates | $5,700.00 | Central |
| CHARLS      | Charlie & Associates | $5,700.00 | South   |
| CHARLS      | Charlie & Associates | $5,700.00 | North   |
| CHARLS      | Charlie & Associates | $5,700.00 | West    |
| HUGHES      | Hughinos             | $3,800.00 | Central |
| HUGHES      | Hughinos             | $3,800.00 | South   |
| FERNAS      | Fernanda Industries  | $3,500.00 | South   |
| FERNAS      | Fernanda Industries  | $3,500.00 | North   |
| FERNAS      | Fernanda Industries  | $3,000.00 | West    |

....

What I want is to remove the duplicate rows based on the columns quantity and source:

  1. Review the quantity and source column values:

    1.1. If a vendor's quantity is repeated in other rows for the same vendor and one of those rows has source Central, keep the Central row and drop the other repeats.

    1.2. Else, if a vendor's quantity is repeated in other rows for the same vendor and none of those rows has source Central, keep only one of the repeated rows and drop the rest.

Desired result

| vend_number | vend_name            | quantity  | source  |
| ----------- | -------------------- | --------- | ------- |
| CHARLS      | Charlie & Associates | $5,700.00 | Central |
| HUGHES      | Hughinos             | $3,800.00 | Central |
| FERNAS      | Fernanda Industries  | $3,500.00 | South   |
| FERNAS      | Fernanda Industries  | $3,000.00 | West    |

....

So far, I have tried the following code, but pandas does not detect any duplicate rows.

print(current_worksheet.loc[current_worksheet.duplicated()])
print(current_worksheet.duplicated())
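For reference, `duplicated()` with no arguments compares entire rows, and every row here differs in `source`, so nothing is flagged; restricting the comparison with the `subset` parameter does find the repeats. A minimal reproduction, with the sample rows above inlined by hand (an assumption, since the real data comes from the Excel workbook):

```python
import pandas as pd

# Sample rows from the table above, inlined by hand.
df = pd.DataFrame({
    'vend_number': ['CHARLS'] * 4 + ['HUGHES'] * 2 + ['FERNAS'] * 3,
    'vend_name': ['Charlie & Associates'] * 4 + ['Hughinos'] * 2
                 + ['Fernanda Industries'] * 3,
    'quantity': ['$5,700.00'] * 4 + ['$3,800.00'] * 2
                + ['$3,500.00'] * 2 + ['$3,000.00'],
    'source': ['Central', 'South', 'North', 'West',
               'Central', 'South', 'South', 'North', 'West'],
})

# Whole-row comparison: every row differs in `source`, so nothing matches.
print(df.duplicated().sum())  # 0

# Comparing only vendor and quantity flags the repeated rows.
print(df.duplicated(subset=['vend_number', 'quantity']).sum())  # 5
```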

I have tried to figure out the solution but I am struggling quite a bit in this problem, so any help in this question is greatly appreciated. Feel free to improve the question.

Upvotes: 3

Views: 7470

Answers (2)

jpp

Reputation: 164613

Here is one way.

# Flag the Central rows so they sort to the top, keep the first row
# per (vend_name, quantity) pair, then discard the helper column.
df['CentralFlag'] = (df['source'] == 'Central')

df = df.sort_values('CentralFlag', ascending=False)\
       .drop_duplicates(['vend_name', 'quantity'])\
       .drop(columns='CentralFlag')

#   vend_number           vend_name   quantity   source
# 0      CHARLS  Charlie&Associates  $5,700.00  Central
# 4      HUGHES            Hughinos  $3,800.00  Central
# 6      FERNAS  FernandaIndustries  $3,500.00    South
# 8      FERNAS  FernandaIndustries  $3,000.00     West

Explanation

  • Create a flag column and sort by it descending, so Central rows are prioritised.
  • Drop duplicates by vend_name and quantity (keeping the first occurrence, which is the Central row wherever one exists), then drop the flag column.
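A self-contained sketch of this approach, with the question's sample rows inlined by hand (an assumption; the real frame is read from Excel). A stable sort kind is used so that rows with equal flags keep their original order:

```python
import pandas as pd

# Sample rows from the question, inlined by hand.
df = pd.DataFrame({
    'vend_number': ['CHARLS'] * 4 + ['HUGHES'] * 2 + ['FERNAS'] * 3,
    'vend_name': ['Charlie & Associates'] * 4 + ['Hughinos'] * 2
                 + ['Fernanda Industries'] * 3,
    'quantity': ['$5,700.00'] * 4 + ['$3,800.00'] * 2
                + ['$3,500.00'] * 2 + ['$3,000.00'],
    'source': ['Central', 'South', 'North', 'West',
               'Central', 'South', 'South', 'North', 'West'],
})

# Flag Central rows and sort them to the top (mergesort is stable, so
# ties keep their original order), keep the first row per
# (vend_name, quantity) pair, then discard the helper column.
df['CentralFlag'] = df['source'] == 'Central'
out = (df.sort_values('CentralFlag', ascending=False, kind='mergesort')
         .drop_duplicates(['vend_name', 'quantity'])
         .drop(columns='CentralFlag')
         .sort_index())

print(out)
```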

Upvotes: 7

BENY

Reputation: 323226

You can do it in two steps:

# Step 1: every row whose source is Central
s = df.loc[df['source'] == 'Central', :]
# Step 2: among vendors with no Central row, keep the first row
# per (vendor, quantity) pair
t = df.loc[~df['vend_number'].isin(s['vend_number']), :]

pd.concat([s, t.drop_duplicates(['vend_number', 'quantity'], keep='first')])
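A runnable version of the two steps, with the question's sample rows inlined by hand (an assumption; the real frame comes from the Excel workbook):

```python
import pandas as pd

# Sample rows from the question, inlined by hand.
df = pd.DataFrame({
    'vend_number': ['CHARLS'] * 4 + ['HUGHES'] * 2 + ['FERNAS'] * 3,
    'vend_name': ['Charlie & Associates'] * 4 + ['Hughinos'] * 2
                 + ['Fernanda Industries'] * 3,
    'quantity': ['$5,700.00'] * 4 + ['$3,800.00'] * 2
                + ['$3,500.00'] * 2 + ['$3,000.00'],
    'source': ['Central', 'South', 'North', 'West',
               'Central', 'South', 'South', 'North', 'West'],
})

# Step 1: every row whose source is Central.
s = df.loc[df['source'] == 'Central', :]
# Step 2: among vendors with no Central row, keep the first row
# per (vendor, quantity) pair.
t = df.loc[~df['vend_number'].isin(s['vend_number']), :]
out = pd.concat([s, t.drop_duplicates(['vend_number', 'quantity'], keep='first')])

print(out)
```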

Upvotes: 1
