Reputation: 2780
Currently, I imported the following data frame from Excel into pandas and I want to delete duplicate values based in the values of two columns.
# Python 3.5.2
# Pandas library version 0.22
import pandas as pd
# Save the Excel workbook in a variable
current_workbook = pd.ExcelFile('C:\\Users\\userX\\Desktop\\cost_values.xlsx')
# convert the workbook to a data frame
current_worksheet = pd.read_excel(current_workbook, index_col = 'vend_num')
# current output
print(current_worksheet)
| vend_number | vend_name | quantity | source |
| ----------- |----------------------- | -------- | -------- |
CHARLS Charlie & Associates $5,700.00 Central
CHARLS Charlie & Associates $5,700.00 South
CHARLS Charlie & Associates $5,700.00 North
CHARLS Charlie & Associates $5,700.00 West
HUGHES Hughinos $3,800.00 Central
HUGHES Hughinos $3,800.00 South
FERNAS Fernanda Industries $3,500.00 South
FERNAS Fernanda Industries $3,500.00 North
FERNAS Fernanda Industries $3,000.00 West
....
What I want is to remove those duplicate values based in the columns quantity and source:
Review the quantity and source column values:
1.1. If the quantity of a vendor is equal in another row from the same vendor and source is not equal to Central then drop the repeated rows from this vendor except the row Central.
1.2. Else if the quantity of a vendor is equal in another row from the same vendor and there is no source Central then drop the repeated rows.
Desired result
| vend_number | vend_name | quantity | source |
| ----------- |----------------------- | -------- | -------- |
CHARLS Charlie & Associates $5,700.00 Central
HUGHES Hughinos $3,800.00 Central
FERNAS Fernanda Industries $3,500.00 South
FERNAS Fernanda Industries $3,000.00 West
....
So far, I have tried the following code but pandas is not even detecting any duplicate rows.
print(current_worksheet.loc[current_worksheet.duplicated()])
print(current_worksheet.duplicated())
I have tried to figure out the solution but I am struggling quite a bit in this problem, so any help in this question is greatly appreciated. Feel free to improve the question.
Upvotes: 3
Views: 7470
Reputation: 164613
Here is one way.
df['CentralFlag'] = (df['source'] == 'Central')
df = df.sort_values('CentralFlag', ascending=False)\
.drop_duplicates(['vend_name', 'quantity'])\
.drop('CentralFlag', 1)
# vend_number vend_name quantity source
# 0 CHARLS Charlie&Associates $5,700.00 Central
# 4 HUGHES Hughinos $3,800.00 Central
# 6 FERNAS FernandaIndustries $3,500.00 South
# 8 FERNAS FernandaIndustries $3,000.00 West
Explanation
vend_name
and quantity
, then drop the flag column.Upvotes: 7
Reputation: 323226
You can do it two steps
s=df.loc[df['source']=='Central',:]
t=df.loc[~df['vend_number'].isin(s['vend_number']),:]
pd.concat([s,t.drop_duplicates(['vend_number','quantity'],keep='first')])
Upvotes: 1