Eisen
Eisen

Reputation: 1887

Removing observations that have the same column values for a non unique id

I have a dataframe that has "tag information" on different companies for both iPad and Tablet platforms. Each "experiment" has an id which can occur multiple times depending on how many tags the experiment has. Experiments can be on iPad or Tablet (type), but i want to remove all of the duplicate experiments (the same experiment that appears in both iPad and Tablet). An experiment is a duplicate if it's from the same company and has the exact same tags. For example in the following dataframe Netflix is a duplicate because it has the same tags (Includes dropdown, Includes product list) for both iPad and Tablet. So either the tablet version or iPad version should be removed.

Input:

id  company   type       tag
1   Netflix   iPad       Includes dropdown
1   Netflix   iPad       Includes product list
2   Netflix   Tablet     Includes dropdown
2   Netflix   Tablet     Includes product list
3   Apple     iPad       Includes images
4   Apple     Tablet     Includes images

Output:

id  company   type       tag
2   Netflix   Tablet     Includes dropdown
2   Netflix   Tablet     Includes product list
3   Apple     iPad       Includes images
4   Apple     Tablet     Includes images

I'm looking for a solution in pandas python. How can i do this?

I've tried this

df.drop_duplicates(subset=['tag'], keep='last')

But i dont think solution works beacuse theres a possibility that there might be another experiment that is a different company but it contains the same tags. Therefore it will delete this instance even though it's not considered a duplicate.

Basically i want to drop ids that have the same tag for the same company.

Upvotes: 2

Views: 88

Answers (1)

xiao
xiao

Reputation: 81

I think you just need to add company name into your subset parameter. Let's build a dataframe you want:

id = [1, 1, 2, 2, 3, 4]
company = ['Netflix']*4 + ['Apple'] + ['New']
type = ['iPad', 'iPad', 'Tablet', 'Tablet', 'iPad', 'Tablet']
tag = ['Includes dropdown', 'Includes product list']*2 + ['Includes images']*2
data = {'id':id, 'company': company, 'type':type, 'tag':tag}
df = pd.DataFrame(data)

Print df and here is the dataframe: enter image description here

You see the id 3 and 4 have the same tag but different company names, like you mentioned, if we just use the code you tried:

df.drop_duplicates(subset=['tag'], keep='last')

We will get this:

enter image description here

In the above figure, id 3 was deleted which is what you want to avoid. However, if we just add company to subset:

df.drop_duplicates(subset=['company', 'tag'], keep='last')

We will get what you want: enter image description here

Upvotes: 1

Related Questions