Reputation: 1887
I have a dataframe that has "tag information" on different companies for both iPad and Tablet platforms. Each "experiment" has an id which can occur multiple times depending on how many tags the experiment has. Experiments can be on iPad or Tablet (type), but i want to remove all of the duplicate experiments (the same experiment that appears in both iPad and Tablet). An experiment is a duplicate if it's from the same company and has the exact same tags. For example in the following dataframe Netflix is a duplicate because it has the same tags (Includes dropdown, Includes product list) for both iPad and Tablet. So either the tablet version or iPad version should be removed.
Input:
id company type tag
1 Netflix iPad Includes dropdown
1 Netflix iPad Includes product list
2 Netflix Tablet Includes dropdown
2 Netflix Tablet Includes product list
3 Apple iPad Includes images
4 Apple Tablet Includes images
Output:
id company type tag
2 Netflix Tablet Includes dropdown
2 Netflix Tablet Includes product list
3 Apple iPad Includes images
4 Apple Tablet Includes images
I'm looking for a solution in pandas python. How can i do this?
I've tried this
df.drop_duplicates(subset=['tag'], keep='last')
But i dont think solution works beacuse theres a possibility that there might be another experiment that is a different company but it contains the same tags. Therefore it will delete this instance even though it's not considered a duplicate.
Basically i want to drop ids that have the same tag for the same company.
Upvotes: 2
Views: 88
Reputation: 81
I think you just need to add company name into your subset parameter. Let's build a dataframe you want:
id = [1, 1, 2, 2, 3, 4]
company = ['Netflix']*4 + ['Apple'] + ['New']
type = ['iPad', 'iPad', 'Tablet', 'Tablet', 'iPad', 'Tablet']
tag = ['Includes dropdown', 'Includes product list']*2 + ['Includes images']*2
data = {'id':id, 'company': company, 'type':type, 'tag':tag}
df = pd.DataFrame(data)
Print df and here is the dataframe:
You see the id 3 and 4 have the same tag but different company names, like you mentioned, if we just use the code you tried:
df.drop_duplicates(subset=['tag'], keep='last')
We will get this:
In the above figure, id 3 was deleted which is what you want to avoid. However, if we just add company to subset:
df.drop_duplicates(subset=['company', 'tag'], keep='last')
Upvotes: 1