Andres
Andres

Reputation: 183

Pandas: filter unique values in groups

I have a dataframe with sales information in a supermarket. Each row in the dataframe represents an item, with several characteristics as columns. The original DataFrame is something like this:

In [1]: import pandas as pd
        my_data = [{'ticket_number' : '001', 'ITEM' : 'vegetable', 'ticket_line' : '1'},
               {'TICKET_NUMBER' : '001', 'ITEM' : 'vegetable', 'TICKET_ROW' : '2'},
               {'TICKET_NUMBER' : '001', 'ITEM' : 'soup', 'TICKET_ROW' : '3'},
               {'TICKET_NUMBER' : '002', 'ITEM' : 'soup', 'TICKET_ROW' : '1'},
               {'TICKET_NUMBER' : '002', 'ITEM' : 'drink', 'TICKET_ROW' : '2'},
               {'TICKET_NUMBER' : '003', 'ITEM' : 'meat', 'TICKET_ROW' : '1'},
               {'TICKET_NUMBER' : '003', 'ITEM' : 'vegetable', 'TICKET_ROW' : '2'},
               {'TICKET_NUMBER' : '003', 'ITEM' : 'meat', 'TICKET_ROW' : '3'}]
        df = pd.DataFrame(my_data)

In [2]: df
Out [2]:    
            TICKET_NUMBER   TICKET_ROW        ITEM
         0        001            1           vegetable
         1        001            2           vegetable
         2        001            3           soup
         3        002            1           soup
         4        002            2           drink
         5        003            1           meat
         6        003            2           vegetable
         7        003            3           meat

I want to filter out duplicated items that belong to the same ticket. For example, in the first ticket (TICKET_NUMBER==001), there are 2 vegetables, so I want to delete 1 of them. The same happens in ticket number 003 with meat.

So, the final dataset would look like this:

        TICKET_NUMBER   TICKET_ROW        ITEM
     0        001            1           vegetable
     1        001            3           soup
     2        002            1           soup
     3        002            2           drink
     4        003            1           meat
     5        003            2           vegetable

My guess was to groupby TICKET_NUMBER, then filter ITEM by unique(), (df.groupby(['TICKET_NUMBER','TICKET_ROW'])['ITEM'].unique()). Once I have the unique values, I would like to reverse those groups (kind of "ungroupby") to a DataFrame. Is that possible?

I'm sure there are other ways of doing what I'm looking for. Please, help!

Thank you!

Upvotes: 6

Views: 7872

Answers (1)

DSM
DSM

Reputation: 353059

I think you're close. It looks like taking the first TICKET_ROW in the case of duplicates would suffice, and we can use as_index=False to keep things looking like the original dataframe. So we can group by TICKET_NUMBER and ITEM and take the first TICKET_ROW:

df.groupby(["TICKET_NUMBER", "ITEM"], sort=False, as_index=False)["TICKET_ROW"].first()

which gives

In [46]: df.groupby(["TICKET_NUMBER", "ITEM"], sort=False, as_index=False)["TICKET_ROW"].first()
Out[46]: 
  TICKET_NUMBER       ITEM TICKET_ROW
0           001  vegetable          1
1           001       soup          3
2           002       soup          1
3           002      drink          2
4           003       meat          1
5           003  vegetable          2

Upvotes: 7

Related Questions