san
san

Reputation: 173

How to remove duplicates in a data frame using Python

So the data frame is

Product    Price  Weight  Range   Count
   A        40      20      1-3     20
   A        40      20      4-7     23
   B        20      73      1-3     54
   B        20      73      4-7     43
   B        20      73      8-15    34
   B        20      73      >=16    12
   C        10      20      4-7     22

So basically there is a product with price and weight and the range here specifies the no of days the product was sold continuously and the count specifies the count of products sold in that range

Expected Output

Product    Price  Weight  Range   Count
   A        40      20      1-3     20
                            4-7     23
   B        20      73      1-3     54
                            4-7     43
                            8-15    34
   B        20      73      >=16    12
   C        10      20      4-7     22

or

   Product  Price  Weight  1-3   4-7   8-15  >=16
   A        40      20     20     23   NaN    NaN
   B        20      73     54     43   34     1
   C        10      20      0     22   NaN    NaN                     

Upvotes: 2

Views: 116

Answers (3)

jezrael
jezrael

Reputation: 862891

In my opinion first solution is not recommended if need processes DataFrame later.

Second solution is much better and if duplicates in real data is necessary aggregate values, e.g. by sum:

#convert catagoricals to strings
df['Range'] = df['Range'].astype(str)

df = (df.groupby(['Product', 'Price', 'Weight', 'Range'])['Count']
        .sum()
        .unstack(fill_value=0)
        .reset_index())
print (df)
Range Product  Price  Weight  1-3  4-7  8-15  >=16
0           A     40      20   20   23     0     0
1           B     20      73   54   43    34    12
2           C     10      20    0   22     0     0

Upvotes: 2

Mohamed Thasin ah
Mohamed Thasin ah

Reputation: 11192

try this,

mask=df.duplicated(subset=['Product'])
df.loc[mask,['Product','Price','Weight']]=''

Output:

  Product Price Weight Range  Count
0       A    40     20   1-3     20
1                        4-7     23
2       B    20     73   1-3     54
3                        4-7     43
4                       8-15     34
5                       >=16     12
6       C    10    100   4-7     22

.

pd.pivot_table(df,index=['Product','Price','Weight'],columns='Range',values='Count').reset_index()

Output:

Range Product  Price  Weight   1-3   4-7  8-15  >=16
0           A     40      20  20.0  23.0   NaN   NaN
1           B     20      73  54.0  43.0  34.0  12.0
2           C     10     100   NaN  22.0   NaN   NaN

Upvotes: 1

cs95
cs95

Reputation: 402633

Fulfilling the second output makes more sense than the first. Use set_index, followed by unstack.

(df.set_index(['Product', 'Price', 'Weight', 'Range'])
  .Count
  .unstack(fill_value=0)
  .reset_index()
)

Range Product  Price  Weight  1-3  4-7  8-15  >=16
0           A     40      20   20   23     0     0
1           B     20      73   54   43    34    12
2           C     10     100    0   22     0     0

Upvotes: 3

Related Questions