Removing duplicate values based on multiple conditions through pandas

Question

My DataFrame looks like this:

APMC   Commodity    Year    Month   Price
1       A           2015    Jan     1232
1       A           2015    Jan     1654
2       A           2015    Jan     9897
2       A           2015    Feb     3467
2       B           2016    Jan     7878
2       B           2016    Feb     8545 
2       B           2016    Feb     3948

I want to remove the second and the last rows, because their values in the APMC, Year, Commodity and Month columns are the same as those of the row before them. How do I do this? The original data set is huge, and I want to modify it directly (think of something like inplace=True).

Brad Solomon · Accepted Answer

You can specify the columns on which to detect duplicates with subset. By default, the first occurrence of each combination is kept:

df.drop_duplicates(subset=['APMC', 'Year', 'Commodity', 'Month'], 
                   inplace=True)

Result:

>>> df
   APMC Commodity  Year Month  Price
0     1         A  2015   Jan   1232
2     2         A  2015   Jan   9897
3     2         A  2015   Feb   3467
4     2         B  2016   Jan   7878
5     2         B  2016   Feb   8545

Index labels of the rows that were dropped:

>>> pd.RangeIndex(0, 7).difference(df.index)
Int64Index([1, 6], dtype='int64')
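
For anyone who wants to reproduce this end to end, here is a minimal, self-contained sketch. It rebuilds the sample frame from the question, previews which rows would be dropped using duplicated(), and then deduplicates without inplace=True by assigning the result to a new name. The names key_cols, dup_mask, dropped_labels, deduped and deduped_last are illustrative and do not come from the original post.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "APMC": [1, 1, 2, 2, 2, 2, 2],
    "Commodity": ["A", "A", "A", "A", "B", "B", "B"],
    "Year": [2015, 2015, 2015, 2015, 2016, 2016, 2016],
    "Month": ["Jan", "Jan", "Jan", "Feb", "Jan", "Feb", "Feb"],
    "Price": [1232, 1654, 9897, 3467, 7878, 8545, 3948],
})

# Columns that together define a duplicate (illustrative name).
key_cols = ["APMC", "Year", "Commodity", "Month"]

# Boolean mask: True for each repeat of a key combination after its
# first occurrence. Useful for inspecting rows before dropping them.
dup_mask = df.duplicated(subset=key_cols, keep="first")
dropped_labels = df.index[dup_mask].tolist()

# Same result as the inplace call above, but returned as a new frame.
deduped = df.drop_duplicates(subset=key_cols, keep="first")

# keep="last" retains the final occurrence of each combination instead.
deduped_last = df.drop_duplicates(subset=key_cols, keep="last")

print(dropped_labels)
print(deduped)
```

Which Price survives depends on keep: "first" retains 1232 and 8545, while "last" retains 1654 and 3948. If neither is the right choice (for example, you want the highest price per group), sort the frame accordingly before dropping duplicates.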
