Reputation:

Python/Pandas - remove not duplicated rows

I have DataFrame like this:

        product_id          dt  stock_qty
226870     2948259  2017-11-11     17.000
233645     2948259  2017-11-12     17.000
240572     2948260  2017-11-13      5.000
247452     2948260  2017-11-14      5.000
233644     2948260  2017-11-12      5.000
226869     2948260  2017-11-11      5.000
247451     2948262  2017-11-14     -2.000
226868     2948262  2017-11-11     -1.000  <- not duplicated
240571     2948262  2017-11-13     -2.000
240570     2948264  2017-11-13      5.488
233643     2948264  2017-11-12      5.488
244543     2948269  2017-11-11      2.500
247450     2948276  2017-11-14      3.250
226867     2948276  2017-11-11      3.250

I have to remove rows where stock_qty are different but product_id values are the same. So I should get DataFrame like this:

        product_id          dt  stock_qty
226870     2948259  2017-11-11     17.000
233645     2948259  2017-11-12     17.000
240572     2948260  2017-11-13      5.000
247452     2948260  2017-11-14      5.000
233644     2948260  2017-11-12      5.000
226869     2948260  2017-11-11      5.000
240570     2948264  2017-11-13      5.488
233643     2948264  2017-11-12      5.488
244543     2948269  2017-11-11      2.500
247450     2948276  2017-11-14      3.250
226867     2948276  2017-11-11      3.250

Thanks for help!

Upvotes: 4

Answers (4)

Danilo Matrangolo Marano

Reputation: 942

With loc[] you can filter just the duplicated rows and assign to your original dataframe.

df = df.loc[df.duplicated(subset=['product_id','stock_qty'], keep=False)]

Also the keep=False parameter mark all the duplicated rows as True, if you want just the first or the last use keep='first' or keep='last'

Upvotes: 0

jezrael

Reputation: 862471

You need drop_duplicates for get all product_id values and then exclude them by isin with another condition chained by xor (^):

m1 = df['product_id'].isin(df.drop_duplicates('stock_qty', keep=False)['product_id'])
m2 = df.duplicated('product_id', keep=False)

df = df[m1 ^ m2]
print (df)
        product_id          dt  stock_qty
226870     2948259  2017-11-11     17.000
233645     2948259  2017-11-12     17.000
240572     2948260  2017-11-13      5.000
247452     2948260  2017-11-14      5.000
233644     2948260  2017-11-12      5.000
226869     2948260  2017-11-11      5.000
240570     2948264  2017-11-13      5.488
233643     2948264  2017-11-12      5.488
244543     2948269  2017-11-11      2.500
247450     2948276  2017-11-14      3.250
226867     2948276  2017-11-11      3.250

Detail:

print (m1)
226870    False
233645    False
240572    False
247452    False
233644    False
226869    False
247451     True
226868     True
240571     True
240570    False
233643    False
244543     True
247450    False
226867    False
Name: product_id, dtype: bool

print (m2)
226870     True
233645     True
240572     True
247452     True
233644     True
226869     True
247451     True
226868     True
240571     True
240570     True
233643     True
244543    False
247450     True
226867     True
dtype: bool

Upvotes: 4

BENY

Reputation: 323226

By using drop_duplicates

df.drop(df.drop_duplicates(['stock_qty', 'product_id'], keep=False).index)
Out[797]: 
        product_id          dt  stock_qty
226870     2948259  2017-11-11     17.000
233645     2948259  2017-11-12     17.000
240572     2948260  2017-11-13      5.000
247452     2948260  2017-11-14      5.000
233644     2948260  2017-11-12      5.000
226869     2948260  2017-11-11      5.000
247451     2948262  2017-11-14     -2.000
240571     2948262  2017-11-13     -2.000
240570     2948264  2017-11-13      5.488
233643     2948264  2017-11-12      5.488
247450     2948276  2017-11-14      3.250
226867     2948276  2017-11-11      3.250

Upvotes: 4

Scott Boston

Reputation: 153460

@jezrael solution is optimal, but another way is using groupby and filter:

df.groupby(['product_id','stock_qty']).filter(lambda x: len(x)>1)

Output:

        product_id          dt  stock_qty
226870     2948259  2017-11-11     17.000
233645     2948259  2017-11-12     17.000
240572     2948260  2017-11-13      5.000
247452     2948260  2017-11-14      5.000
233644     2948260  2017-11-12      5.000
226869     2948260  2017-11-11      5.000
247451     2948262  2017-11-14     -2.000
240571     2948262  2017-11-13     -2.000
240570     2948264  2017-11-13      5.488
233643     2948264  2017-11-12      5.488
247450     2948276  2017-11-14      3.250
226867     2948276  2017-11-11      3.250

Upvotes: 4

Python/Pandas - remove not duplicated rows

Answers (4)

Related Questions