Conditionally keep duplicates based on column value

Question

I have the following dataframe:

(Pdb++) df = pd.DataFrame([{'date': '2019-01-01', 'type': 'Q'},{'date': '2019-01-01', 'type': 'Y'},{'date': '2019-01-01', 'type': 'H'},{'date': '2019-01-02', 'type': 'Q'},{'date': '2019-01-02', 'type': 'Y'},{'date': '2019-01-03', 'type': 'H'},{'date': '2019-01-04', 'type': 'H'},{'date': '2019-01-04', 'type': 'Q'}]).set_index('date')
(Pdb++) df
           type
date           
2019-01-01    Q
2019-01-01    Y
2019-01-01    H
2019-01-02    Q
2019-01-02    Y
2019-01-03    H
2019-01-04    H
2019-01-04    Q

I want to delete duplicated dates in the index based on the duplicated rows' values in the type column.

Requirements

If a duplicated date contains Q, Y and H, delete the H.
If a duplicated date contains Q and Y, delete nothing.
If a duplicated date contains Q and H, delete the H.

A simpler way of putting this is: In a group of duplicated rows, delete the H, if any.

Expected output

With this, the expected output of the above would be as follows: (2019-01-01, H) and (2019-01-04, H) removed

           type
date           
2019-01-01    Q
2019-01-01    Y
2019-01-02    Q
2019-01-02    Y
2019-01-03    H
2019-01-04    Q

Attempts

I can't seem to get this quite right. My two attempts:

df.loc[(df['type'] == 'Y') | (df['type'] == 'Q') | ~df.index.duplicated(keep='last')]

df.loc[(df['type'] == 'Y') | (df['type'] == 'Q') | ~df.index.duplicated(keep='first')]

The first one fails to delete (2019-01-01, H) and the second one fails to delete (2019-01-04, H).

I thought about sorting the type column first to get the H values first and then use keep='first', but that for some inexplicable reason gives me an error:

(Pdb++) df.sort_values('type').loc[(df['type'] == 'Y') | (df['type'] == 'Q') | ~df.index.duplicated(keep='first')]
*** ValueError: cannot reindex from a duplicate axis

I'm lost here. Any help?

anky · Accepted Answer

group the index to and take nunique, then filter the condition where size is greater than 1, also check if type is H and negate them:

s=df.groupby(df.index)['type'].transform('nunique')
df[~((s>1)&df.type.eq('H'))]

           type
date           
2019-01-01    Q
2019-01-01    Y
2019-01-02    Q
2019-01-02    Y
2019-01-03    H
2019-01-04    Q

Conditionally keep duplicates based on column value

Requirements

Expected output

Attempts

Answers (2)

Related Questions