Jokab
Jokab

Reputation: 2936

Conditionally keep duplicates based on column value

I have the following dataframe:

(Pdb++) df = pd.DataFrame([{'date': '2019-01-01', 'type': 'Q'},{'date': '2019-01-01', 'type': 'Y'},{'date': '2019-01-01', 'type': 'H'},{'date': '2019-01-02', 'type': 'Q'},{'date': '2019-01-02', 'type': 'Y'},{'date': '2019-01-03', 'type': 'H'},{'date': '2019-01-04', 'type': 'H'},{'date': '2019-01-04', 'type': 'Q'}]).set_index('date')
(Pdb++) df
           type
date           
2019-01-01    Q
2019-01-01    Y
2019-01-01    H
2019-01-02    Q
2019-01-02    Y
2019-01-03    H
2019-01-04    H
2019-01-04    Q

I want to delete duplicated dates in the index based on the duplicated rows' values in the type column.

Requirements

A simpler way of putting this is: In a group of duplicated rows, delete the H, if any.

Expected output

With this, the expected output of the above would be as follows: (2019-01-01, H) and (2019-01-04, H) removed

           type
date           
2019-01-01    Q
2019-01-01    Y
2019-01-02    Q
2019-01-02    Y
2019-01-03    H
2019-01-04    Q

Attempts

I can't seem to get this quite right. My two attempts:

df.loc[(df['type'] == 'Y') | (df['type'] == 'Q') | ~df.index.duplicated(keep='last')]

df.loc[(df['type'] == 'Y') | (df['type'] == 'Q') | ~df.index.duplicated(keep='first')]

The first one fails to delete (2019-01-01, H) and the second one fails to delete (2019-01-04, H).

I thought about sorting the type column first to get the H values first and then use keep='first', but that for some inexplicable reason gives me an error:

(Pdb++) df.sort_values('type').loc[(df['type'] == 'Y') | (df['type'] == 'Q') | ~df.index.duplicated(keep='first')]
*** ValueError: cannot reindex from a duplicate axis

I'm lost here. Any help?

Upvotes: 2

Views: 156

Answers (2)

jezrael
jezrael

Reputation: 862761

Filter out of all H values, if not groups with only values H:

m = df.type.eq('H')
df = df[~m | m.groupby(level=0).transform('all')]
print (df)
           type
date           
2019-01-01    Q
2019-01-01    Y
2019-01-02    Q
2019-01-02    Y
2019-01-03    H
2019-01-04    Q

Detail:

print (m.groupby(level=0).transform('all'))
date
2019-01-01    False
2019-01-01    False
2019-01-01    False
2019-01-02    False
2019-01-02    False
2019-01-03     True
2019-01-04    False
2019-01-04    False
Name: type, dtype: bool

Upvotes: 1

anky
anky

Reputation: 75080

group the index to and take nunique, then filter the condition where size is greater than 1, also check if type is H and negate them:

s=df.groupby(df.index)['type'].transform('nunique')
df[~((s>1)&df.type.eq('H'))]

           type
date           
2019-01-01    Q
2019-01-01    Y
2019-01-02    Q
2019-01-02    Y
2019-01-03    H
2019-01-04    Q

Upvotes: 3

Related Questions