Reputation: 2936
I have the following dataframe:
(Pdb++) df = pd.DataFrame([{'date': '2019-01-01', 'type': 'Q'},{'date': '2019-01-01', 'type': 'Y'},{'date': '2019-01-01', 'type': 'H'},{'date': '2019-01-02', 'type': 'Q'},{'date': '2019-01-02', 'type': 'Y'},{'date': '2019-01-03', 'type': 'H'},{'date': '2019-01-04', 'type': 'H'},{'date': '2019-01-04', 'type': 'Q'}]).set_index('date')
(Pdb++) df
type
date
2019-01-01 Q
2019-01-01 Y
2019-01-01 H
2019-01-02 Q
2019-01-02 Y
2019-01-03 H
2019-01-04 H
2019-01-04 Q
I want to delete duplicated dates in the index based on the duplicated rows' values in the type
column.
Q
, Y
and H
, delete the H
.Q
and Y
, delete nothing.Q
and H
, delete the H
.A simpler way of putting this is: In a group of duplicated rows, delete the H
, if any.
With this, the expected output of the above would be as follows: (2019-01-01, H)
and (2019-01-04, H)
removed
type
date
2019-01-01 Q
2019-01-01 Y
2019-01-02 Q
2019-01-02 Y
2019-01-03 H
2019-01-04 Q
I can't seem to get this quite right. My two attempts:
df.loc[(df['type'] == 'Y') | (df['type'] == 'Q') | ~df.index.duplicated(keep='last')]
df.loc[(df['type'] == 'Y') | (df['type'] == 'Q') | ~df.index.duplicated(keep='first')]
The first one fails to delete (2019-01-01, H)
and the second one fails to delete (2019-01-04, H)
.
I thought about sorting the type
column first to get the H
values first and then use keep='first'
, but that for some inexplicable reason gives me an error:
(Pdb++) df.sort_values('type').loc[(df['type'] == 'Y') | (df['type'] == 'Q') | ~df.index.duplicated(keep='first')]
*** ValueError: cannot reindex from a duplicate axis
I'm lost here. Any help?
Upvotes: 2
Views: 156
Reputation: 862761
Filter out of all H
values, if not groups with only values H
:
m = df.type.eq('H')
df = df[~m | m.groupby(level=0).transform('all')]
print (df)
type
date
2019-01-01 Q
2019-01-01 Y
2019-01-02 Q
2019-01-02 Y
2019-01-03 H
2019-01-04 Q
Detail:
print (m.groupby(level=0).transform('all'))
date
2019-01-01 False
2019-01-01 False
2019-01-01 False
2019-01-02 False
2019-01-02 False
2019-01-03 True
2019-01-04 False
2019-01-04 False
Name: type, dtype: bool
Upvotes: 1
Reputation: 75080
group the index to and take nunique, then filter the condition where size is greater than 1, also check if type
is H
and negate them:
s=df.groupby(df.index)['type'].transform('nunique')
df[~((s>1)&df.type.eq('H'))]
type
date
2019-01-01 Q
2019-01-01 Y
2019-01-02 Q
2019-01-02 Y
2019-01-03 H
2019-01-04 Q
Upvotes: 3