Reputation: 495
I have been looking at all the questions/answers about how to drop consecutive duplicates selectively in a pandas dataframe, but I still cannot figure out the following scenario:
import pandas as pd
import numpy as np
def random_dates(start, end, n, freq, seed=None):
    if seed is not None:
        np.random.seed(seed)
    dr = pd.date_range(start, end, freq=freq)
    return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))
date = random_dates('2018-01-01', '2018-01-12', 20, 'H', seed=[3, 1415])
data = {'Timestamp': date,
        'Message': ['Message received.', 'Sending...', 'Sending...', 'Sending...', 'Work in progress...', 'Work in progress...',
                    'Message received.', 'Sending...', 'Sending...', 'Work in progress...',
                    'Message received.', 'Sending...', 'Sending...', 'Sending...', 'Work in progress...', 'Work in progress...', 'Work in progress...',
                    'Message received.', 'Sending...', 'Sending...']}
df = pd.DataFrame(data, columns=['Timestamp', 'Message'])
I have the following dataframe:
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
5 2018-01-04 17:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
15 2018-01-08 15:00:00 Work in progress...
16 2018-01-09 00:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...
I want to drop the consecutive duplicates in the df['Message'] column ONLY when 'Message' is 'Work in progress...', keeping the first instance (here, e.g., Index 5, 15 and 16 need to be dropped). Ideally I would like to get:
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...
I have tried solutions offered in similar posts like:
df['Message'].loc[df['Message'].shift(-1) != df['Message']]
I also calculated the length of the Messages:
df['length'] = df['Message'].apply(lambda x: len(x))
and wrote a conditional drop as:
df.loc[(df['length'] == 17) | (df['length'] == 10) | ~df['Message'].duplicated(keep='first')]
It looks better, but Index 14, 15, and 16 are still dropped altogether, so it is ill-behaved; see:
Timestamp Message length
0 2018-01-02 03:00:00 Message received. 17
1 2018-01-02 11:00:00 Sending... 10
2 2018-01-03 04:00:00 Sending... 10
3 2018-01-04 11:00:00 Sending... 10
4 2018-01-04 16:00:00 Work in progress... 19
6 2018-01-05 05:00:00 Message received. 17
7 2018-01-05 11:00:00 Sending... 10
8 2018-01-05 17:00:00 Sending... 10
10 2018-01-06 14:00:00 Message received. 17
11 2018-01-07 07:00:00 Sending... 10
12 2018-01-07 20:00:00 Sending... 10
13 2018-01-08 01:00:00 Sending... 10
17 2018-01-10 03:00:00 Message received. 17
18 2018-01-10 09:00:00 Sending... 10
19 2018-01-10 14:00:00 Sending... 10
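The reason seems to be that duplicated(keep='first') flags every later occurrence anywhere in the column, not only consecutive repeats. A quick check on the 'Work in progress...' rows alone (using nothing beyond the dataframe built above) illustrates this:
# duplicated(keep='first') marks every repeat in the whole column, so the
# 'Work in progress...' rows at index 5, 9, 14, 15 and 16 are all flagged,
# even though 9 and 14 start new runs and should be kept.
df.loc[df['Message'] == 'Work in progress...', 'Message'].duplicated(keep='first')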
Your time and help are appreciated!
Upvotes: 4
Views: 268
Reputation: 765
(
df1.sql.select("*,lag(Message) over() col1")
.select("*,sum(coalesce((Message!='Work in progress...' or col1!='Work in progress...')::int,0)) over(order by index) col2")
.select("*,row_number() over(partition by col2) col3")
.filter("col3=1")
.order("index")
)
┌───────┬─────────────────────┬─────────────────────┬─────────────────────┬────────┬───────┐
│ index │ Timestamp │ Message │ col1 │ col2 │ col3 │
│ int64 │ varchar │ varchar │ varchar │ int128 │ int64 │
├───────┼─────────────────────┼─────────────────────┼─────────────────────┼────────┼───────┤
│ 0 │ 2018-01-02 03:00:00 │ Message received. │ NULL │ 1 │ 1 │
│ 1 │ 2018-01-02 11:00:00 │ Sending... │ Message received. │ 2 │ 1 │
│ 2 │ 2018-01-03 04:00:00 │ Sending... │ Sending... │ 3 │ 1 │
│ 3 │ 2018-01-04 11:00:00 │ Sending... │ Sending... │ 4 │ 1 │
│ 4 │ 2018-01-04 16:00:00 │ Work in progress... │ Sending... │ 5 │ 1 │
│ 6 │ 2018-01-05 05:00:00 │ Message received. │ Work in progress... │ 6 │ 1 │
│ 7 │ 2018-01-05 11:00:00 │ Sending... │ Message received. │ 7 │ 1 │
│ 8 │ 2018-01-05 17:00:00 │ Sending... │ Sending... │ 8 │ 1 │
│ 9 │ 2018-01-06 02:00:00 │ Work in progress... │ Sending... │ 9 │ 1 │
│ 10 │ 2018-01-06 14:00:00 │ Message received. │ Work in progress... │ 10 │ 1 │
│ 11 │ 2018-01-07 07:00:00 │ Sending... │ Message received. │ 11 │ 1 │
│ 12 │ 2018-01-07 20:00:00 │ Sending... │ Sending... │ 12 │ 1 │
│ 13 │ 2018-01-08 01:00:00 │ Sending... │ Sending... │ 13 │ 1 │
│ 14 │ 2018-01-08 02:00:00 │ Work in progress... │ Sending... │ 14 │ 1 │
│ 17 │ 2018-01-10 03:00:00 │ Message received. │ Work in progress... │ 15 │ 1 │
│ 18 │ 2018-01-10 09:00:00 │ Sending... │ Message received. │ 16 │ 1 │
│ 19 │ 2018-01-10 14:00:00 │ Sending... │ Sending... │ 17 │ 1 │
├───────┴─────────────────────┴─────────────────────┴─────────────────────┴────────┴───────┤
│ 17 rows 6 columns │
└──────────────────────────────────────────────────────────────────────────────────────────┘
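For anyone who wants the same window logic in plain pandas (lag the column, turn the "not both Work in progress..." test into a running sum that acts as a group id, then keep the first row of each group), a rough sketch of the equivalent is below; the variable names are only illustrative, and this is a translation of the query above rather than a separate method:
prev = df['Message'].shift()                       # lag(Message) over()
starts_new_group = (df['Message'] != 'Work in progress...') | (prev != 'Work in progress...')
group_id = starts_new_group.cumsum()               # running sum over index order (col2)
result = df[~group_id.duplicated()]                # first row per group, i.e. col3 = 1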
Upvotes: 0
Reputation: 4792
You can first build a condition that is True for rows whose Message is 'Work in progress...' and equal to the previous row's Message, and then filter those rows out:
condition = (df['Message'] == 'Work in progress...') & (df['Message'] == df['Message'].shift(1))
df[~condition]
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...
Upvotes: 2
Reputation: 863226
First build a mask that keeps only the first value of each consecutive run by comparing with Series.shift,
and chain it with a mask that keeps every row that is not Work in progress...
:
df = df[(df['Message'].shift() != df['Message']) | (df['Message'] != 'Work in progress...')]
print (df)
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...
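If more message types ever need the same treatment, the mask generalizes with isin; a small variant sketch (the targets list is only an illustration, not part of the question):
targets = ['Work in progress...']   # illustrative list; add more messages as needed
df = df[(df['Message'].shift() != df['Message']) | ~df['Message'].isin(targets)]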
Upvotes: 3