TwinPenguins

Reputation: 495

pandas drop consecutive duplicates selectively

I have been looking at all the questions/answers about how to drop consecutive duplicates selectively in a pandas DataFrame, but I still cannot figure out the following scenario:

import pandas as pd
import numpy as np

def random_dates(start, end, n, freq, seed=None):
    if seed is not None:
        np.random.seed(seed)

    dr = pd.date_range(start, end, freq=freq)
    return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))

date = random_dates('2018-01-01', '2018-01-12', 20, 'H', seed=[3, 1415])

data = {'Timestamp': date, 
        'Message': ['Message received.','Sending...', 'Sending...', 'Sending...', 'Work in progress...', 'Work in progress...', 
                    'Message received.','Sending...', 'Sending...','Work in progress...',
                    'Message received.','Sending...', 'Sending...', 'Sending...','Work in progress...', 'Work in progress...', 'Work in progress...',
                    'Message received.','Sending...', 'Sending...']}

df = pd.DataFrame(data, columns = ['Timestamp', 'Message'])

I have the following dataframe:

             Timestamp              Message
0  2018-01-02 03:00:00    Message received.
1  2018-01-02 11:00:00           Sending...
2  2018-01-03 04:00:00           Sending...
3  2018-01-04 11:00:00           Sending...
4  2018-01-04 16:00:00  Work in progress...
5  2018-01-04 17:00:00  Work in progress...
6  2018-01-05 05:00:00    Message received.
7  2018-01-05 11:00:00           Sending...
8  2018-01-05 17:00:00           Sending...
9  2018-01-06 02:00:00  Work in progress...
10 2018-01-06 14:00:00    Message received.
11 2018-01-07 07:00:00           Sending...
12 2018-01-07 20:00:00           Sending...
13 2018-01-08 01:00:00           Sending...
14 2018-01-08 02:00:00  Work in progress...
15 2018-01-08 15:00:00  Work in progress...
16 2018-01-09 00:00:00  Work in progress...
17 2018-01-10 03:00:00    Message received.
18 2018-01-10 09:00:00           Sending...
19 2018-01-10 14:00:00           Sending...

I want to drop the consecutive duplicates in the df['Message'] column ONLY when 'Message' is 'Work in progress...', keeping the first instance (here, e.g., indexes 5, 15 and 16 need to be dropped). Ideally I would like to get:

             Timestamp              Message
0  2018-01-02 03:00:00    Message received.
1  2018-01-02 11:00:00           Sending...
2  2018-01-03 04:00:00           Sending...
3  2018-01-04 11:00:00           Sending...
4  2018-01-04 16:00:00  Work in progress...
6  2018-01-05 05:00:00    Message received.
7  2018-01-05 11:00:00           Sending...
8  2018-01-05 17:00:00           Sending...
9  2018-01-06 02:00:00  Work in progress...
10 2018-01-06 14:00:00    Message received.
11 2018-01-07 07:00:00           Sending...
12 2018-01-07 20:00:00           Sending...
13 2018-01-08 01:00:00           Sending...
14 2018-01-08 02:00:00  Work in progress...
17 2018-01-10 03:00:00    Message received.
18 2018-01-10 09:00:00           Sending...
19 2018-01-10 14:00:00           Sending...

I have tried solutions offered in similar posts like:

df['Message'].loc[df['Message'].shift(-1) != df['Message']]

I also calculated the length of the Messages:

df['length'] = df['Message'].apply(lambda x: len(x))

and wrote a conditional drop as:

df.loc[(df['length'] ==17) | (df['length'] ==10) | ~df['Message'].duplicated(keep='first')]

It looks better, but indexes 9 and 14, which should be kept, are dropped along with 15 and 16, so it is still ill-behaved; see:

             Timestamp              Message  length
0  2018-01-02 03:00:00    Message received.      17
1  2018-01-02 11:00:00           Sending...      10
2  2018-01-03 04:00:00           Sending...      10
3  2018-01-04 11:00:00           Sending...      10
4  2018-01-04 16:00:00  Work in progress...      19
6  2018-01-05 05:00:00    Message received.      17
7  2018-01-05 11:00:00           Sending...      10
8  2018-01-05 17:00:00           Sending...      10
10 2018-01-06 14:00:00    Message received.      17
11 2018-01-07 07:00:00           Sending...      10
12 2018-01-07 20:00:00           Sending...      10
13 2018-01-08 01:00:00           Sending...      10
17 2018-01-10 03:00:00    Message received.      17
18 2018-01-10 09:00:00           Sending...      10
19 2018-01-10 14:00:00           Sending...      10

Your time and help are appreciated!

Upvotes: 4

Views: 268

Answers (3)

G.G

Reputation: 765

(
df1.sql.select("*,lag(Message) over() col1")
.select("*,sum(coalesce((Message!='Work in progress...' or col1!='Work in progress...')::int,0)) over(order by index) col2")
.select("*,row_number() over(partition by col2) col3")
.filter("col3=1")
.order("index")
)

┌───────┬─────────────────────┬─────────────────────┬─────────────────────┬────────┬───────┐
│ index │      Timestamp      │       Message       │        col1         │  col2  │ col3  │
│ int64 │       varchar       │       varchar       │       varchar       │ int128 │ int64 │
├───────┼─────────────────────┼─────────────────────┼─────────────────────┼────────┼───────┤
│     0 │ 2018-01-02 03:00:00 │ Message received.   │ NULL                │      1 │     1 │
│     1 │ 2018-01-02 11:00:00 │ Sending...          │ Message received.   │      2 │     1 │
│     2 │ 2018-01-03 04:00:00 │ Sending...          │ Sending...          │      3 │     1 │
│     3 │ 2018-01-04 11:00:00 │ Sending...          │ Sending...          │      4 │     1 │
│     4 │ 2018-01-04 16:00:00 │ Work in progress... │ Sending...          │      5 │     1 │
│     6 │ 2018-01-05 05:00:00 │ Message received.   │ Work in progress... │      6 │     1 │
│     7 │ 2018-01-05 11:00:00 │ Sending...          │ Message received.   │      7 │     1 │
│     8 │ 2018-01-05 17:00:00 │ Sending...          │ Sending...          │      8 │     1 │
│     9 │ 2018-01-06 02:00:00 │ Work in progress... │ Sending...          │      9 │     1 │
│    10 │ 2018-01-06 14:00:00 │ Message received.   │ Work in progress... │     10 │     1 │
│    11 │ 2018-01-07 07:00:00 │ Sending...          │ Message received.   │     11 │     1 │
│    12 │ 2018-01-07 20:00:00 │ Sending...          │ Sending...          │     12 │     1 │
│    13 │ 2018-01-08 01:00:00 │ Sending...          │ Sending...          │     13 │     1 │
│    14 │ 2018-01-08 02:00:00 │ Work in progress... │ Sending...          │     14 │     1 │
│    17 │ 2018-01-10 03:00:00 │ Message received.   │ Work in progress... │     15 │     1 │
│    18 │ 2018-01-10 09:00:00 │ Sending...          │ Message received.   │     16 │     1 │
│    19 │ 2018-01-10 14:00:00 │ Sending...          │ Sending...          │     17 │     1 │
├───────┴─────────────────────┴─────────────────────┴─────────────────────┴────────┴───────┤
│ 17 rows                                                                        6 columns │
└──────────────────────────────────────────────────────────────────────────────────────────┘
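
Note: the df1.sql accessor used above is not part of standard pandas, so this answer presumably relies on a SQL-on-DataFrame wrapper (DuckDB-style window functions) applied to the frame with its index reset to a column. A rough, hedged equivalent using DuckDB's Python API directly might look like the sketch below; the df1 name and the single-query formulation are assumptions, not part of the original answer.

import duckdb

# Assumption: expose the RangeIndex as an ordinary column named "index".
df1 = df.reset_index()

# Keep a row unless both it and the previous row are 'Work in progress...'.
result = duckdb.sql("""
    WITH lagged AS (
        SELECT *, lag(Message) OVER (ORDER BY "index") AS prev_msg
        FROM df1
    )
    SELECT "index", Timestamp, Message
    FROM lagged
    WHERE Message <> 'Work in progress...'
       OR prev_msg IS DISTINCT FROM 'Work in progress...'
    ORDER BY "index"
""").df()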

Upvotes: 0

Mohit Motwani

Reputation: 4792

You can first flag all messages equal to 'Work in progress...' that also match the previous element, and then filter those rows out:

condition = (df['Message'] == 'Work in progress...') & (df['Message']==df['Message'].shift(1))

df[~condition]

     Timestamp           Message
0   2018-01-02 03:00:00 Message received.
1   2018-01-02 11:00:00 Sending...
2   2018-01-03 04:00:00 Sending...
3   2018-01-04 11:00:00 Sending...
4   2018-01-04 16:00:00 Work in progress...
6   2018-01-05 05:00:00 Message received.
7   2018-01-05 11:00:00 Sending...
8   2018-01-05 17:00:00 Sending...
9   2018-01-06 02:00:00 Work in progress...
10  2018-01-06 14:00:00 Message received.
11  2018-01-07 07:00:00 Sending...
12  2018-01-07 20:00:00 Sending...
13  2018-01-08 01:00:00 Sending...
14  2018-01-08 02:00:00 Work in progress...
17  2018-01-10 03:00:00 Message received.
18  2018-01-10 09:00:00 Sending...
19  2018-01-10 14:00:00 Sending...
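
If you need this in more than one place, the same mask can be wrapped in a small helper. A minimal sketch, assuming the function name and signature are free to choose (they are not part of the original answer):

def drop_consecutive(frame, column, value):
    # Flag rows where `column` equals `value` and the previous row did too;
    # dropping them keeps the first row of every consecutive run.
    repeated = (frame[column] == value) & (frame[column] == frame[column].shift(1))
    return frame[~repeated]

cleaned = drop_consecutive(df, 'Message', 'Work in progress...')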

Upvotes: 2

jezrael

Reputation: 863226

First build a mask that keeps only the first value of each consecutive run by comparing with Series.shift, then chain it with | to a mask that keeps all rows whose value is not Work in progress...:

df = df[(df['Message'].shift() != df['Message']) | (df['Message'] != 'Work in progress...')]
print (df)
             Timestamp              Message
0  2018-01-02 03:00:00    Message received.
1  2018-01-02 11:00:00           Sending...
2  2018-01-03 04:00:00           Sending...
3  2018-01-04 11:00:00           Sending...
4  2018-01-04 16:00:00  Work in progress...
6  2018-01-05 05:00:00    Message received.
7  2018-01-05 11:00:00           Sending...
8  2018-01-05 17:00:00           Sending...
9  2018-01-06 02:00:00  Work in progress...
10 2018-01-06 14:00:00    Message received.
11 2018-01-07 07:00:00           Sending...
12 2018-01-07 20:00:00           Sending...
13 2018-01-08 01:00:00           Sending...
14 2018-01-08 02:00:00  Work in progress...
17 2018-01-10 03:00:00    Message received.
18 2018-01-10 09:00:00           Sending...
19 2018-01-10 14:00:00           Sending...
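
If more than one message type should be deduplicated this way, the second condition generalizes with Series.isin. A minimal sketch; the drop_if_repeated name is illustrative, not part of the original answer:

# Drop consecutive duplicates only for messages in this set, keeping the first of each run.
drop_if_repeated = {'Work in progress...'}
mask = (df['Message'].shift() != df['Message']) | ~df['Message'].isin(drop_if_repeated)
df = df[mask]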

Upvotes: 3
