kangaroo_cliff

Reputation: 6222

Remove consecutive identical rows by group

I want to remove consecutive identical rows within each ID group. Rows corresponding to missing dates are assumed to be identical to the one before them, so once the data is sorted by ID and Date there is no need to check whether the dates are consecutive.

I guess one way to do this is to forget about grouping and consider identical values of columns C1, C2, and ID (see the sketch after the desired outcome below).

import pandas as pd

df = pd.DataFrame({'ID': ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B"], 
                   'Date':  [pd.Timestamp('2019-07-12 01:00:00'),
                             pd.Timestamp('2019-07-13 01:00:00'),
                             pd.Timestamp('2019-07-14 01:00:00'),
                             pd.Timestamp('2019-07-15 01:00:00'),
                             pd.Timestamp('2019-07-16 01:00:00'),
                             pd.Timestamp('2019-07-17 01:00:00'), 
                             pd.Timestamp('2019-06-21 01:00:00'), 
                             pd.Timestamp('2019-07-12 01:00:00'),
                             pd.Timestamp('2019-07-13 01:00:00'),
                             pd.Timestamp('2019-07-14 01:00:00'),
                             pd.Timestamp('2019-07-16 01:00:00'),
                             pd.Timestamp('2019-07-17 01:00:00'),
                             pd.Timestamp('2019-07-18 01:00:00')],
                   'C1':[1.0, 4, 4, 4, 4, 3, 3, 3, 3, 5, 5, 5, 3],
                   'C2':[3.0, 4, 4, 4, 3, 3, 3, 3, 5, 5, 5, 5, 3]})

The desired outcome is:

   ID                Date   C1   C2
0   A 2019-07-12 01:00:00  1.0  3.0
4   A 2019-07-16 01:00:00  4.0  3.0
5   A 2019-07-17 01:00:00  3.0  3.0
6   B 2019-06-21 01:00:00  3.0  3.0
7   B 2019-07-12 01:00:00  3.0  3.0
8   B 2019-07-13 01:00:00  3.0  5.0
12  B 2019-07-18 01:00:00  3.0  3.0
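
For reference, a minimal sketch of that compare-with-the-previous-row idea (an illustration, not part of the original question): keep each row whose (ID, C1, C2) triple differs from the row directly above it. Note this keeps the first row of every run, so it does not reproduce the desired outcome above exactly:

# Sketch: keep rows whose (ID, C1, C2) differ from the previous row (df as defined above)
key = df[['ID', 'C1', 'C2']]
changed = (key != key.shift()).any(axis=1)
print(df[changed])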

Upvotes: 0

Views: 55

Answers (2)

Aaj Kaal

Reputation: 1274

You can use groupby with agg, taking the first Date for each (ID, C1, C2) combination.

Code:

import pandas as pd

df = pd.DataFrame({'ID': ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B"], 
                   'Date':  [pd.Timestamp('2019-07-12 01:00:00'),
                             pd.Timestamp('2019-07-13 01:00:00'),
                             pd.Timestamp('2019-07-14 01:00:00'),
                             pd.Timestamp('2019-07-15 01:00:00'),
                             pd.Timestamp('2019-07-16 01:00:00'),
                             pd.Timestamp('2019-07-17 01:00:00'), 
                             pd.Timestamp('2019-06-21 01:00:00'), 
                             pd.Timestamp('2019-07-12 01:00:00'),
                             pd.Timestamp('2019-07-13 01:00:00'),
                             pd.Timestamp('2019-07-14 01:00:00'),
                             pd.Timestamp('2019-07-16 01:00:00'),
                             pd.Timestamp('2019-07-17 01:00:00'),
                             pd.Timestamp('2019-07-18 01:00:00')],
                   'C1':[1.0, 4, 4, 4, 4, 3, 3, 3, 3, 5, 5, 5, 3],
                   'C2':[3.0, 4, 4, 4, 3, 3, 3, 3, 5, 5, 5, 5, 3]})

# Collapse rows sharing the same (ID, C1, C2), keeping the first Date for each combination
df2 = df.groupby(['ID', 'C1', 'C2']).agg({'Date' : 'first'}).reset_index()
print(df2)

Output:

  ID   C1   C2                Date
0  A  1.0  3.0 2019-07-12 01:00:00
1  A  3.0  3.0 2019-07-17 01:00:00
2  A  4.0  3.0 2019-07-16 01:00:00
3  A  4.0  4.0 2019-07-13 01:00:00
4  B  3.0  3.0 2019-06-21 01:00:00
5  B  3.0  5.0 2019-07-13 01:00:00
6  B  5.0  5.0 2019-07-14 01:00:00
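
As a side note, the same aggregation can be written with pandas' named aggregation (available since pandas 0.25); an equivalent formulation, not part of the original answer:

# Named-aggregation equivalent of the groupby/agg call above
df2 = df.groupby(['ID', 'C1', 'C2'], as_index=False).agg(Date=('Date', 'first'))
print(df2)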

Upvotes: 1

Scott Boston

Reputation: 153460

You can try this logic:

# Label runs of consecutive identical C2 values within each ID
# (diff is NaN at each ID's first row, so a new run also starts there)
grp = (df.groupby(['ID'])['C2'].diff() != 0).cumsum()

# Keep rows in runs shorter than 3, plus rows past position 3 in longer runs
mask = df.groupby(grp)['C2'].transform(lambda x: (x.count() < 3) | (x.notna().cumsum() > 3))

df[mask]

Output:

  ID                Date   C1   C2
0   A 2019-07-12 01:00:00  1.0  3.0
4   A 2019-07-16 01:00:00  4.0  3.0
5   A 2019-07-17 01:00:00  3.0  3.0
6   B 2019-06-21 01:00:00  3.0  3.0
7   B 2019-07-12 01:00:00  3.0  3.0
11  B 2019-07-17 01:00:00  5.0  5.0
12  B 2019-07-18 01:00:00  3.0  3.0

Details:

First, let's group by ID and calculate the diff between consecutive rows of C2. Wherever the diff is not equal to 0 (including the NaN at the start of each ID), a new run begins; cumsum over that boolean series then assigns a label to each run of consecutive identical values.
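
For the sample df, the run labels come out like this (from running the grp line above; worth printing yourself to verify):

print(grp.tolist())
# [1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5, 6]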

Next, group the dataframe df using the run labels from step one. Using transform on a column without NaNs, you can get the size of each run with count, and notna with cumsum gives a running position within the run; the mask keeps rows where the run is shorter than 3 or the running position is greater than 3.
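
The two pieces of the mask can be inspected separately; a small illustration, using df and grp from above:

# Run length of each run (count) and running position within it (notna().cumsum())
sizes = df.groupby(grp)['C2'].transform('count')
pos = df.groupby(grp)['C2'].transform(lambda x: x.notna().cumsum())
print(pd.concat({'size': sizes, 'pos': pos}, axis=1))
# mask above is equivalent to (sizes < 3) | (pos > 3)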

You should break this code apart and look at the output of each step.

Upvotes: 2
