Reputation: 6222
Remove rows
Rows corresponding to missing dates are assumed to be identical to the one before them, so once the data is sorted by ID and Date, there is no need to check whether the dates are consecutive.
I guess one way to do this is to forget about grouping and instead look at identical values in columns C1, C2, and ID.
import pandas as pd

df = pd.DataFrame({'ID': ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B"],
                   'Date': [pd.Timestamp('2019-07-12 01:00:00'),
                            pd.Timestamp('2019-07-13 01:00:00'),
                            pd.Timestamp('2019-07-14 01:00:00'),
                            pd.Timestamp('2019-07-15 01:00:00'),
                            pd.Timestamp('2019-07-16 01:00:00'),
                            pd.Timestamp('2019-07-17 01:00:00'),
                            pd.Timestamp('2019-06-21 01:00:00'),
                            pd.Timestamp('2019-07-12 01:00:00'),
                            pd.Timestamp('2019-07-13 01:00:00'),
                            pd.Timestamp('2019-07-14 01:00:00'),
                            pd.Timestamp('2019-07-16 01:00:00'),
                            pd.Timestamp('2019-07-17 01:00:00'),
                            pd.Timestamp('2019-07-18 01:00:00')],
                   'C1': [1.0, 4, 4, 4, 4, 3, 3, 3, 3, 5, 5, 5, 3],
                   'C2': [3.0, 4, 4, 4, 3, 3, 3, 3, 5, 5, 5, 5, 3]})
The desired outcome is
   ID                Date   C1   C2
0   A 2019-07-12 01:00:00  1.0  3.0
4   A 2019-07-16 01:00:00  4.0  3.0
5   A 2019-07-17 01:00:00  3.0  3.0
6   B 2019-06-21 01:00:00  3.0  3.0
7   B 2019-07-12 01:00:00  3.0  3.0
8   B 2019-07-13 01:00:00  3.0  5.0
12  B 2019-07-18 01:00:00  3.0  3.0
Upvotes: 0
Views: 55
Reputation: 1274
You can group by ID, C1, and C2 and take the first Date with agg:
Code:
import pandas as pd
df = pd.DataFrame({'ID': ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B"],
                   'Date': [pd.Timestamp('2019-07-12 01:00:00'),
                            pd.Timestamp('2019-07-13 01:00:00'),
                            pd.Timestamp('2019-07-14 01:00:00'),
                            pd.Timestamp('2019-07-15 01:00:00'),
                            pd.Timestamp('2019-07-16 01:00:00'),
                            pd.Timestamp('2019-07-17 01:00:00'),
                            pd.Timestamp('2019-06-21 01:00:00'),
                            pd.Timestamp('2019-07-12 01:00:00'),
                            pd.Timestamp('2019-07-13 01:00:00'),
                            pd.Timestamp('2019-07-14 01:00:00'),
                            pd.Timestamp('2019-07-16 01:00:00'),
                            pd.Timestamp('2019-07-17 01:00:00'),
                            pd.Timestamp('2019-07-18 01:00:00')],
                   'C1': [1.0, 4, 4, 4, 4, 3, 3, 3, 3, 5, 5, 5, 3],
                   'C2': [3.0, 4, 4, 4, 3, 3, 3, 3, 5, 5, 5, 5, 3]})
df2 = df.groupby(['ID', 'C1', 'C2']).agg({'Date' : 'first'}).reset_index()
print(df2)
Output:
  ID   C1   C2                Date
0  A  1.0  3.0 2019-07-12 01:00:00
1  A  3.0  3.0 2019-07-17 01:00:00
2  A  4.0  3.0 2019-07-16 01:00:00
3  A  4.0  4.0 2019-07-13 01:00:00
4  B  3.0  3.0 2019-06-21 01:00:00
5  B  3.0  5.0 2019-07-13 01:00:00
6  B  5.0  5.0 2019-07-14 01:00:00
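If you want the result laid out like the original frame (same column order, sorted by ID and Date), you could tack a reorder and sort onto the same groupby; this is a sketch of that variant (the `as_index=False` form and the sorting are additions, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': list("AAAAAA") + list("BBBBBBB"),
    'Date': pd.to_datetime([
        '2019-07-12 01:00', '2019-07-13 01:00', '2019-07-14 01:00',
        '2019-07-15 01:00', '2019-07-16 01:00', '2019-07-17 01:00',
        '2019-06-21 01:00', '2019-07-12 01:00', '2019-07-13 01:00',
        '2019-07-14 01:00', '2019-07-16 01:00', '2019-07-17 01:00',
        '2019-07-18 01:00']),
    'C1': [1.0, 4, 4, 4, 4, 3, 3, 3, 3, 5, 5, 5, 3],
    'C2': [3.0, 4, 4, 4, 3, 3, 3, 3, 5, 5, 5, 5, 3]})

df2 = (df.groupby(['ID', 'C1', 'C2'], as_index=False)
         .agg({'Date': 'first'})
         [['ID', 'Date', 'C1', 'C2']]   # restore the original column order
         .sort_values(['ID', 'Date'])
         .reset_index(drop=True))
print(df2)
```

Note that this still groups *all* rows with identical (ID, C1, C2), not only consecutive runs, so repeated values that reappear later collapse into one row.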
Upvotes: 1
Reputation: 153460
You can try this logic:
grp = (df.groupby(['ID'])['C2'].diff() != 0).cumsum()
mask = df.groupby(grp)['C2'].transform(lambda x: (x.count() < 3) | (x.notna().cumsum() > 3))
df[mask]
Output:
   ID                Date   C1   C2
0   A 2019-07-12 01:00:00  1.0  3.0
4   A 2019-07-16 01:00:00  4.0  3.0
5   A 2019-07-17 01:00:00  3.0  3.0
6   B 2019-06-21 01:00:00  3.0  3.0
7   B 2019-07-12 01:00:00  3.0  3.0
11  B 2019-07-17 01:00:00  5.0  5.0
12  B 2019-07-18 01:00:00  3.0  3.0
Details:
First, group by ID and take the diff between consecutive C2 values; wherever the diff is not equal to 0 (which includes the NaN at the start of each ID), apply cumsum to label each run of identical consecutive values.
Next, group the dataframe df by the run labels defined in step one. Using transform on a column without NaNs, you can get the size of each group, and combining notna with cumsum gives a running count within the group.
You should break this code apart and look at the output of each step.
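For example, the two steps above can be run separately and their intermediate results printed (same code as the answer, just split up for inspection):

```python
import pandas as pd

df = pd.DataFrame({'ID': list("AAAAAA") + list("BBBBBBB"),
                   'C2': [3.0, 4, 4, 4, 3, 3, 3, 3, 5, 5, 5, 5, 3]})

# Step 1: within each ID, flag rows where C2 changes (diff != 0; the NaN on
# each ID's first row also compares unequal to 0), then cumsum the flags to
# label each run of identical consecutive values.
grp = (df.groupby('ID')['C2'].diff() != 0).cumsum()
print(grp.tolist())  # [1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5, 6]

# Step 2: within each run, keep every row of short runs (fewer than 3
# values) and, in longer runs, only the rows after the third one.
mask = df.groupby(grp)['C2'].transform(
    lambda x: (x.count() < 3) | (x.notna().cumsum() > 3))
print(df[mask])
```

Here the rows that survive are indices 0, 4, 5, 6, 7, 11 and 12, matching the output above.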
Upvotes: 2