Identify the latest series of Continuous same value in Python Pandas DataFrame

Question

I have the following DataFrame (Date in dd-mm-yyyy format):

import pandas as pd
data={'Id':['A', 'B', 'C', 'A', 'B', 'C', 'B', 'C', 'A', 'C', 'B', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
  'Date':['20-10-2022', '20-10-2022', '20-10-2022', '21-10-2022', '21-10-2022', '21-10-2022',
          '22-10-2022', '22-10-2022', '23-10-2022', '23-10-2022', '24-10-2022', '24-10-2022',
          '25-10-2022', '25-10-2022', '26-10-2022', '26-10-2022', '26-10-2022', '27-10-2022',
          '27-10-2022', '27-10-2022']}
df=pd.DataFrame.from_dict(data)
df


   Id        Date
0   A  20-10-2022
1   B  20-10-2022
2   C  20-10-2022
3   A  21-10-2022
4   B  21-10-2022
5   C  21-10-2022
6   B  22-10-2022
7   C  22-10-2022
8   A  23-10-2022
9   C  23-10-2022
10  B  24-10-2022
11  C  24-10-2022
12  B  25-10-2022
13  C  25-10-2022
14  A  26-10-2022
15  B  26-10-2022
16  C  26-10-2022
17  A  27-10-2022
18  B  27-10-2022
19  C  27-10-2022

This is the Final DataFrame that I want:

I have tried the following code:

# Find first occurance and last occurance of any given Id.
df_first_duplicate = df.drop_duplicates(subset=['Id'], keep='first')
df_first_duplicate.rename(columns = {'Date':'DateOfFirstOccurance'}, inplace = True)
df_first_duplicate.reset_index(inplace = True, drop = True)
df_last_duplicate = df.drop_duplicates(subset=['Id'], keep='last')
df_last_duplicate.rename(columns = {'Date':'DateOfLastOccurance'}, inplace = True)
df_last_duplicate.reset_index(inplace = True, drop = True)
# Merge the above two df's on key
df_merged = pd.merge(df_first_duplicate, df_last_duplicate, on='Id')
df_merged

But this is the output that I get:

  Id DateOfFirstOccurance DateOfLastOccurance
0  A           20-10-2022          27-10-2022
1  B           20-10-2022          27-10-2022
2  C           20-10-2022          27-10-2022

What should I do to get the desired output?

jacl613 max31 · Accepted Answer

df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

records = []
for key, group in df.groupby(by='Id'):
    filt = group['Date'].diff(-1).dt.days >= -1 
    filt.iloc[filt.shape[0]-1] = True
    max_false_index = filt[~filt].index.max()
    min_date = group['Date'].min() if type(max_false_index) == float else group.loc[max_false_index+1:, 'Date'].min()
    records.append([key, min_date, group['Date'].max()])
    
pd.DataFrame(records, columns=['Id', 'DateOfFirstOccurance', 'DateOfLastOccurance'])

Identify the latest series of Continuous same value in Python Pandas DataFrame

Answers (2)

Related Questions