Pavan
Pavan

Reputation: 239

Identify the latest series of Continuous same value in Python Pandas DataFrame

I have the following DataFrame (Date in dd-mm-yyyy format):

import pandas as pd
data={'Id':['A', 'B', 'C', 'A', 'B', 'C', 'B', 'C', 'A', 'C', 'B', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
  'Date':['20-10-2022', '20-10-2022', '20-10-2022', '21-10-2022', '21-10-2022', '21-10-2022',
          '22-10-2022', '22-10-2022', '23-10-2022', '23-10-2022', '24-10-2022', '24-10-2022',
          '25-10-2022', '25-10-2022', '26-10-2022', '26-10-2022', '26-10-2022', '27-10-2022',
          '27-10-2022', '27-10-2022']}
df=pd.DataFrame.from_dict(data)
df


   Id        Date
0   A  20-10-2022
1   B  20-10-2022
2   C  20-10-2022
3   A  21-10-2022
4   B  21-10-2022
5   C  21-10-2022
6   B  22-10-2022
7   C  22-10-2022
8   A  23-10-2022
9   C  23-10-2022
10  B  24-10-2022
11  C  24-10-2022
12  B  25-10-2022
13  C  25-10-2022
14  A  26-10-2022
15  B  26-10-2022
16  C  26-10-2022
17  A  27-10-2022
18  B  27-10-2022
19  C  27-10-2022

This is the Final DataFrame that I want:

Final DataFrame that I want

I have tried the following code:

# Find first occurance and last occurance of any given Id.
df_first_duplicate = df.drop_duplicates(subset=['Id'], keep='first')
df_first_duplicate.rename(columns = {'Date':'DateOfFirstOccurance'}, inplace = True)
df_first_duplicate.reset_index(inplace = True, drop = True)
df_last_duplicate = df.drop_duplicates(subset=['Id'], keep='last')
df_last_duplicate.rename(columns = {'Date':'DateOfLastOccurance'}, inplace = True)
df_last_duplicate.reset_index(inplace = True, drop = True)
# Merge the above two df's on key
df_merged = pd.merge(df_first_duplicate, df_last_duplicate, on='Id')
df_merged

But this is the output that I get:

  Id DateOfFirstOccurance DateOfLastOccurance
0  A           20-10-2022          27-10-2022
1  B           20-10-2022          27-10-2022
2  C           20-10-2022          27-10-2022

What should I do to get the desired output?

Upvotes: 1

Views: 129

Answers (2)

jacl613 max31
jacl613 max31

Reputation: 306

df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

records = []
for key, group in df.groupby(by='Id'):
    filt = group['Date'].diff(-1).dt.days >= -1 
    filt.iloc[filt.shape[0]-1] = True
    max_false_index = filt[~filt].index.max()
    min_date = group['Date'].min() if type(max_false_index) == float else group.loc[max_false_index+1:, 'Date'].min()
    records.append([key, min_date, group['Date'].max()])
    
pd.DataFrame(records, columns=['Id', 'DateOfFirstOccurance', 'DateOfLastOccurance'])

Upvotes: 1

Rabinzel
Rabinzel

Reputation: 7923

Here is one way to do it.

Sort your data by Id and Date. Use pandas.Series.diff to get the difference of each row compared to the last one, change it with dt.days to a floating number and create a boolean Series by comparing each value if it is greater/equal to 1. Convert the boolean Series from True/False to 1/0 with astype(int) and build the cumulative sum. The idx with the biggest value is the first/last occurence of your data.

df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
df = df.sort_values(['Id', 'Date'])

out = (
    df
    .groupby('Id')['Date']
    .agg(
        first_occurence = lambda x: x[
            (x.diff().dt.days>1)
            .astype(int)
            .cumsum()
            .idxmax()
        ],
        last_occurence = lambda x: x[
            (x.diff().dt.days==1)
            .astype(int)
            .cumsum()
            .idxmax()
        ],
    )
)

print(out)

Upvotes: 1

Related Questions