Reputation: 4298
I'm trying to figure out whether an ID has already occurred in any of the earlier years (the Duplicate column in dfo). If so, I want to mark the row as a duplicate and record the year in which that ID first occurred (the Year_Duplicate column).
I do have working code (below).
Objective: I'd like to learn a better (more 'pythonic') or more concise way to solve this problem; I'd appreciate any help. I'm not too familiar with all the features that numpy and pandas offer.
Sample Input
dfi.to_dict() =
{'Year': {0: 2020,
1: 2020,
2: 2020,
3: 2021,
4: 2021,
5: 2021,
6: 2022,
7: 2022,
8: 2022},
'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
'$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3}}
Sample Output:
dfo.to_dict()
{'Year': {0: 2020,
1: 2020,
2: 2020,
3: 2021,
4: 2021,
5: 2021,
6: 2022,
7: 2022,
8: 2022},
'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
'$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3},
'Duplicate': {0: False,
1: False,
2: False,
3: True,
4: False,
5: True,
6: False,
7: True,
8: True},
'Year_Duplicate': {0: nan,
1: nan,
2: nan,
3: 2020.0,
4: nan,
5: 2020.0,
6: nan,
7: 2020.0,
8: 2021.0}}
Working code:
import pandas as pd
from numpy import nan as NA

dfi = pd.DataFrame.from_dict(dfi)
dfo = pd.DataFrame.from_dict(dfo)

df_process = dfi.copy()
df_process['Duplicate'] = df_process['ID'].duplicated()
indexes = df_process.groupby('ID')['Year'].idxmin()  # idxmin must be called, not just referenced
df_min_year = df_process[['Year', 'ID']].loc[indexes]
df_min_year = df_min_year.rename(columns={"Year": "Year_Duplicate"})
df_process = pd.merge(df_process, df_min_year, on=['ID'], how='left')
df_process.loc[df_process['Year_Duplicate'] == df_process['Year'], 'Year_Duplicate'] = NA
dfo.equals(df_process)  # returns True
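Incidentally, idxmin() is what keeps this approach correct even when the input is not sorted by year: it returns the row label of each group's minimum Year. A minimal sketch with deliberately shuffled rows (the data here is illustrative, not the sample above):

```python
import pandas as pd

# deliberately out of year order
df = pd.DataFrame({
    'Year': [2022, 2020, 2021, 2021],
    'ID':   [1, 1, 1, 2],
})

# row label of each ID's smallest Year
idx = df.groupby('ID')['Year'].idxmin()
print(idx.to_dict())  # {1: 1, 2: 3}
```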
I'll be happy to provide any clarifications. Thanks for helping me.
Clarification from comments below:
$ is just a number indicating sales; it can be ignored for the purpose of detecting duplicates. Year_Duplicate shows the first year in which that ID occurred. If there is no duplicate, Year_Duplicate is not needed, in which case we leave it blank.
Upvotes: 2
Views: 96
Reputation: 862751
Use Series.duplicated with Series.where, plus GroupBy.transform with 'first':
df['Year_Duplicated'] = df.groupby('ID')['Year'].transform('first').where(df['ID'].duplicated())
print(df)
Year ID $ Year_Duplicated
0 2020 1 1 NaN
1 2020 2 1 NaN
2 2020 3 1 NaN
3 2021 1 2 2020.0
4 2021 4 2 NaN
5 2021 2 2 2020.0
6 2022 5 3 NaN
7 2022 1 3 2020.0
8 2022 4 3 2021.0
Detail:
print(df.groupby('ID')['Year'].transform('first'))
0 2020
1 2020
2 2020
3 2020
4 2021
5 2020
6 2022
7 2020
8 2021
Name: Year, dtype: int64
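For completeness, the other ingredient is the duplicated mask that where() applies on top of the transformed years. A self-contained sketch on the sample data from the question (variable names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022],
    'ID':   [1, 2, 3, 1, 4, 2, 5, 1, 4],
})

# True for every row whose ID has already appeared in an earlier row
mask = df['ID'].duplicated()
print(mask.tolist())  # [False, False, False, True, False, True, False, True, True]

# first year per ID, broadcast back to every row, kept only where the mask is True
result = df.groupby('ID')['Year'].transform('first').where(mask)
print(result.tolist())
```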
Upvotes: 2
Reputation: 993
This produces the column Year_Duplicate the way it is indicated in DataFrame dfo:
dfi['Duplicate'] = dfi.duplicated(subset='ID', keep='first')
first_year = dfi.groupby('ID')['Year'].first()
dfi['Year_Duplicate'] = dfi.loc[dfi['Duplicate'], 'ID'].map(first_year)
Output
Year ID $ Duplicate Year_Duplicate
0 2020 1 1 False NaN
1 2020 2 1 False NaN
2 2020 3 1 False NaN
3 2021 1 2 True 2020.0
4 2021 4 2 False NaN
5 2021 2 2 True 2020.0
6 2022 5 3 False NaN
7 2022 1 3 True 2020.0
8 2022 4 3 True 2021.0
dfo.equals(dfi)  # True
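The intermediate first_year Series is worth inspecting: it maps each ID to the first year it appears, and .map then looks duplicate rows up against it. A small sketch on the sample data (note that .first() relies on the rows already being in year order, unlike the idxmin approach in the question):

```python
import pandas as pd

dfi = pd.DataFrame({
    'Year': [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022],
    'ID':   [1, 2, 3, 1, 4, 2, 5, 1, 4],
})

# groupby preserves row order within groups, so .first() is each ID's earliest year here
first_year = dfi.groupby('ID')['Year'].first()
print(first_year.to_dict())  # {1: 2020, 2: 2020, 3: 2020, 4: 2021, 5: 2022}
```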
Upvotes: 1
Reputation: 150745
You can use groupby().cumcount to flag repeat occurrences, combined with a grouped transform('min') to pull in each ID's first year:
df['Duplicated'] = df.groupby('ID')['Year'].cumcount().gt(0)
df['Year_Duplicated'] = df.groupby('ID')['Year'].transform('min').where(df['Duplicated'])
Output:
Year ID $ Duplicated Year_Duplicated
0 2020 1 1 False NaN
1 2020 2 1 False NaN
2 2020 3 1 False NaN
3 2021 1 2 True 2020.0
4 2021 4 2 False NaN
5 2021 2 2 True 2020.0
6 2022 5 3 False NaN
7 2022 1 3 True 2020.0
8 2022 4 3 True 2021.0
Upvotes: 1
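As a side note, cumcount().gt(0) produces exactly the same mask as Series.duplicated with its default keep='first'. A minimal check on the sample IDs (variable names illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 1, 4, 2, 5, 1, 4]})

# per-ID running count: 0 for the first occurrence, then 1, 2, ...
a = df.groupby('ID').cumcount().gt(0)
# True for every row whose ID was already seen
b = df['ID'].duplicated()
print(a.equals(b))  # True
```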