Reputation: 4298
I'm trying to figure out whether an ID has already occurred in any of the earlier years (the Duplicate column in dfo). If so, I want to mark the row as a duplicate and record the year in which that ID first occurred (the Year_Duplicate column).
I do have working code (below).
Objective: I'd like to learn a better (more 'pythonic') or more concise way to solve this problem; I'd appreciate any help. I'm not too familiar with all the features that numpy and pandas offer.
Sample Input
dfi.to_dict() =
{'Year': {0: 2020,
1: 2020,
2: 2020,
3: 2021,
4: 2021,
5: 2021,
6: 2022,
7: 2022,
8: 2022},
'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
'$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3}}
Sample Output:
dfo.to_dict()
{'Year': {0: 2020,
1: 2020,
2: 2020,
3: 2021,
4: 2021,
5: 2021,
6: 2022,
7: 2022,
8: 2022},
'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
'$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3},
'Duplicate': {0: False,
1: False,
2: False,
3: True,
4: False,
5: True,
6: False,
7: True,
8: True},
'Year_Duplicate': {0: nan,
1: nan,
2: nan,
3: 2020.0,
4: nan,
5: 2020.0,
6: nan,
7: 2020.0,
8: 2021.0}}
Working code:
import pandas as pd
from numpy import nan as NA

dfi = pd.DataFrame.from_dict(dfi)
dfo = pd.DataFrame.from_dict(dfo)

df_process = dfi.copy()
df_process['Duplicate'] = df_process['ID'].duplicated()
indexes = df_process.groupby('ID')['Year'].idxmin()  # idxmin must be called, not just referenced
df_min_year = df_process[['Year', 'ID']].loc[indexes]
df_min_year = df_min_year.rename(columns={"Year": "Year_Duplicate"})
df_process = pd.merge(df_process, df_min_year, on=['ID'], how='left')
df_process.loc[df_process['Year_Duplicate'] == df_process['Year'], 'Year_Duplicate'] = NA
dfo.equals(df_process)  # returns True
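Incidentally, idxmin() is what keeps this approach correct even when the input is not sorted by year: it returns the row label of each group's minimum Year. A minimal sketch with deliberately shuffled rows (the data here is illustrative, not the sample above):

```python
import pandas as pd

# deliberately out of year order
df = pd.DataFrame({
    'Year': [2022, 2020, 2021, 2021],
    'ID':   [1, 1, 1, 2],
})

# row label of each ID's smallest Year
idx = df.groupby('ID')['Year'].idxmin()
print(idx.to_dict())  # {1: 1, 2: 3}
```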
I'll be happy to provide any clarifications. Thanks for helping me.
Clarification from comments below:
$ is just a number indicating sales; it can be ignored for the purpose of detecting duplicates. Year_Duplicate shows the first year in which that ID occurred. If there is no duplicate, Year_Duplicate is not needed, in which case we leave it blank.
Upvotes: 2
Views: 96
Reputation: 862751
Use Series.duplicated with Series.where, plus GroupBy.transform with 'first':
df['Year_Duplicated'] = df.groupby('ID')['Year'].transform('first').where(df['ID'].duplicated())
print(df)
Year ID $ Year_Duplicated
0 2020 1 1 NaN
1 2020 2 1 NaN
2 2020 3 1 NaN
3 2021 1 2 2020.0
4 2021 4 2 NaN
5 2021 2 2 2020.0
6 2022 5 3 NaN
7 2022 1 3 2020.0
8 2022 4 3 2021.0
Detail:
print(df.groupby('ID')['Year'].transform('first'))
0 2020
1 2020
2 2020
3 2020
4 2021
5 2020
6 2022
7 2020
8 2021
Name: Year, dtype: int64
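For completeness, the other ingredient is the duplicated mask that where() applies on top of the transformed years. A self-contained sketch on the sample data from the question (variable names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022],
    'ID':   [1, 2, 3, 1, 4, 2, 5, 1, 4],
})

# True for every row whose ID has already appeared in an earlier row
mask = df['ID'].duplicated()
print(mask.tolist())  # [False, False, False, True, False, True, False, True, True]

# first year per ID, broadcast back to every row, kept only where the mask is True
result = df.groupby('ID')['Year'].transform('first').where(mask)
print(result.tolist())
```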
Upvotes: 2
Reputation: 993
This produces the column Year_Duplicate the way it is indicated in DataFrame dfo:
dfi['Duplicate'] = dfi.duplicated(subset='ID', keep='first')
first_year = dfi.groupby('ID')['Year'].first()
dfi['Year_Duplicate'] = dfi.loc[dfi['Duplicate'], 'ID'].map(first_year)
Output
Year ID $ Duplicate Year_Duplicate
0 2020 1 1 False NaN
1 2020 2 1 False NaN
2 2020 3 1 False NaN
3 2021 1 2 True 2020.0
4 2021 4 2 False NaN
5 2021 2 2 True 2020.0
6 2022 5 3 False NaN
7 2022 1 3 True 2020.0
8 2022 4 3 True 2021.0
dfo.equals(dfi)  # True
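The intermediate first_year Series is worth inspecting: it maps each ID to the first year it appears, and .map then looks duplicate rows up against it. A small sketch on the sample data (note that .first() relies on the rows already being in year order, unlike the idxmin approach in the question):

```python
import pandas as pd

dfi = pd.DataFrame({
    'Year': [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022],
    'ID':   [1, 2, 3, 1, 4, 2, 5, 1, 4],
})

# groupby preserves row order within groups, so .first() is each ID's earliest year here
first_year = dfi.groupby('ID')['Year'].first()
print(first_year.to_dict())  # {1: 2020, 2: 2020, 3: 2020, 4: 2021, 5: 2022}
```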
Upvotes: 1
Reputation: 150745
You can use groupby().cumcount to flag repeat occurrences, combined with a grouped transform('min') to pull in each ID's first year:
df['Duplicated'] = df.groupby('ID')['Year'].cumcount().gt(0)
df['Year_Duplicated'] = df.groupby('ID')['Year'].transform('min').where(df['Duplicated'])
Output:
Year ID $ Duplicated Year_Duplicated
0 2020 1 1 False NaN
1 2020 2 1 False NaN
2 2020 3 1 False NaN
3 2021 1 2 True 2020.0
4 2021 4 2 False NaN
5 2021 2 2 True 2020.0
6 2022 5 3 False NaN
7 2022 1 3 True 2020.0
8 2022 4 3 True 2021.0
Upvotes: 1
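As a side note, cumcount().gt(0) produces exactly the same mask as Series.duplicated with its default keep='first'. A minimal check on the sample IDs (variable names illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 1, 4, 2, 5, 1, 4]})

# per-ID running count: 0 for the first occurrence, then 1, 2, ...
a = df.groupby('ID').cumcount().gt(0)
# True for every row whose ID was already seen
b = df['ID'].duplicated()
print(a.equals(b))  # True
```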