Reputation: 346
I have an pandas.DataFrame
like that:
expl = {'label':['A','A','A','B','B','B','C','C','C','C','C','A','A'],
'predicted':[1,1,3,2,2,1,3,2,3,4,4,-1,-2]}
}
myDf = DataFrame(expl)
I wanna to create a new column with the values on 'predicted' column but "renamed" with the most frequently values on 'predicted' from each 'label'.
I'm currently using the following code which works as expected:
for group in myDf['predicted'].unique():
if group >=0:
dfg = myDf.loc[myDf['predicted']==group, ['label']]
rename = myDf.loc[myDf['predicted']==group, ['label']].groupby(['label']).size().idxmax()
myDf.loc[myDf['predicted']==group,'renamed']= rename
else:
myDf.loc[myDf['predicted']==group,'renamed']= 'NA'
But I don't think it's the best way to do this. Is there any way to do this better?(in speed or memory)
I also think in save the most frequently values on a dict
and use myDf['predict'].apply(lambda v: d[v] )
to replace the values, but doesn't seem to be faster.
Note: Negative values will be considered Not Available('NA'
).
Expected Output:
label predicted renamed
0 A 1 A
1 A 1 A
2 A 3 C
3 B 2 B
4 B 2 B
5 B 1 A
6 C 3 C
7 C 2 B
8 C 3 C
9 C 4 C
10 C 4 C
11 A -1 NA
12 A -2 NA
Upvotes: 0
Views: 78
Reputation: 24314
Try groupby()
+transform()
and calculate the mode and use mask()
to create NaN where 'predicted' is negative number:
myDf['renamed']=(myDf.groupby('predicted')['label']
.transform(lambda x:x.mode().iloc[0])
.mask(myDf['predicted'].le(-1)))
Upvotes: 1