L.Stefan
L.Stefan

Reputation: 346

Change values of a column based on most frequently value on other column

I have an pandas.DataFrame like that:

expl = {'label':['A','A','A','B','B','B','C','C','C','C','C','A','A'],
'predicted':[1,1,3,2,2,1,3,2,3,4,4,-1,-2]}

        }
myDf = DataFrame(expl)

I wanna to create a new column with the values on 'predicted' column but "renamed" with the most frequently values on 'predicted' from each 'label'.

I'm currently using the following code which works as expected:

for group in myDf['predicted'].unique():
    if group >=0:
        dfg  = myDf.loc[myDf['predicted']==group, ['label']]
        rename = myDf.loc[myDf['predicted']==group, ['label']].groupby(['label']).size().idxmax()
        myDf.loc[myDf['predicted']==group,'renamed']= rename
    else:
        myDf.loc[myDf['predicted']==group,'renamed']= 'NA'

But I don't think it's the best way to do this. Is there any way to do this better?(in speed or memory)

I also think in save the most frequently values on a dict and use myDf['predict'].apply(lambda v: d[v] ) to replace the values, but doesn't seem to be faster.

Note: Negative values will be considered Not Available('NA').

Expected Output:

    label   predicted   renamed
0   A   1   A
1   A   1   A
2   A   3   C
3   B   2   B
4   B   2   B
5   B   1   A
6   C   3   C
7   C   2   B
8   C   3   C
9   C   4   C
10  C   4   C
11  A   -1  NA
12  A   -2  NA 

Upvotes: 0

Views: 78

Answers (1)

Anurag Dabas
Anurag Dabas

Reputation: 24314

Try groupby()+transform() and calculate the mode and use mask() to create NaN where 'predicted' is negative number:

myDf['renamed']=(myDf.groupby('predicted')['label']
                    .transform(lambda x:x.mode().iloc[0])
                    .mask(myDf['predicted'].le(-1)))

Upvotes: 1

Related Questions