pandas dataframe duplicate values count not properly working

Question

The value counts are: df['ID'].value_counts().values -----> array([4,3,3,1], dtype=int64)

input:

ID emp
a  1
a  1
b  1
a  1
b  1
c  1
c  1
a  1
b  1
c  1
d  1

When I jumble the ID column and run:

df.loc[~df.duplicated(keep='first', subset=['ID']), 'emp']= df['ID'].value_counts().values

output:

ID emp 
a  4
c  3
d  3
c  1
b  1
a  1
c  1
a  1
b  1
b  1
a  1

expected result:

ID emp 
a  4
c  3
d  1
c  1
b  3
a  1
c  1
a  1
b  1
b  1
a  1

Problem: the counts are assigned to emp without checking which ID each row belongs to.

jezrael · Accepted Answer

The problem is that the output of df['ID'].value_counts() is a Series with one value per unique ID, so it has a different number of values than the original data. To fill the new column with the count for each row's ID, use Series.map:

df.loc[~df.duplicated(subset=['ID']), 'emp'] = df['ID'].map(df['ID'].value_counts())

Or GroupBy.transform with size:

df.loc[~df.duplicated(subset=['ID']), 'emp'] = df.groupby('ID')['ID'].transform('size')
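For a runnable check, here is a minimal sketch of the Series.map variant. The frame is reconstructed for illustration from the ID order shown in the question's expected result, so the exact construction is an assumption, not the asker's real data:

```python
import pandas as pd

# Reconstructed sample (assumption): the jumbled ID order from the
# question's expected result, with emp starting at 1 everywhere.
df = pd.DataFrame({"ID": list("acdcbacabba"), "emp": 1})

# True only for the first occurrence of each ID.
first = ~df.duplicated(subset=["ID"])

# map() returns one count per row, indexed like df, so the assignment
# is aligned by row label instead of by position.
df.loc[first, "emp"] = df["ID"].map(df["ID"].value_counts())

print(df)
```

Only the first row of each ID receives its count; every other row keeps emp = 1.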

The 4-value Series returned by value_counts cannot be assigned back directly, because its index (the ID values) does not match df.index:

print (df['ID'].value_counts())
a    4
b    3
c    3
d    1
Name: ID, dtype: int64

If it is converted to a NumPy array, the 4 values are assigned by position. The DataFrame has 4 groups (a, b, c, d), so ~df.duplicated(subset=['ID']) is True 4 times, but the counts arrive in the order 4, 3, 3, 1 rather than in the order in which the IDs first appear. That is the reason for the wrong output:

print (df['ID'].value_counts().values)
[4 3 3 1]
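The positional mismatch can be made visible with a small sketch. It assumes the jumbled ID order from the question's expected result; the names first_ids, by_position and by_label are only illustrative:

```python
import pandas as pd

# Reconstructed sample (assumption): jumbled ID order from the question.
df = pd.DataFrame({"ID": list("acdcbacabba"), "emp": 1})
first = ~df.duplicated(subset=["ID"])

counts = df["ID"].value_counts()          # sorted by count, descending
first_ids = df.loc[first, "ID"].tolist()  # IDs in order of first appearance

# Pairing by position: what assigning the bare NumPy array effectively does.
by_position = dict(zip(first_ids, counts.to_numpy().tolist()))

# Pairing by label: each ID looks up its own count.
by_label = counts.reindex(first_ids).to_dict()

print(by_position)
print(by_label)
```

With positional pairing, d receives 3 and b receives 1, which is the wrong output from the question; with label-based pairing each ID gets its own count.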

What is needed is a new column (Series) with the same index as df:

print (df['ID'].map(df['ID'].value_counts()))
0     4
1     4
2     3
3     4
4     3
5     3
6     3
7     4
8     3
9     3
10    1
Name: ID, dtype: int64

print (df.groupby('ID')['ID'].transform('size'))
0     4
1     4
2     3
3     4
4     3
5     3
6     3
7     4
8     3
9     3
10    1
Name: ID, dtype: int64
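As a quick sanity check, the following sketch confirms that both expressions give the same per-row counts and share the index of df. It assumes the input order from the question; the variable names mapped and sized are illustrative:

```python
import pandas as pd

# Input order from the question (reconstructed for illustration).
df = pd.DataFrame({"ID": list("aababccabcd"), "emp": 1})

mapped = df["ID"].map(df["ID"].value_counts())
sized = df.groupby("ID")["ID"].transform("size")

# Both Series have one entry per row of df, so they align with df.index.
print(mapped.tolist() == sized.tolist())
print(mapped.index.equals(df.index))
```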
