user3046211
user3046211

Reputation: 696

Duplicate count for a column in Pandas dataframe

I have a dataframe df like below:

import pandas as pd

data = {'A': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'XYZ', 'XYZ', 'ABC', 'XYZ', 'ABC'], 'B': ['2022-02-16 14:00:31', '2022-02-16 16:11:26', '2022-02-16 17:31:26',
                                                                                           '2022-02-16 22:47:46', '2022-02-17 07:11:11', '2022-02-17 10:43:36', 
                                                                                           '2022-02-17 15:05:11', '2022-02-17 15:07:25', '2022-02-17 15:08:35', 
                                                                                           '2022-02-17 15:09:46'], 'C': [1,0,0,0,1,0,0,1,0,0]}
df = pd.DataFrame(data)
df['B'] = pd.to_datetime(df['B'])
df


     |   A   |          B           |     C      |  
     +-------+----------------------+------------+
     |  XYZ  |  2022-02-16 14:00:31 |     1      |
     |  XYZ  |  2022-02-16 16:11:26 |     0      |
     |  XYZ  |  2022-02-16 17:31:26 |     0      |
     |  XYZ  |  2022-02-16 22:47:46 |     0      |
     |  PQR  |  2022-02-17 07:11:11 |     1      |
     |  XYZ  |  2022-02-17 10:43:36 |     0      |
     |  XYZ  |  2022-02-17 15:05:11 |     0      |
     |  ABC  |  2022-02-17 15:07:25 |     1      |
     |  XYZ  |  2022-02-17 15:08:35 |     0      |
     |  ABC  |  2022-02-17 15:09:46 |     0      |
     +-------+----------------------+------------+

What I want to achieve is that I want to count the number of duplicates for each of the occurences of XYZ, PQR, ABC such that I get an output like below.


Expected Output :

     |   A   |          B           |     C      |  Count   |   
     +-------+----------------------+------------+----------+
     |  XYZ  |  2022-02-16 14:00:31 |     1      |    7     |
     |  XYZ  |  2022-02-16 16:11:26 |     0      |          |
     |  XYZ  |  2022-02-16 17:31:26 |     0      |          |
     |  XYZ  |  2022-02-16 22:47:46 |     0      |          |
     |  PQR  |  2022-02-17 07:11:11 |     1      |    1     |
     |  XYZ  |  2022-02-17 10:43:36 |     0      |          |
     |  XYZ  |  2022-02-17 15:05:11 |     0      |          |
     |  ABC  |  2022-02-17 15:07:25 |     1      |    2     |
     |  XYZ  |  2022-02-17 15:08:35 |     0      |          |
     |  ABC  |  2022-02-17 15:09:45 |     0      |          |
     +-------+----------------------+------------+----------+

Currently, I'm trying to achieve the same by using the code below but I'm unable to get expected/desired results.


one_index = df[df['C'] == 1].index
zero_index = df[df['C'] == 0].index

df.loc[0, 'Count'] = len(df)


     |   A   |          B           |     C      |  Count   |   
     +-------+----------------------+------------+----------+
     |  XYZ  |  2022-02-16 14:00:31 |     1      |    10    |
     |  XYZ  |  2022-02-16 16:11:26 |     0      |          |
     |  XYZ  |  2022-02-16 17:31:26 |     0      |          |
     |  XYZ  |  2022-02-16 22:47:46 |     0      |          |
     |  PQR  |  2022-02-17 07:11:11 |     1      |          |
     |  XYZ  |  2022-02-17 10:43:36 |     0      |          |
     |  XYZ  |  2022-02-17 15:05:11 |     0      |          |
     |  ABC  |  2022-02-17 15:07:25 |     1      |          |
     |  XYZ  |  2022-02-17 15:08:35 |     0      |          |
     |  ABC  |  2022-02-17 15:09:45 |     0      |          |
     +-------+----------------------+------------+----------+

So, how can I get count of duplicates for each of the values of column A as mentioned ?

EDIT (OPTIONAL):

I would also like to have ID assigned to the groups for the df after the count values has been assigned. So, my final dataframe should look like below after ID has been assigned:

    |   A   |          B           |     C      |  Count   |   ID  |
    +-------+----------------------+------------+----------+-------+
    |  XYZ  |  2022-02-16 14:00:31 |     1      |    7     | ABC_1 |
    |  XYZ  |  2022-02-16 16:11:26 |     0      |          |       |
    |  XYZ  |  2022-02-16 17:31:26 |     0      |          |       |
    |  XYZ  |  2022-02-16 22:47:46 |     0      |          |       |
    |  PQR  |  2022-02-17 07:11:11 |     1      |    1     | ABC_2 |
    |  XYZ  |  2022-02-17 10:43:36 |     0      |          |       | 
    |  XYZ  |  2022-02-17 15:05:11 |     0      |          |       |
    |  ABC  |  2022-02-17 15:07:25 |     1      |    2     | ABC_3 |
    |  XYZ  |  2022-02-17 15:08:35 |     0      |          |       | 
    |  ABC  |  2022-02-17 15:09:45 |     0      |          |       |
    +-------+----------------------+------------+----------+-------+

Upvotes: 1

Views: 108

Answers (2)

jezrael
jezrael

Reputation: 862661

Use GroupBy.transform with set empty strings for non 1 values in C by Series.where:

df['B'] = pd.to_datetime(df['B'])

m = df.C.eq(1)
df['Count'] = df.groupby('A')['C'].transform('size').where(m, '')
df.loc[m, 'ID'] = 'ABC_' + pd.RangeIndex(1, m.sum() + 1).astype(str)
df['ID'] = df['ID'].fillna('')
print (df)
     A                   B  C Count     ID
0  XYZ 2022-02-16 14:00:31  1     7  ABC_1
1  XYZ 2022-02-16 16:11:26  0             
2  XYZ 2022-02-16 17:31:26  0             
3  XYZ 2022-02-16 22:47:46  0             
4  PQR 2022-02-17 07:11:11  1     1  ABC_2
5  XYZ 2022-02-17 10:43:36  0             
6  XYZ 2022-02-17 15:05:11  0             
7  ABC 2022-02-17 15:07:25  1     2  ABC_3
8  XYZ 2022-02-17 15:08:35  0             
9  ABC 2022-02-17 15:09:46  0         

Or:

df['B'] = pd.to_datetime(df['B'])

m = df.C.eq(1)
df['Count'] = df.groupby('A')['C'].transform('size').where(m, '')
df['ID'] = ('ABC_' + df['C'].cumsum().astype(str)).where(m, '')

print (df)
     A                   B  C Count     ID
0  XYZ 2022-02-16 14:00:31  1     7  ABC_1
1  XYZ 2022-02-16 16:11:26  0             
2  XYZ 2022-02-16 17:31:26  0             
3  XYZ 2022-02-16 22:47:46  0             
4  PQR 2022-02-17 07:11:11  1     1  ABC_2
5  XYZ 2022-02-17 10:43:36  0             
6  XYZ 2022-02-17 15:05:11  0             
7  ABC 2022-02-17 15:07:25  1     2  ABC_3
8  XYZ 2022-02-17 15:08:35  0             
9  ABC 2022-02-17 15:09:46  0                 

Upvotes: 1

mozway
mozway

Reputation: 260640

Use value_counts and map with boolean indexing:

df['Count'] = df.loc[df['C'].eq(1), 'A'].map(df['A'].value_counts())

output:

     A                   B  C  Count
0  XYZ 2022-02-16 14:00:31  1    7.0
1  XYZ 2022-02-16 16:11:26  0    NaN
2  XYZ 2022-02-16 17:31:26  0    NaN
3  XYZ 2022-02-16 22:47:46  0    NaN
4  PQR 2022-02-17 07:11:11  1    1.0
5  XYZ 2022-02-17 10:43:36  0    NaN
6  XYZ 2022-02-17 15:05:11  0    NaN
7  ABC 2022-02-17 15:07:25  1    2.0
8  XYZ 2022-02-17 15:08:35  0    NaN
9  ABC 2022-02-17 15:09:46  0    NaN

Upvotes: 1

Related Questions