SantoshGupta7
SantoshGupta7

Reputation: 6197

Pandas: How to get a row count by the value of a particular column value, and add the count as another column.

Say that my data looks like this, where all the data is sorted by the value in column b

a   b
1   32
4   32
5   32
9   45
8   45
3   76
5   76
7   76
9   76

What would be the most efficient way to have the first row that contains a particular column value, be the total number of times that the particular column value occurs. For the rest of the rows with that same column value, I want them to be a different value (string, -1, nan, etc, but not a positive integer) . In the example below, I use '-1' as that different value

a   b   count b
1   32  3
4   32  -1
5   32  -1
9   45  2
8   45  -1
3   76  4
5   76  -1
7   76  -1
9   76  -1

So in the table above, the first row where column b is 32 will have a 'Count b' value of 3, and then the rest of the rows where column is 32 will have a 'Count b' value of -1.

Upvotes: 1

Views: 89

Answers (3)

BENY
BENY

Reputation: 323226

Using value_count with fillna

df['New']=df.b.drop_duplicates().map(df.b.value_counts())
df.New.fillna(-1,inplace=True)
df.New=df.New.astype(int)
df
Out[197]: 
   a   b  New
0  1  32    3
1  4  32   -1
2  5  32   -1
3  9  45    2
4  8  45   -1
5  3  76    4
6  5  76   -1
7  7  76   -1
8  9  76   -1

Upvotes: 1

jpp
jpp

Reputation: 164613

Using groupby.count + pd.Series.duplicated:

df['count_b'] = df.groupby('b').transform('count')
df.loc[df['b'].duplicated(), 'count_b'] = -1

print(df)

   a   b  count_b
0  1  32        3
1  4  32       -1
2  5  32       -1
3  9  45        2
4  8  45       -1
5  3  76        4
6  5  76       -1
7  7  76       -1
8  9  76       -1

You can, if you wish, combine the two steps with numpy.where:

import numpy as np

df['count_b'] = np.where(df['b'].duplicated(), -1,
                         df.groupby('b')['b'].transform(len))

Upvotes: 2

DYZ
DYZ

Reputation: 57033

Start by computing the span lengths:

df = df.merge(df.groupby('b').size().reset_index())
#   a   b  0
#0  1  32  3
#1  4  32  3
#2  5  32  3
#3  9  45  2
#4  8  45  2
#5  3  76  4
#6  5  76  4
#7  7  76  4
#8  9  76  4

Replace the duplicate lengths in each span with -1:

df.loc[df.duplicated(subset=['b',0]), 0] = -1

#   a   b  0
#0  1  32  3
#1  4  32 -1
#2  5  32 -1
#3  9  45  2
#4  8  45 -1
#5  3  76  4
#6  5  76 -1
#7  7  76 -1
#8  9  76 -1

Upvotes: 2

Related Questions