Reputation: 6197
Say my data looks like this, where all rows are sorted by the value in column b:
a b
1 32
4 32
5 32
9 45
8 45
3 76
5 76
7 76
9 76
What would be the most efficient way to make the first row that contains a particular column-b value hold the total number of times that value occurs? For the rest of the rows with that same value, I want a different marker (a string, -1, NaN, etc., but not a positive integer). In the example below, I use -1 as that marker.
a b Count b
1 32 3
4 32 -1
5 32 -1
9 45 2
8 45 -1
3 76 4
5 76 -1
7 76 -1
9 76 -1
So in the table above, the first row where column b is 32 will have a 'Count b' value of 3, and the rest of the rows where column b is 32 will have a 'Count b' value of -1.
Upvotes: 1
Views: 89
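For reference, the example frame can be built like this (column names taken from the table above):

```python
import pandas as pd

# Sample data from the question, already sorted by column b
df = pd.DataFrame({'a': [1, 4, 5, 9, 8, 3, 5, 7, 9],
                   'b': [32, 32, 32, 45, 45, 76, 76, 76, 76]})
```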
Reputation: 323226
Using value_counts with fillna:
# Map each first occurrence of b to its total count; duplicate rows become NaN
df['New'] = df.b.drop_duplicates().map(df.b.value_counts())
# Replace the NaNs on the duplicate rows with -1
df.New.fillna(-1, inplace=True)
df.New = df.New.astype(int)
df
Out[197]:
a b New
0 1 32 3
1 4 32 -1
2 5 32 -1
3 9 45 2
4 8 45 -1
5 3 76 4
6 5 76 -1
7 7 76 -1
8 9 76 -1
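The same idea can be written as one chained expression (a sketch; `reindex` fills the non-first rows with -1 in place of the separate `fillna` step):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 5, 9, 8, 3, 5, 7, 9],
                   'b': [32, 32, 32, 45, 45, 76, 76, 76, 76]})

# Map each first occurrence of b to its total count, then reindex back to
# the full index, filling the duplicate rows with -1.
df['New'] = (df['b'].drop_duplicates()
                    .map(df['b'].value_counts())
                    .reindex(df.index, fill_value=-1)
                    .astype(int))
```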
Upvotes: 1
Reputation: 164613
Using groupby.count + pd.Series.duplicated:
# Count of each b group, broadcast to every row (select the column so a
# Series, not a DataFrame, is assigned)
df['count_b'] = df.groupby('b')['b'].transform('count')
# Overwrite the duplicate rows of each group with -1
df.loc[df['b'].duplicated(), 'count_b'] = -1
print(df)
a b count_b
0 1 32 3
1 4 32 -1
2 5 32 -1
3 9 45 2
4 8 45 -1
5 3 76 4
6 5 76 -1
7 7 76 -1
8 9 76 -1
You can, if you wish, combine the two steps with numpy.where:
import numpy as np
df['count_b'] = np.where(df['b'].duplicated(), -1,
                         df.groupby('b')['b'].transform(len))
Upvotes: 2
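An equivalent one-liner (a sketch, not from the answer above) uses `transform('size')` with `Series.mask` instead of the `.loc` assignment:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 5, 9, 8, 3, 5, 7, 9],
                   'b': [32, 32, 32, 45, 45, 76, 76, 76, 76]})

# Group size for every row, then mask the duplicate rows with -1
df['count_b'] = (df.groupby('b')['b'].transform('size')
                   .mask(df['b'].duplicated(), -1))
```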
Reputation: 57033
Start by computing the span lengths:
df = df.merge(df.groupby('b').size().reset_index())
# a b 0
#0 1 32 3
#1 4 32 3
#2 5 32 3
#3 9 45 2
#4 8 45 2
#5 3 76 4
#6 5 76 4
#7 7 76 4
#8 9 76 4
Replace the duplicate lengths in each span with -1:
df.loc[df.duplicated(subset=['b',0]), 0] = -1
# a b 0
#0 1 32 3
#1 4 32 -1
#2 5 32 -1
#3 9 45 2
#4 8 45 -1
#5 3 76 4
#6 5 76 -1
#7 7 76 -1
#8 9 76 -1
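Since the merged count column is literally named `0`, you may want to rename it afterwards. A self-contained sketch of the two steps above, plus a rename:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 5, 9, 8, 3, 5, 7, 9],
                   'b': [32, 32, 32, 45, 45, 76, 76, 76, 76]})

# Merge in the span lengths; the size column arrives named 0
df = df.merge(df.groupby('b').size().reset_index())
# Keep the count only on the first row of each span
df.loc[df.duplicated(subset=['b', 0]), 0] = -1
# Give the count column a proper name
df = df.rename(columns={0: 'count_b'})
```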
Upvotes: 2