How can I increase a counter based on a condition for a Pandas dataframe?

Question

I have a bunch of records, each labeled with a cluster value.

Original dataframe, df:

+-------------+---------+
| measurement | cluster |
+-------------+---------+
| M1          |       6 |
| M2          |       6 |
| M3          |       6 |
| M4          |      12 |
| M5          |      12 |
| M6          |      12 |
| M7          |       2 |
| M8          |       9 |
| M9          |       9 |
| M10         |       9 |
| M11         |       9 |
+-------------+---------+

How can I renumber the clusters so that each group of consecutive rows sharing the same cluster value gets a new sequential number, while rows whose cluster value matches neither the previous nor the next row are assigned "x"?

Desired df:

+-------------+---------+-------------+
| measurement | cluster | new_cluster |
+-------------+---------+-------------+
| M1          |       6 |           1 |
| M2          |       6 |           1 |
| M3          |       6 |           1 |
| M4          |      12 |           2 |
| M5          |      12 |           2 |
| M6          |      12 |           2 |
| M7          |       2 |           x |
| M8          |       9 |           3 |
| M9          |       9 |           3 |
| M10         |       9 |           3 |
| M11         |       9 |           3 |
+-------------+---------+-------------+

jezrael · Accepted Answer

Use pandas.factorize on the rows selected by a boolean mask:

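# True for rows in a run of two or more consecutive equal cluster values
m = df['cluster'].ne(df['cluster'].shift()).cumsum().duplicated(keep=False)
# Number the kept cluster values 1, 2, 3, ... in order of first appearance
df.loc[m, 'new_cluster'] = pd.factorize(df.loc[m, 'cluster'])[0] + 1
print(df)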
   measurement  cluster  new_cluster
0           M1        6          1.0
1           M2        6          1.0
2           M3        6          1.0
3           M4       12          2.0
4           M5       12          2.0
5           M6       12          2.0
6           M7        2          NaN
7           M8        9          3.0
8           M9        9          3.0
9          M10        9          3.0
10         M11        9          3.0
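
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The frame is rebuilt from the table in the question, and the intermediate name run_id is introduced here only for readability; it is not part of the original answer.

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    "measurement": [f"M{i}" for i in range(1, 12)],
    "cluster": [6, 6, 6, 12, 12, 12, 2, 9, 9, 9, 9],
})

# Run id: increases by 1 whenever a value differs from the previous row
run_id = df["cluster"].ne(df["cluster"].shift()).cumsum()

# Keep only rows whose run id occurs more than once (runs of length >= 2)
m = run_id.duplicated(keep=False)

# factorize returns (codes, uniques); codes start at 0, so add 1.
# Rows outside the mask are left as NaN, which makes the column float.
df.loc[m, "new_cluster"] = pd.factorize(df.loc[m, "cluster"])[0] + 1
```

Row M7 is the only one left as NaN, because its run has length 1.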

To replace NaN with x:

df['new_cluster'] = df['new_cluster'].fillna('x')
print(df)
   measurement  cluster new_cluster
0           M1        6           1
1           M2        6           1
2           M3        6           1
3           M4       12           2
4           M5       12           2
5           M6       12           2
6           M7        2           x
7           M8        9           3
8           M9        9           3
9          M10        9           3
10         M11        9           3
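
Because the unmatched rows are NaN, new_cluster is a float column before the fill, and depending on the pandas version the labels may still display as 1.0, 2.0, 3.0 after fillna('x'). One possible way around that, sketched here as an alternative rather than as part of the original answer, is to start from an object Series filled with "x" and write the integer codes into it. The name labels is purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "measurement": [f"M{i}" for i in range(1, 12)],
    "cluster": [6, 6, 6, 12, 12, 12, 2, 9, 9, 9, 9],
})

# Same mask as in the answer: rows in runs of length >= 2
m = df["cluster"].ne(df["cluster"].shift()).cumsum().duplicated(keep=False)

# Start with "x" everywhere in an object Series, so no float NaN is involved
labels = pd.Series("x", index=df.index, dtype=object)

# Overwrite the masked rows with integer codes 1, 2, 3, ...
labels.loc[m] = pd.factorize(df.loc[m, "cluster"])[0] + 1

df["new_cluster"] = labels
```

The resulting column mixes integers and the string "x", so it has object dtype and is not suitable for numeric operations without further conversion.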

Details of the boolean mask: first create a helper Series that numbers each run of consecutive equal values, then flag every row whose run number occurs more than once using duplicated with keep=False, which marks all duplicates:

print(df['cluster'].ne(df['cluster'].shift()).cumsum())
0     1
1     1
2     1
3     2
4     2
5     2
6     3
7     4
8     4
9     4
10    4
Name: cluster, dtype: int32

print(m)
0      True
1      True
2      True
3      True
4      True
5      True
6     False
7      True
8      True
9      True
10     True
Name: cluster, dtype: bool
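
One thing to keep in mind: factorize numbers by cluster value, so two separate runs that share a value receive the same new number. That does not matter for the data in the question, where each value forms a single run. If every run should get its own number instead (an assumption about the intent, since the question does not cover this case), a possible variant is to factorize the helper Series rather than the cluster column. The data and the names run_id, by_value and by_run below are hypothetical.

```python
import pandas as pd

# Hypothetical data: the value 6 forms two separate runs
s = pd.Series([6, 6, 2, 6, 6, 9, 9], name="cluster")

# Helper Series numbering the runs: 1 1 2 3 3 4 4
run_id = s.ne(s.shift()).cumsum()

# Rows in runs of length >= 2: T T F T T T T
m = run_id.duplicated(keep=False)

# Numbering by value, as in the answer: both runs of 6 share one number
by_value = pd.factorize(s[m])[0] + 1      # 1 1 1 1 2 2

# Numbering by run: each run gets its own number
by_run = pd.factorize(run_id[m])[0] + 1   # 1 1 2 2 3 3
```

Either array can be assigned with df.loc[m, 'new_cluster'] in the same way as in the answer.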
