dnaeye

Reputation: 327

How can I increase a counter based on a condition for a Pandas dataframe?

I have a bunch of records, each labeled with a cluster value.

Original dataframe, df:

+-------------+---------+
| measurement | cluster |
+-------------+---------+
| M1          |       6 |
| M2          |       6 |
| M3          |       6 |
| M4          |      12 |
| M5          |      12 |
| M6          |      12 |
| M7          |       2 |
| M8          |       9 |
| M9          |       9 |
| M10         |       9 |
| M11         |       9 |
+-------------+---------+

How can I renumber the clusters sequentially (1, 2, 3, ...) so that every row whose cluster value equals that of the prior or next row gets the counter value of its group, while rows whose cluster value matches neither neighbor are assigned "x"?

Desired df:

+-------------+---------+-------------+
| measurement | cluster | new_cluster |
+-------------+---------+-------------+
| M1          |       6 |           1 |
| M2          |       6 |           1 |
| M3          |       6 |           1 |
| M4          |      12 |           2 |
| M5          |      12 |           2 |
| M6          |      12 |           2 |
| M7          |       2 |           x |
| M8          |       9 |           3 |
| M9          |       9 |           3 |
| M10         |       9 |           3 |
| M11         |       9 |           3 |
+-------------+---------+-------------+

Upvotes: 2

Views: 770

Answers (2)

piRSquared

Reputation: 294536

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    measurement=[f"M{i}" for i in range(1, 12)],
    cluster=[*map(int, '6 6 6 12 12 12 2 9 9 9 9'.split())]
))

Solution

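c = df.cluster.values
# True where an element equals the one right after it
m0 = c[:-1] == c[1:]

# Test if equal to prior or next
mask = np.append(False, m0) | np.append(m0, False)

# Number the kept values 0, 1, 2, ... in order of first appearance
f, u = pd.factorize(c[mask])
# Start with 'x' everywhere, then overwrite the masked rows with 1-based labels
new = np.array(list('x' * len(c)), dtype=object)
new[mask] = f + 1

df.assign(new_cluster=new)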

   measurement  cluster new_cluster
0           M1        6           1
1           M2        6           1
2           M3        6           1
3           M4       12           2
4           M5       12           2
5           M6       12           2
6           M7        2           x
7           M8        9           3
8           M9        9           3
9          M10        9           3
10         M11        9           3
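
The "equal to prior or next" test can also be written with pandas alone. The following is a minimal sketch, not part of the original answer: it assumes the same sample data, and the names `s` and `labels` are only illustrative. `shift()` introduces NaN at the ends, and a comparison with NaN is False, which plays the role of the `False` values appended above.

```python
import pandas as pd

df = pd.DataFrame({
    "measurement": [f"M{i}" for i in range(1, 12)],
    "cluster": [6, 6, 6, 12, 12, 12, 2, 9, 9, 9, 9],
})

s = df["cluster"]
# True where the value equals the previous row or the next row
mask = s.eq(s.shift()) | s.eq(s.shift(-1))

# Start with 'x' everywhere, then fill the masked rows with 1-based codes
labels = pd.Series("x", index=df.index, dtype=object)
labels.loc[mask] = pd.factorize(s[mask])[0] + 1

result = df.assign(new_cluster=labels)
```

Only the row for M7 keeps "x" here, because 2 differs from both of its neighbors.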

Upvotes: 2

jezrael

Reputation: 863691

Use pandas.factorize on the values selected by a boolean mask:

m = df['cluster'].ne(df['cluster'].shift()).cumsum().duplicated(keep=False)
df.loc[m, 'new_cluster'] = pd.factorize(df.loc[m, 'cluster'])[0] + 1
print (df)
   measurement  cluster  new_cluster
0           M1        6          1.0
1           M2        6          1.0
2           M3        6          1.0
3           M4       12          2.0
4           M5       12          2.0
5           M6       12          2.0
6           M7        2          NaN
7           M8        9          3.0
8           M9        9          3.0
9          M10        9          3.0
10         M11        9          3.0

If you want to replace NaN with x:

df['new_cluster'] = df['new_cluster'].fillna('x')
print (df)
   measurement  cluster new_cluster
0           M1        6           1
1           M2        6           1
2           M3        6           1
3           M4       12           2
4           M5       12           2
5           M6       12           2
6           M7        2           x
7           M8        9           3
8           M9        9           3
9          M10        9           3
10         M11        9           3
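
The intermediate column holds floats (1.0, 2.0, ...) because NaN forces a float dtype. If whole numbers are needed, one option is to go through the nullable `Int64` dtype before placing the "x" marker. This is a sketch under that assumption, not part of the original answer; the names `as_int` and `with_x` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"cluster": [6, 6, 6, 12, 12, 12, 2, 9, 9, 9, 9]})

m = df["cluster"].ne(df["cluster"].shift()).cumsum().duplicated(keep=False)
df.loc[m, "new_cluster"] = pd.factorize(df.loc[m, "cluster"])[0] + 1

# Nullable integer column: whole numbers, with <NA> for the unmatched row
as_int = df["new_cluster"].astype("Int64")

# Object column: integers where a label exists, 'x' elsewhere
with_x = as_int.astype(object).where(as_int.notna(), "x")
```

Keeping the `Int64` version is usually the better choice if the column is used for further numeric work, since mixing integers and strings yields an object column.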

Details of the boolean mask: first create a helper Series that numbers each run of consecutive equal values, then build the mask with duplicated and keep=False, which flags every member of a run longer than one row:

print (df['cluster'].ne(df['cluster'].shift()).cumsum())
0     1
1     1
2     1
3     2
4     2
5     2
6     3
7     4
8     4
9     4
10    4
Name: cluster, dtype: int32

print (m)
0      True
1      True
2      True
3      True
4      True
5      True
6     False
7      True
8      True
9      True
10     True
Name: cluster, dtype: bool
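
One caveat: factorizing the cluster values gives the same number to a value that reappears in a later, separate run. If every run should get its own number instead, the helper Series of run ids can be factorized. The sketch below uses made-up data (not from the question) in which 6 occurs in two separate runs; `run_id`, `by_value` and `by_run` are illustrative names.

```python
import pandas as pd

# Hypothetical data: the value 6 appears in two separate runs
s = pd.Series([6, 6, 9, 9, 2, 6, 6], name="cluster")

run_id = s.ne(s.shift()).cumsum()   # 1 1 2 2 3 4 4
m = run_id.duplicated(keep=False)   # False only for the lone 2

# Codes by value: the second run of 6 reuses label 1
by_value = pd.factorize(s[m])[0] + 1

# Codes by run: each run gets a fresh label
by_run = pd.factorize(run_id[m])[0] + 1
```

With the data from the question both variants give the same result, since no cluster value repeats in a later run.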

Upvotes: 4
