Reputation: 327
I have a bunch of records, each labeled with a cluster value.
Original dataframe, df:
+-------------+---------+
| measurement | cluster |
+-------------+---------+
| M1 | 6 |
| M2 | 6 |
| M3 | 6 |
| M4 | 12 |
| M5 | 12 |
| M6 | 12 |
| M7 | 2 |
| M8 | 9 |
| M9 | 9 |
| M10 | 9 |
| M11 | 9 |
+-------------+---------+
How can I renumber the clusters sequentially (1, 2, 3, ...) for rows whose cluster value equals that of the prior or next row, while assigning "x" to rows whose cluster value matches neither neighbor?
Desired df:
+-------------+---------+-------------+
| measurement | cluster | new_cluster |
+-------------+---------+-------------+
| M1 | 6 | 1 |
| M2 | 6 | 1 |
| M3 | 6 | 1 |
| M4 | 12 | 2 |
| M5 | 12 | 2 |
| M6 | 12 | 2 |
| M7 | 2 | x |
| M8 | 9 | 3 |
| M9 | 9 | 3 |
| M10 | 9 | 3 |
| M11 | 9 | 3 |
+-------------+---------+-------------+
Upvotes: 2
Views: 770
Reputation: 294536
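import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    measurement=[f"M{i}" for i in range(1, 12)],
    cluster=[*map(int, '6 6 6 12 12 12 2 9 9 9 9'.split())]
))
c = df.cluster.values
# True where an element equals the one right after it
m0 = c[:-1] == c[1:]
# Test if equal to prior or next
mask = np.append(False, m0) | np.append(m0, False)
# Number the remaining cluster values in order of first appearance
f, u = pd.factorize(c[mask])
# Start with 'x' everywhere, then overwrite the masked positions
new = np.array(list('x' * len(c)), dtype=object)
new[mask] = f + 1
df.assign(new_cluster=new)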
measurement cluster new_cluster
0 M1 6 1
1 M2 6 1
2 M3 6 1
3 M4 12 2
4 M5 12 2
5 M6 12 2
6 M7 2 x
7 M8 9 3
8 M9 9 3
9 M10 9 3
10 M11 9 3
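To see how the mask in the snippet above is put together, here is a minimal sketch on a shorter array. The helper name neighbor_mask and the sample values are made up for illustration, and the empty-input guard is an addition that the original code does not have.

```python
import numpy as np

def neighbor_mask(values):
    """Return True where an element equals the previous or the next one.

    Hypothetical helper, not part of the answer above.
    """
    c = np.asarray(values)
    if c.size == 0:
        # Without this guard np.append would produce a length-1 mask
        return np.zeros(0, dtype=bool)
    same_as_next = c[:-1] == c[1:]  # length n - 1
    # Shift the comparison both ways so each row sees both neighbors
    return np.append(False, same_as_next) | np.append(same_as_next, False)

mask = neighbor_mask([6, 6, 12, 2, 9, 9])
print(mask.tolist())  # [True, True, False, False, True, True]
```

The two np.append calls pad the pairwise comparison on opposite ends, so position i is True if it matches position i - 1 or position i + 1.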
Upvotes: 2
Reputation: 863691
Use pandas.factorize on the values selected by a boolean mask:
m = df['cluster'].ne(df['cluster'].shift()).cumsum().duplicated(keep=False)
df.loc[m, 'new_cluster'] = pd.factorize(df.loc[m, 'cluster'])[0] + 1
print (df)
measurement cluster new_cluster
0 M1 6 1.0
1 M2 6 1.0
2 M3 6 1.0
3 M4 12 2.0
4 M5 12 2.0
5 M6 12 2.0
6 M7 2 NaN
7 M8 9 3.0
8 M9 9 3.0
9 M10 9 3.0
10 M11 9 3.0
If you want to replace NaN with x:
df['new_cluster'] = df['new_cluster'].fillna('x')
print (df)
measurement cluster new_cluster
0 M1 6 1
1 M2 6 1
2 M3 6 1
3 M4 12 2
4 M5 12 2
5 M6 12 2
6 M7 2 x
7 M8 9 3
8 M9 9 3
9 M10 9 3
10 M11 9 3
Details of the boolean mask: first create a helper Series that numbers each run of consecutive equal values, then flag rows with duplicated, passing keep=False so that every member of a repeated run is marked:
print (df['cluster'].ne(df['cluster'].shift()).cumsum())
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 4
8 4
9 4
10 4
Name: cluster, dtype: int32
print (m)
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 True
8 True
9 True
10 True
Name: cluster, dtype: bool
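The steps above can be combined into one function. This is only a sketch: the name label_runs and its col and filler parameters are invented here, and it builds the result as an object array up front so the numbers stay integers instead of going through NaN and floats.

```python
import numpy as np
import pandas as pd

def label_runs(df, col="cluster", filler="x"):
    """Hypothetical wrapper around the mask + factorize approach."""
    # Number each run of consecutive equal values: 1, 1, 1, 2, 2, ...
    run_id = df[col].ne(df[col].shift()).cumsum()
    # keep=False marks every row of a run longer than one
    m = run_id.duplicated(keep=False).to_numpy()
    # Start with the filler everywhere, then overwrite the masked rows
    new = np.full(len(df), filler, dtype=object)
    new[m] = pd.factorize(df.loc[m, col])[0] + 1
    return df.assign(new_cluster=new)

df = pd.DataFrame({
    "measurement": [f"M{i}" for i in range(1, 12)],
    "cluster": [6, 6, 6, 12, 12, 12, 2, 9, 9, 9, 9],
})
print(label_runs(df))
```

As in the code above, factorize works on the cluster values themselves, so a value that reappears in a later run would receive the same number as its first run.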
Upvotes: 4