Reputation: 368
I'm aiming to cluster xy points based on their proximity. Specifically, grouping points that are positioned closely to each other. I'm also hoping to use a separate reference point to cluster the data from.
Note: I have multiple sets of data that need to be clustered independently. In the example below, each unique value in Item signifies a different set of data. I could have many such sets, all varying in sparsity. Therefore, any technique that requires a predetermined number of clusters isn't realistic, as I'd have to manually check the fit and adjust the parameter every time.
As such, the best method thus far has been some form of density clustering (DBSCAN, OPTICS).
However, while I want to cluster points that are close together, I'm hoping to apply some cut-off that keeps the intended cluster roughly spherical. On the other hand, I don't want to shrink the reachable area too much, because then a small gap discards points that are close to the reference point and the core points, which I'm hoping to include.
The two items below display the dilemma. Item 1 shows how the reachable area should be smaller to ensure the cluster of points around the reference point is spherical, while Item 2 shows how the reachable area needs to be larger so that points within the dense area are included.
I'm hoping I can adjust a parameter or include a separate feature rather than force it. Because the dense area around the reference point can vary, I'm reluctant to force every point outside a specific radius to be excluded.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
import seaborn as sns
from sklearn.cluster import OPTICS
fig, ax = plt.subplots(figsize = (6,6))
ax.grid(False)
df = pd.DataFrame({
'Item' : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],
'x' : [-4.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,10.0,-2.0,2.0,5.0,7.5,15.0,0.0,-22.0,-20.0,-20.0,-6.5,20.5,0.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0,-2.0,0.0,3.0,-3.0,-7.0,-7.5,-9.0,-4.0,1.5,-1.0,-5.0,-4.5,-3.7,15.0,-20.0,-22.0,-20.0,-20.0,-12.0,20.5,6.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0],
'y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,0.0,0.0,-2.0,-2.0,-7.0,-0.5,-10.5,-7.5,0.0,16.0,-15.0,5.0,13.5,3.0,-20.0,2.0,-17.5,-15,19.0,20.0,4.0,-2.0,0.0,0.0,2.5,2.0,-1.5,5.0,0.0,3.5,2.0,-5.5,-6.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,6.0,-20.0,2.0,-17.5,-15,19.0,20.0],
'X_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0],
'Y_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0],
})
# not spherical
df = df[df['Item'] == 1].copy()  # copy so the new columns below are not set on a slice
# spherical but reachable area too small
#df = df[df['Item'] == 2]
df['distance'] = np.sqrt((df['X_Ref'] - df['x'])**2 + (df['Y_Ref'] - df['y'])**2)
Y_sklearn = df[['x','y']].values
ax.scatter(df['x'], df['y'], marker = 'o', s = 5)
ax.scatter(df['X_Ref'], df['Y_Ref'], c = 'w', edgecolor = 'k', marker = 'o', s = 7.5, zorder = 2)
#clusterer = DBSCAN(eps = 7.5, min_samples = 3)
#labels_clusters = clusterer.fit_predict(Y_sklearn)
clusterer = OPTICS(min_samples = 2, xi = 0.25, min_cluster_size = 0.25, max_eps = 5)
# fit_predict fits the model and returns the labels in one call
labels_clusters = clusterer.fit_predict(Y_sklearn)
#Add cluster labels as a new column to original DataFrame.
df['cluster'] = labels_clusters
df['cluster'] = df['cluster'].astype('category')
sns.scatterplot(data = df,
x = 'x',
y = 'y',
hue = 'cluster',
ax = ax,
legend = 'full',
)
Item 1: points to the right of radius should be excluded from core points
Item 2: points within radius should be included in core points
Upvotes: 0
Views: 869
Reputation: 2816
I believe we could reformulate the problem. I am not sure a clustering approach is the best fit here.
"""
https://stackoverflow.com/questions/66099958/density-clustering-around-a-separate-point-python
"""
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans
# not spherical
df = pd.DataFrame({
'x' : [-4.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,12.0,-2.0,2.0,8.0,8.5,15.0,-20.0,-22.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0],
'y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,0.0,0.0,-2.0,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,3.0,-20.0,2.0,-17.5,-15,19.0,20.0],
'X_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
'Y_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
})
# spherical but reachable area too small
df1 = pd.DataFrame({
'x' : [-2.0,0.0,2.0,-3.0,-7.0,-7.5,-9.0,-4.0,1.5,-1.0,-5.0,-4.5,-3.7,15.0,-20.0,-22.0,-20.0,-20.0,-15.0,20.5,8.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0],
'y' : [4.0,-2.0,0.0,0.0,2.5,2.0,-2.0,5.0,0.0,3.5,2.0,-5.5,-6.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,5.0,-20.0,2.0,-17.5,-15,19.0,20.0],
'X_Ref' : [-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0],
'Y_Ref' : [-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0],
})
# Distance from each point to its reference point
def distance_func(df):
    return np.sqrt((df['X_Ref'] - df['x']) ** 2 + (df['Y_Ref'] - df['y']) ** 2)
df['distance'] = distance_func(df)
df1['distance'] = distance_func(df1)
# Change this for the graphs
df = df1.copy()
Y_sklearn = df['distance'].values.reshape(-1, 1)
fig, ax = plt.subplots(figsize = (6,6))
ax.grid(False)
ax.scatter(df['x'], df['y'], marker = 'o', s = 5)
ax.scatter(df['X_Ref'], df['Y_Ref'], c = 'w', edgecolor = 'k', marker = 'o', s = 7.5, zorder = 2)
clusterer = KMeans(init='k-means++', n_clusters=2, n_init=10)
# fit_predict fits the model and returns the labels in one call
labels_clusters = clusterer.fit_predict(Y_sklearn)
#Add cluster labels as a new column to original DataFrame.
df['cluster'] = labels_clusters
df['cluster'] = df['cluster'].astype('category')
sns.scatterplot(data = df,
x = 'x',
y = 'y',
hue = 'cluster',
ax = ax,
legend = 'full',
)
As mentioned earlier, I believe the problem could be reformulated using the idea of marginal area: each point we add increases the area under consideration by a different amount.
In other words, use the elbow method across the points, sorted by their distance from the reference point.
As a proxy for the area, I will just use the distance to the power of two.
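A minimal sketch of the marginal-area idea, assuming Euclidean distance to a single reference point and squared distance as the area proxy (the disc area is pi * d**2, so only the constant factor is dropped). The function name marginal_area and the sample points are made up for illustration and are not part of the code that follows.

```python
import numpy as np

def marginal_area(points, ref):
    """Sort points by squared distance to ref and report what each one adds."""
    points = np.asarray(points, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # Squared Euclidean distance of every point to the reference point.
    d2 = np.sum((points - ref) ** 2, axis=1)
    order = np.argsort(d2)
    d2_sorted = d2[order]
    # Increase in the area proxy when the next-closest point is included;
    # the first point's increase is its own squared distance.
    marginal = np.diff(d2_sorted, prepend=0.0)
    return order, d2_sorted, marginal

# Illustrative points only: three near the reference, one far away.
pts = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (0.0, -10.0)]
order, d2_sorted, marginal = marginal_area(pts, ref=(0.0, 0.0))
print(d2_sorted)
print(marginal)
```

A sudden jump in the marginal values marks the point where including one more neighbour starts to cost far more area than the previous ones did.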
Code:
"""
https://stackoverflow.com/questions/66099958/density-clustering-around-a-separate-point-python
"""
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# not spherical
df = pd.DataFrame({
'x' : [-4.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,12.0,-2.0,2.0,8.0,8.5,15.0,-20.0,-22.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0],
'y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,0.0,0.0,-2.0,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,3.0,-20.0,2.0,-17.5,-15,19.0,20.0],
'X_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
'Y_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
})
# spherical but reachable area too small
df1 = pd.DataFrame({
'x' : [-2.0,0.0,2.0,-3.0,-7.0,-7.5,-9.0,-4.0,1.5,-1.0,-5.0,-4.5,-3.7,15.0,-20.0,-22.0,-20.0,-20.0,-15.0,20.5,8.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0],
'y' : [4.0,-2.0,0.0,0.0,2.5,2.0,-2.0,5.0,0.0,3.5,2.0,-5.5,-6.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,5.0,-20.0,2.0,-17.5,-15,19.0,20.0],
'X_Ref' : [-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0],
'Y_Ref' : [-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0],
})
# Distance from each point to its reference point
def distance_func(df):
    return np.sqrt((df['X_Ref'] - df['x']) ** 2 + (df['Y_Ref'] - df['y']) ** 2)
df['distance'] = distance_func(df)
df1['distance'] = distance_func(df1)
# To switch from one dataset to the other, uncomment the next line.
#df = df1.copy()
df['distance_2'] = df['distance']**2
df.sort_values('distance',inplace=True)
#pd.DataFrame(df['marginal_change'].values).plot()
aux = pd.DataFrame(df['distance_2'].values, columns=['distance ** 2'])
aux.plot()
fig, ax = plt.subplots(figsize = (6,6))
ax.grid(False)
ax.scatter(df['x'], df['y'], marker = 'o', s = 5)
ax.scatter(df['X_Ref'], df['Y_Ref'], c = 'w', edgecolor = 'k', marker = 'o', s = 7.5, zorder = 2)
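# Number of closest points to keep, read off the scree plot by eye
selected_top = 10
labels_clusters = np.zeros(df.shape[0])
# df is sorted by distance, so the first selected_top rows are the closest points
labels_clusters[0:selected_top] = 1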
#Add cluster labels as a new column to original DataFrame.
df['cluster'] = labels_clusters
df['cluster'] = df['cluster'].astype('category')
sns.scatterplot(data = df,
x = 'x',
y = 'y',
hue = 'cluster',
ax = ax,
legend = 'full',
)
Scree plot
From the scree plot you can see where the number of points becomes too large. I would say that selecting 10 points could be good. The selection is based on the elbow method.
Final plot:
Following the elbow method criteria, 13 points could be optimal.
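The cut-offs above were read off the plot by eye. As a rough sketch of how the elbow could be picked automatically for each dataset, one common heuristic takes the point on the sorted squared-distance curve that lies farthest below the straight line joining its first and last values. This is an assumption about how to automate the step, not the procedure used to obtain the numbers above; elbow_index and the sample distances are hypothetical.

```python
import numpy as np

def elbow_index(values):
    """Index of the elbow on an ascending curve (max gap below the chord)."""
    y = np.asarray(values, dtype=float)
    n = len(y)
    if n < 3:
        # Too few points to have an elbow; keep them all.
        return n - 1
    # Normalise both axes to [0, 1] so the result does not depend on units.
    xn = np.arange(n, dtype=float) / (n - 1)
    span = y[-1] - y[0]
    yn = (y - y[0]) / span if span > 0 else np.zeros(n)
    # In normalised coordinates the chord is y = x; the gap measures how far
    # the curve sits below it at each position.
    gap = xn - yn
    return int(np.argmax(gap))

# Illustrative distances: five points near the reference, three far away.
dist = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 15.0, 20.0, 25.0])
d2_sorted = np.sort(dist ** 2)
selected_top = elbow_index(d2_sorted) + 1  # number of closest points to keep
print(selected_top)
```

The resulting selected_top could replace the hard-coded value in the code above, which would remove the manual check per dataset, though the heuristic should still be validated against plots for a few sets.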
Upvotes: 1