Density clustering around a separate point - Python

Question

I'm aiming to cluster xy points based on their proximity, that is, to group points that lie close to each other. I'd also like to cluster the data around a separate reference point.

Note: I have multiple sets of data that need to be clustered independently. In the example below, each unique value in Item denotes a different set of data. These sets can vary widely in sparsity, so any technique that requires a predetermined number of clusters isn't realistic: I'd have to check the fit manually and adjust the parameter every time.

As such, the best method thus far has been some form of density clustering (DBSCAN, OPTICS).

However, while I want to cluster points that are close together, I'd also like to apply some cut-off that keeps the intended cluster roughly spherical. On the other hand, I don't want to reduce the reachable area too much, because I then miss points that are close to the reference point and the core points but are discarded because of a small gap.

The two items below illustrate the dilemma. Item 1 shows that the reachable area should be smaller so that the cluster around the reference point stays spherical, while Item 2 shows that the reachable area needs to be larger so that points within the dense area are included.

I'm hoping I can adjust a parameter or include a separate feature rather than force it. Because the dense area around the reference point can vary, I'm reluctant to simply exclude every point outside a fixed radius.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN, OPTICS

fig, ax = plt.subplots(figsize = (6,6))
ax.grid(False)

df = pd.DataFrame({   
    'Item' : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],                                
    'x' : [-4.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,10.0,-2.0,2.0,5.0,7.5,15.0,0.0,-22.0,-20.0,-20.0,-6.5,20.5,0.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0,-2.0,0.0,3.0,-3.0,-7.0,-7.5,-9.0,-4.0,1.5,-1.0,-5.0,-4.5,-3.7,15.0,-20.0,-22.0,-20.0,-20.0,-12.0,20.5,6.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0],
    'y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,0.0,0.0,-2.0,-2.0,-7.0,-0.5,-10.5,-7.5,0.0,16.0,-15.0,5.0,13.5,3.0,-20.0,2.0,-17.5,-15,19.0,20.0,4.0,-2.0,0.0,0.0,2.5,2.0,-1.5,5.0,0.0,3.5,2.0,-5.5,-6.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,6.0,-20.0,2.0,-17.5,-15,19.0,20.0],     
    'X_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0],
    'Y_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0],           
   })

# not spherical
# .copy() avoids a SettingWithCopyWarning when new columns are added below
df = df[df['Item'] == 1].copy()

# spherical but reachable area too small
#df = df[df['Item'] == 2]

df['distance'] = np.sqrt((df['X_Ref'] - df['x'])**2 + (df['Y_Ref'] - df['y'])**2)

Y_sklearn = df[['x','y']].values

ax.scatter(df['x'], df['y'], marker = 'o', s = 5)
ax.scatter(df['X_Ref'], df['Y_Ref'], c = 'w', edgecolor = 'k', marker = 'o', s = 7.5, zorder = 2)

#clusterer = DBSCAN(eps = 7.5, min_samples = 3)
#labels_clusters = clusterer.fit_predict(Y_sklearn)

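# a float min_cluster_size is read as a fraction of the number of samples
clusterer = OPTICS(min_samples = 2, xi = 0.25, min_cluster_size = 0.25, max_eps = 5)
# fit_predict fits the model and returns the labels, so a separate fit() call is not needed
labels_clusters = clusterer.fit_predict(Y_sklearn)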

#Add cluster labels as a new column to original DataFrame.
df['cluster'] = labels_clusters
df['cluster'] = df['cluster'].astype('category')

sns.scatterplot(data = df,
            x = 'x',
            y = 'y',
            hue = 'cluster',
            ax = ax,
            legend = 'full',                
            )

Item 1: points to the right of the radius should be excluded from the core points

Item 2: points within the radius should be included in the core points

Answers (1)

By clustering using distance
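
One way to read this heading is as a one-dimensional clustering on each point's distance from the reference point. The following is a minimal sketch under that assumption, not necessarily the approach the answer intended: sort the distances, look for the first large gap between consecutive distances, and keep everything inside it. The function name `cluster_by_distance_gap` and the parameter `min_gap` are hypothetical and chosen for illustration only.

```python
import numpy as np

def cluster_by_distance_gap(x, y, x_ref, y_ref, min_gap=2.5):
    """Label points 1 (core) or 0 (outside) using the first large gap
    in the sorted distances from the reference point.

    min_gap is a hypothetical tuning parameter, in the same units as x and y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dist = np.hypot(x - x_ref, y - y_ref)
    if dist.size == 0:
        return np.zeros(0, dtype=int), 0.0

    sorted_dist = np.sort(dist)
    gaps = np.diff(sorted_dist)             # gap between consecutive distances
    big = np.flatnonzero(gaps > min_gap)    # positions of gaps above the threshold

    # The cut-off radius is the distance of the last point before the first
    # large gap; with no large gap, every point is kept.
    radius = sorted_dist[big[0]] if big.size else sorted_dist[-1]
    labels = (dist <= radius).astype(int)
    return labels, float(radius)

# Small made-up example: four points near the origin, two far away.
labels, radius = cluster_by_distance_gap(
    [0.5, 1.0, 1.5, 2.0, 10.0, 11.0], [0.0] * 6, x_ref=0.0, y_ref=0.0
)
```

Because the cut-off comes from the data rather than a fixed radius, the cluster stays circular around the reference point while its size adapts per data set. A fixed `min_gap` still has to suit the sparsity of each set; scaling it by the median gap is one possible refinement.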

[Figure: result for df]

[Figure: result for df1]

By using marginal increase of area
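
A possible interpretation of this heading, offered as a sketch rather than as the answer's own code: order the points by distance from the reference point and grow a circle one point at a time. Each new point enlarges the circle's area from pi * r_(k-1)^2 to pi * r_k^2; growth stops when that marginal increase is much larger than the average area per point already inside. The function name `cluster_by_marginal_area` and the parameters `factor` and `min_points` are assumptions made for illustration.

```python
import numpy as np

def cluster_by_marginal_area(x, y, x_ref, y_ref, factor=3.0, min_points=3):
    """Label points 1 (core) or 0 (outside) by growing a circle around the
    reference point until adding the next point costs too much extra area.

    factor and min_points are hypothetical tuning parameters.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dist = np.hypot(x - x_ref, y - y_ref)
    order = np.argsort(dist)
    r = dist[order]
    area = np.pi * r ** 2                   # circle area enclosing the first k+1 points

    n_keep = r.size
    for k in range(min_points, r.size):
        # k points are already accepted (indices 0..k-1); index k is the candidate.
        mean_area_per_point = area[k - 1] / k
        marginal = area[k] - area[k - 1]
        if marginal > factor * mean_area_per_point:
            n_keep = k
            break

    labels = np.zeros(r.size, dtype=int)
    labels[order[:n_keep]] = 1
    radius = float(r[n_keep - 1]) if n_keep else 0.0
    return labels, radius

# Small made-up example: four points on the unit circle, two far away.
labels, radius = cluster_by_marginal_area(
    [1.0, 0.0, -1.0, 0.0, 5.0, 0.0],
    [0.0, 1.0, 0.0, -1.0, 0.0, 5.2],
    x_ref=0.0, y_ref=0.0,
)
```

The stopping rule is relative to the density already found, so a sparse data set tolerates larger jumps than a dense one, which may reduce the need to retune a radius for every Item.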

[Figure: result for df]

[Figure: result for df1]