Python DBSCAN clustering with periodic boundary conditions

I'm a beginner and this is probably beyond my level, but I need it for my thesis, so please forgive my ignorance. My goal is to cluster 3D points using sklearn.cluster.DBSCAN, with periodic boundary conditions applied only along x and y. The easiest way I have found is to use the scipy function pdist on each coordinate, correct for the periodic boundaries, and then combine the results into a distance matrix (in square form) that DBSCAN can digest.

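from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN
import pylab as pl  # assumed: pl refers to pylab

L=40 #box length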
for d in range(data.shape[1]):
  # find all 1-d distances
  pd=pdist(data[:,d].reshape(data.shape[0],1))
  # apply boundary conditions (excluding z distances)
  if (d!=2):
    total+=pd**2


# transform the condensed distance matrix...
total=pl.sqrt(total)
# ...into a square distance matrix
square=squareform(total)
db=DBSCAN(eps=4, metric='precomputed').fit(square)

When I run the code I receive this error:

ValueError: A 2-dimensional array must be passed.

What is the problem? Is there another simple way to reach my goal?

Upvotes: 1

Answers (2)

Xander

Reputation: 5597

There is now also a Python library available, based on scikit-learn, that implements DBSCAN with periodic boundary conditions:

github.com/XanderDW/PBC-DBSCAN

It also supports arbitrarily combining multi-dimensional data with some periodic and some open/closed boundaries.

Code example:

from dbscan_pbc import DBSCAN_PBC
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import numpy as np

### Generate synthetic data
centers = [[0, 0], [1, 0], [2, 0]]
X, _ = make_blobs(n_samples=80, centers=centers, cluster_std=0.1, random_state=0)
X = StandardScaler().fit_transform(X)  # Standardize the data

L = 2.0  # Box size
X = np.mod(X, L)  # Apply periodic boundary conditions

### Apply DBSCAN_PBC
db = DBSCAN_PBC(eps=0.1, min_samples=5).fit(X, pbc_lower=0, pbc_upper=L)

print(db.labels_)

Upvotes: 0

Has QUIT--Anony-Mousse

Reputation: 77485

squareform converts between a square distance matrix and a condensed one stored in a one-dimensional array; given either form, it returns the other. The condensed form is a more memory-efficient representation, but only if you use it from the beginning rather than converting to it later.

Anyway, the condensed form is only used by scipy, not by sklearn. Because Python is dynamically typed, it cannot easily detect this error ahead of time. You need to remove the squareform call!
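To see the two layouts side by side, here is a small check on made-up random points (the array pts and its size are purely illustrative). pdist returns the condensed vector, and squareform maps it to the square matrix and back again. It is the square (n, n) layout that DBSCAN with metric='precomputed' expects.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
pts = rng.random((5, 3))          # 5 illustrative 3-D points

condensed = pdist(pts)            # 1-D array, length n*(n-1)/2
square = squareform(condensed)    # 2-D array, shape (n, n)
back = squareform(square)         # condensed form again

print(condensed.shape, square.shape, back.shape)
# prints: (10,) (5, 5) (10,)
```

Printing the shape of whatever is about to be handed to DBSCAN is a quick way to find out which of the two forms you actually have.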

But I doubt your handling of boundary conditions does what you want. Right now you appear to be ignoring the third axis completely, and computing regular (non-periodic) distances on the remaining two.
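A minimal sketch of what was probably intended, assuming the minimum-image convention: along a periodic axis, a separation larger than half the box is replaced by the box length minus that separation. Here x and y wrap and z stays open, and all three axes contribute to the total. The function name periodic_distance_matrix and the random data are hypothetical; L=40 and eps=4 come from the question. The sketch also assumes every coordinate already lies inside [0, L). The accumulator is kept in condensed form throughout, so one squareform call at the end is what yields the (n, n) matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN


def periodic_distance_matrix(data, box, periodic):
    """Square distance matrix using the minimum-image convention.

    data     : (n, d) array of coordinates, each within [0, box[axis])
    box      : box length along each axis (length d)
    periodic : booleans, True where the axis wraps around
    """
    n, d = data.shape
    total = np.zeros(n * (n - 1) // 2)        # condensed accumulator
    for axis in range(d):
        # absolute 1-D separations along this axis (condensed form)
        sep = pdist(data[:, axis].reshape(n, 1))
        if periodic[axis]:
            # separations longer than half the box wrap around
            sep = np.where(sep > 0.5 * box[axis], box[axis] - sep, sep)
        total += sep ** 2                      # every axis contributes
    return squareform(np.sqrt(total))          # condensed -> (n, n)


L = 40.0                                       # box length from the question
rng = np.random.default_rng(0)
data = rng.random((200, 3)) * L                # illustrative points only

square = periodic_distance_matrix(
    data, box=[L, L, L], periodic=[True, True, False]
)
db = DBSCAN(eps=4, metric="precomputed").fit(square)
print(square.shape, db.labels_.shape)
```

With this convention, two points at x=1 and x=39 in a box of length 40 are 2 apart rather than 38, while the same pair of values along z stays 38 apart. The precomputed matrix needs memory proportional to n squared, so this only suits moderately sized data sets.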

Upvotes: 1
