Teddy

Reputation: 131

Supervised Machine Learning: Classify types of clusters of data based on shape and density (Python)

I have multiple sets of data, and in each set of data there is a region that is somewhat banana shaped and two regions that are dense blobs. I have been able to differentiate these regions from the rest of the data using a DBSCAN algorithm, but I'd like to use a supervised algorithm to have the program then know which cluster is the banana, and which two clusters are the dense blobs, and I'm not sure where to start.

As there are 3 categories (banana, blob, neither), would two separate logistic regressions be the best approach (one to decide banana vs. not-banana, one to decide blob vs. not-blob)? Or is there a good way to incorporate all 3 categories into one neural network?

Here are three data sets. In each, the banana is red. In the 1st, the two blobs are green and blue; in the 2nd, the blobs are cyan and green; and in the 3rd, the blobs are blue and green. Now that the program has differentiated the regions, I'd like it to label the banana and blob regions so I don't have to hand-pick them every time I run the code.

[Images: Data set 1, Data set 2, Data set 3]

Upvotes: 2

Views: 3919

Answers (3)

Has QUIT--Anony-Mousse

Reputation: 77454

I believe you are still unclear about what you want to achieve.

That of course makes it hard to give you a good answer.

Your data seems to be 3D. In 3D you could, for example, compute the alpha shape of a cluster and check whether it is convex: your "banana" probably is not convex, while your blobs are.

You could also measure, e.g., whether the cluster center actually lies inside your cluster. If it doesn't, the cluster is not a blob. You can also measure whether the extents along the three axes are the same or not.
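As a rough sketch of the last two checks (it does not compute an alpha shape), the snippet below uses NumPy only. The function names, the synthetic clusters, and the idea of comparing the centroid gap against the median nearest-neighbour spacing are my own assumptions for illustration, not something taken from the question's data.

```python
import numpy as np

def centroid_gap_ratio(points):
    """Distance from the centroid to the nearest cluster point, divided by
    the median nearest-neighbour spacing inside the cluster.
    A large value suggests the centroid sits in empty space."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    gap = np.linalg.norm(points - centroid, axis=1).min()
    # Full pairwise distance matrix; fine for a few thousand points.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    spacing = np.median(dists.min(axis=1))
    return gap / spacing

def axis_extent_ratio(points):
    """Ratio of the largest to the smallest standard deviation along the
    principal axes. Close to 1 means similar extents in all directions."""
    points = np.asarray(points, dtype=float)
    eigvals = np.linalg.eigvalsh(np.cov(points, rowvar=False))
    eigvals = np.clip(eigvals, 1e-12, None)  # guard against division by zero
    return float(np.sqrt(eigvals[-1] / eigvals[0]))

# Synthetic stand-ins for two clusters (hypothetical data).
rng = np.random.default_rng(0)
blob = rng.normal(0.0, 1.0, size=(500, 3))
t = rng.uniform(0.0, np.pi, size=500)
arc = np.column_stack([5 * np.cos(t), 5 * np.sin(t), np.zeros(500)])
banana = arc + rng.normal(0.0, 0.2, size=(500, 3))

for name, cluster in [("blob", blob), ("banana", banana)]:
    print(name, centroid_gap_ratio(cluster), axis_extent_ratio(cluster))
```

Where to put the thresholds between "blob" and "not a blob" depends on your data, so treat both numbers as candidate features rather than a finished test.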

But in the end, you need some notion of "banana".

Upvotes: 0

lejlot

Reputation: 66775

As you are using Python, one of the best options is to start with a large library that offers many different approaches, so you can choose the one that suits you best. One such library is scikit-learn (sklearn): http://scikit-learn.org/stable/ .

Getting back to the problem itself. What are the models you should try?

  • Support Vector Machines - this model has been around for a while and became a gold standard in many fields, mostly due to its elegant mathematical interpretation and ease of use (it has far fewer parameters to worry about than classical neural networks, for instance). It is a binary classification model, but the library will automatically build a multi-class version for you
  • Decision tree - very easy to understand, yet creates quite "rough" decision boundaries
  • Random forest - a model often used in the more statistical community
  • K-nearest neighbours - the simplest approach, but if you can define the shapes of your data so easily, it will provide very good results while remaining very easy to understand

Of course there are many others, but I would recommend starting with these. All of them support multi-class classification, so you do not need to worry about how to encode the problem with three classes: simply create data in the form of two arrays x and y, where x holds the input values (one row per sample) and y is a vector of the corresponding classes (e.g. numbers from 1 to 3).
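A minimal sketch of that x/y setup with scikit-learn follows. The two feature columns ("elongation" and "density"), the class means, and the numbers themselves are made up for illustration; in practice each row of x would hold whatever descriptors you compute for one cluster.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical training data: one row per cluster, columns = [elongation, density].
# Classes: 1 = banana, 2 = blob, 3 = neither.
class_means = {1: [6.0, 1.0], 2: [1.0, 5.0], 3: [3.0, 0.2]}
x = np.vstack([rng.normal(mean, 0.3, size=(30, 2)) for mean in class_means.values()])
y = np.repeat(list(class_means.keys()), 30)

models = {
    "svm": SVC(),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# A new, unlabeled cluster described by the same two features.
new_cluster = np.array([[5.8, 1.1]])

predictions = {}
for name, model in models.items():
    model.fit(x, y)  # the same fit/predict interface for every model
    predictions[name] = int(model.predict(new_cluster)[0])

print(predictions)
```

Because every model shares the fit/predict interface, swapping one classifier for another is a one-line change.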

Visualization of different classifiers from the library:

[Image: classifier comparison]

So the remaining question is how to represent the shape of a cluster. We need a fixed-length, real-valued vector, so what can the features actually represent?

  • center of mass (if position matters)
  • skewness/kurtosis
  • covariance matrix (or its eigenvalues) (if rotation matters)
  • some kind of local density estimation
  • histograms of some statistics (like a histogram of the Euclidean distances between pairs of points on the shape)
  • many, many more!
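To make a few of these concrete, here is one possible way to turn a cluster of any size into a fixed-length vector using NumPy only. The function name, the bin count, and the choice to normalise distances by the largest pairwise distance are assumptions of this sketch, and local density estimation is left out.

```python
import numpy as np

def shape_features(points, n_bins=8):
    """Fixed-length descriptor for an (n_points, n_dims) cluster.
    Assumes at least two distinct points."""
    points = np.asarray(points, dtype=float)

    # Center of mass (only useful if position matters).
    center = points.mean(axis=0)
    centered = points - center

    # Eigenvalues of the covariance matrix, largest first.
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]

    # Per-axis skewness and excess kurtosis from standardised coordinates.
    std = centered.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero on a flat axis
    z = centered / std
    skewness = (z ** 3).mean(axis=0)
    kurtosis = (z ** 4).mean(axis=0) - 3.0

    # Histogram of pairwise Euclidean distances, scaled to [0, 1].
    i, j = np.triu_indices(len(points), k=1)
    dists = np.linalg.norm(points[i] - points[j], axis=1)
    hist, _ = np.histogram(dists / dists.max(), bins=n_bins, range=(0.0, 1.0))
    hist = hist / hist.sum()

    return np.concatenate([center, eigvals, skewness, kurtosis, hist])

# Clusters with different point counts give vectors of the same length.
rng = np.random.default_rng(1)
small = shape_features(rng.normal(size=(50, 3)))
large = shape_features(rng.normal(size=(200, 3)))
print(small.shape, large.shape)
```

Each cluster then becomes one row of x, regardless of how many points it contains.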

There is a quite comprehensive list and detailed overview here (for three-dimensional objects): http://web.ist.utl.pt/alfredo.ferreira/publications/DecorAR-Surveyon3DShapedescriptors.pdf

There is also a quite informative presentation that describes some descriptors and how to make them scale/position/rotation invariant (if that is relevant here): http://www.global-edge.titech.ac.jp/faculty/hamid/courses/shapeAnalysis/files/3.A.ShapeRepresentation.pdf

Upvotes: 4

Christopher Lawless

Reputation: 1087

Could neural networks help? The "pybrain" library might be the best for it.

You could set up the neural net as a feed-forward network, with one output for each class of object you expect the data to contain.

Edit: sorry if I have completely misinterpreted the question. I'm assuming you have preexisting data you can use to train the networks to differentiate clusters.

If there are 3 categories, you could have 3 outputs on the NN, or perhaps a separate NN for each category that simply outputs a true or false value.
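The two setups differ mainly in how the training targets are encoded. The sketch below shows only that encoding step in NumPy; it does not use pybrain, and the label numbering and the example output values are made up for illustration.

```python
import numpy as np

# Hypothetical labels for four clusters: 0 = banana, 1 = blob, 2 = neither.
labels = np.array([0, 1, 1, 2])
n_classes = 3

# Option 1: a single network with three outputs, trained on one-hot targets.
one_hot = np.eye(n_classes)[labels]

# Option 2: one network per class, each trained on a true/false target.
binary_targets = {c: (labels == c).astype(float) for c in range(n_classes)}

# Decoding for option 1: the class with the largest output wins.
# These output rows stand in for what a trained network might return.
outputs = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1]])
predicted = outputs.argmax(axis=1)
print(one_hot, binary_targets, predicted)
```

With a single three-output network every cluster gets exactly one label, whereas separate true/false networks can disagree (for example, none or two of them saying "true"), so you need a rule for those cases.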

Upvotes: 0
