How to create a synthetic dataset

Question

I want to run some Machine Learning clustering algorithms on some big data.
The problem is that I'm having troubles to find interesting data for this purpose on the web.
Also, usually this data might be inconvenient to use because the format won't fit for me.
I need a txt file which each line represents a mathematical vector, each element seperated by space, for example:

1 2.2 3.1
1.12 0.13 4.46
1 2 54.44

Therefore, I decided to first run those algorithms on some synthetic data which I'll create by my self.
How can I do this in a smart way with numpy? In smart way, I mean that it won't be generated uniformly, because it's a little bit boring. How can I generate some interesting clusters?

I want to have 5GB / 10GB of data at the moment.

lejlot · Accepted Answer

There is no one good answer for such question. What is interesting? For clustering, unfortunately, there is no such thing as an interesting or even well posed problem. Clustering as such has no well defineid evaluation, consequently each method is equally good/bad, as long as it has well defined internal objective. So k-means will always be good one to minimize inter-cluster euclidean distance and will struggle with sparse data, non-convex, imbalanced clusters. DBScan will always be the best in greedy density based sense and will strugle with diverse density clusters. GMM will be always great fitting on gaussian mixtures, and will strugle with clusters which are not gaussians (for example lines, squares etc.).

From the question one could deduce that you are at the very begining of work with clustering and so need "just anything more complex than uniform", so I suggest you take a look at datasets generators, in particular accesible in scikit-learn (python) http://scikit-learn.org/stable/datasets/ or in clusterSim (R) http://www.inside-r.org/packages/cran/clusterSim/docs/cluster.Gen or clusterGeneration (R) https://cran.r-project.org/web/packages/clusterGeneration/clusterGeneration.pdf

How to create a synthetic dataset

Answers (2)

Related Questions