sshen

Reputation: 1

Advice on clustering and/or dimensionality reduction on a large dataset with a lot of features

I have a dataset with 38,000 features and 7 million data points (not sure if this is relevant, but a lot of the features are sparse). I was tasked with doing some clustering on this data. I thought a good place to start would be PCA or some other form of dimensionality reduction. However, all of these are extremely slow, with very high time complexity for data of this size and nature. Does anyone have any advice on how to go about clustering this dataset? Should I be doing dimensionality reduction, and if so, how could I feasibly do it? Any advice is appreciated.

So far I've tried PCA and kernel PCA. I've also tried these on a subset of the data with only 250,000 points, but it still takes a crazy amount of time.

Upvotes: 0

Views: 217

Answers (1)

micans

Reputation: 1116

Given the ginormous size of the data, I would look into stream clustering algorithms that incrementally build a clustering as new data becomes available. An example is BIRCH, found on that page. Further thoughts: (1) You could use any resulting clustering to partition the data and re-cluster each part with another approach. (2) The sparsity of the features may pose additional problems even with stream clustering. (3) I would test on a series of much smaller sets of both data points and features. Hopefully there are logical choices available for such reductions. (4) With such a large, sparse feature set I would normally propose thinking of the data as a network, with an edge between two data points whenever they have a non-empty overlap of features, weighted by a score of that overlap and potentially thresholded. Perhaps there is a way of doing this with a smart data structure. In bioinformatics, networks with millions of nodes are common, but usually parallel/distributed compute is available to throw at the problem.
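As a rough illustration of the incremental idea, here is a minimal sketch using scikit-learn's `Birch` estimator and its `partial_fit` method, fed one chunk at a time. The toy 2-D data, the chunk size, the `threshold` value, and the helper name `stream_birch` are all assumptions made for the example; the real data would need its own chunk loader and tuning.

```python
import numpy as np
from sklearn.cluster import Birch


def stream_birch(chunks, threshold=0.5, n_clusters=None):
    """Feed chunks to BIRCH one at a time (hypothetical helper)."""
    model = Birch(threshold=threshold, n_clusters=n_clusters)
    for chunk in chunks:
        # Each call updates the CF tree; earlier chunks need not stay in memory.
        model.partial_fit(chunk)
    return model


# Toy stand-in for the real data: three well-separated blobs in 2-D,
# delivered as five chunks of 200 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
chunks = [
    centers[rng.integers(0, 3, size=200)] + rng.normal(scale=0.3, size=(200, 2))
    for _ in range(5)
]

model = stream_birch(chunks, threshold=1.5, n_clusters=3)
center_labels = model.predict(centers)
```

The `threshold` parameter controls how coarse the subclusters are, so it is the first thing to experiment with on a small pilot subset.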

Most of all I would try to think of a pilot to figure out whether clustering holds some promise for the data set. "Would one expect clustering to work" is akin to asking "Do triangles occur more often than you'd expect in the data, do they clump together, are the clumps meaningful", which perhaps can be prodded at without full-scale clustering.
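One way to prod at the triangle question on a small pilot subset is sketched below. It builds the feature-overlap network with SciPy sparse matrices, then compares the graph's transitivity (the fraction of connected triples that close into triangles) with its edge density, which is roughly the transitivity you would expect from a random graph with the same number of edges. The function names, the `min_shared` threshold, and the toy matrix are assumptions for illustration. Computing all pairwise overlaps is only feasible for a subsample, not for 7 million points.

```python
import numpy as np
from scipy import sparse


def overlap_graph(X, min_shared=1):
    """Adjacency matrix linking points that share >= min_shared nonzero features."""
    B = sparse.csr_matrix(X).astype(bool).astype(np.int32)  # binarise features
    shared = (B @ B.T).tocoo()  # shared[i, j] = number of features i and j have in common
    keep = (shared.row != shared.col) & (shared.data >= min_shared)
    ones = np.ones(int(keep.sum()), dtype=np.int64)
    return sparse.csr_matrix(
        (ones, (shared.row[keep], shared.col[keep])), shape=shared.shape
    )


def transitivity(A):
    """3 * triangles / connected triples for a symmetric 0/1 adjacency matrix."""
    A = A.tocsr()
    six_triangles = (A @ A).multiply(A).sum()  # equals trace(A^3)
    deg = np.asarray(A.sum(axis=1)).ravel()
    two_triples = (deg * (deg - 1)).sum()
    return float(six_triangles / two_triples) if two_triples else 0.0


def density(A):
    """Fraction of possible edges present; random-graph baseline for transitivity."""
    n = A.shape[0]
    return A.nnz / (n * (n - 1))


# Toy data: points 0-2 share features 0-2, points 3-5 share features 3-5.
X = np.zeros((6, 6))
X[:3, :3] = 1.0
X[3:, 3:] = 1.0

A = overlap_graph(X, min_shared=2)
t, d = transitivity(A), density(A)
```

If the transitivity comes out well above the density on several subsamples, that suggests the network has clumps worth clustering; if the two are similar, full-scale clustering may not find much.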

Upvotes: 0
