Advice on clustering and/or dimensionality reduction on a large dataset with a lot of features

Question

I have a dataset with 38,000 features and 7 million data points (not sure if this is relevant, but a lot of the features are sparse). I was tasked with clustering this data. I thought a good place to start would be PCA or some other form of dimensionality reduction. However, the methods I have tried are extremely slow, and their time complexity scales badly for data of this size and nature. Does anyone have advice on how to go about clustering this dataset? Should I be doing dimensionality reduction, and if so, how could I feasibly do it? Any advice is appreciated.

So far I've tried PCA and kernel PCA. I've also tried both on a subset of the data with only 250,000 points, but even that takes an impractically long time.
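
For a sense of scale, here is a rough back-of-the-envelope sketch of the memory these sizes would imply. It assumes dense float64 storage (8 bytes per value), which is an assumption made purely for illustration: the actual data is sparse, so the first two figures are upper bounds. The helper names `dense_bytes` and `to_gb` are hypothetical.

```python
# Rough memory estimates for the sizes mentioned in the question.
# Assumption: dense float64 storage (8 bytes per value). Sparse storage
# would need far less for the data matrix itself.
BYTES_PER_FLOAT64 = 8


def dense_bytes(n_rows: int, n_cols: int) -> int:
    """Bytes needed to hold an n_rows x n_cols dense float64 matrix."""
    return n_rows * n_cols * BYTES_PER_FLOAT64


def to_gb(n_bytes: int) -> float:
    """Convert bytes to gigabytes (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9


n_features = 38_000
n_points_full = 7_000_000
n_points_subset = 250_000

# Full data matrix and the 250,000-point subset, if stored densely.
full_matrix_gb = to_gb(dense_bytes(n_points_full, n_features))
subset_matrix_gb = to_gb(dense_bytes(n_points_subset, n_features))

# Kernel PCA works with an n x n kernel matrix over the data points,
# which is dense regardless of how sparse the features are.
subset_kernel_gb = to_gb(dense_bytes(n_points_subset, n_points_subset))

# A feature-by-feature covariance matrix, as used by covariance-based PCA.
covariance_gb = to_gb(dense_bytes(n_features, n_features))

print(f"Full data, dense:         {full_matrix_gb:.1f} GB")
print(f"Subset data, dense:       {subset_matrix_gb:.1f} GB")
print(f"Subset kernel matrix:     {subset_kernel_gb:.1f} GB")
print(f"Feature covariance:       {covariance_gb:.1f} GB")
```

Under these assumptions the full matrix comes to roughly 2.1 TB and the subset to 76 GB, while the kernel matrix for the 250,000-point subset alone would be about 500 GB. That may explain why kernel PCA in particular stays slow even on the subset.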

Answers (1)
