Tom Updike

Reputation: 73

Memory issue with K-means clustering

I'm trying to cluster key phrases from a search history using k-means clustering, but I run into the error "cannot allocate vector of size 30gb" when I run the stringdistmatrix() command. The dataset I am using includes 63455 unique elements, so the resulting 63455 x 63455 matrix of 8-byte doubles requires about 30 GB of memory. Is there a way to lower the memory requirements of the process without losing too much significance?

Below is the code I am attempting to run, if you happen to notice any other errors:

#Load the package that provides stringdistmatrix()
library(stringdist)

#Set data source, format for use, check consistency
MyData <- c('Create company email', 'email for business', 'free trial', 'corporate pricing', 'email cost')
summary(MyData)


#Define number of clusters
kclusters = round(0.90 * length(unique(MyData)))

#Compute distance between words
uniquedata <- unique(as.character(MyData))
distancemodels <- stringdistmatrix(uniquedata, uniquedata, method="jw")

#Create Dendrogram
rownames(distancemodels) <- uniquedata
hc <- hclust(as.dist(distancemodels))
par(mar = rep(2, 4))
plot(hc)

#Create clusters from grouped keywords
dfClust <- data.frame(uniquedata, cutree(hc, k=kclusters))
names(dfClust) <- c('data','cluster')
plot(table(dfClust$cluster))

#End view (the data viewer is View(), with a capital V)
View(dfClust)

Upvotes: 1

Views: 1070

Answers (1)

Peter Smittenaar

Reputation: 331

I don't know of any way to avoid generating the distance matrix when doing k-means clustering.

You could consider alternative clustering algorithms that have been devised to avoid memory issues. The main one that comes to mind is CLARA (Clustering Large Applications; Kaufman and Rousseeuw 1990). In R, it's as simple as cluster::clara, taking numeric data only (like k-means) and requiring you to set k in advance.

Read the manual (?cluster::clara), especially the part on the number of samples (the samples argument), which you should set higher than the default of 5. Hope that helps!
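For illustration, here is a minimal sketch of a clara call on numeric data. The matrix x is synthetic and made up for this example, and the values chosen for samples and sampsize are arbitrary rather than recommendations:

```r
library(cluster)  # recommended package, normally installed alongside R

set.seed(42)
# Hypothetical numeric data: 10,000 points in 2 dimensions, in two separated groups
x <- rbind(
  matrix(rnorm(10000, mean = 0), ncol = 2),
  matrix(rnorm(10000, mean = 8), ncol = 2)
)

# samples defaults to 5, so raise it; sampsize is the number of rows per sub-sample
fit <- clara(x, k = 2, samples = 50, sampsize = 200)

table(fit$clustering)  # number of points assigned to each cluster
fit$medoids            # the representative point of each cluster
```

Only sampsize rows are clustered at a time, so memory use depends on sampsize rather than on the full number of rows.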

edit: just noticed you don't actually have numeric data to start with, so perhaps CLARA is not all that helpful. You could perhaps use some of the same principles as CLARA, including sampling your data multiple times to reduce the memory footprint and combining results later on.
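A rough sketch of that sampling idea for strings, assuming the stringdist package from the question. The function cluster_by_sample and its parameters sample_size and chunk_size are hypothetical names invented for this example. It is a simplification, not CLARA itself: it draws a single sample, whereas CLARA repeats the sampling and keeps the best result.

```r
library(stringdist)

# Hypothetical helper: cluster a random sample hierarchically, then assign
# every string to its nearest sample medoid in chunks, so that no n-by-n
# matrix is ever held in memory.
cluster_by_sample <- function(strings, k, sample_size = 2000,
                              chunk_size = 5000, method = "jw") {
  strings <- unique(as.character(strings))
  smp <- sample(strings, min(sample_size, length(strings)))

  # Distance matrix on the sample only (sample_size^2 entries instead of n^2)
  d <- as.matrix(stringdistmatrix(smp, smp, method = method))
  hc <- hclust(as.dist(d))
  grp <- cutree(hc, k = k)

  # Medoid of each sample cluster: the member with the smallest
  # total distance to the other members of its cluster
  medoids <- vapply(split(seq_along(smp), grp), function(idx) {
    smp[idx[which.min(rowSums(d[idx, idx, drop = FALSE]))]]
  }, character(1))

  # Assign all strings to the closest medoid, one chunk at a time
  starts <- seq(1, length(strings), by = chunk_size)
  cluster <- unlist(lapply(starts, function(s) {
    chunk <- strings[s:min(s + chunk_size - 1, length(strings))]
    dm <- stringdistmatrix(chunk, medoids, method = method)
    max.col(-dm, ties.method = "first")  # column index of the smallest distance
  }))

  data.frame(data = strings, cluster = cluster, stringsAsFactors = FALSE)
}

# Usage with the toy data from the question
set.seed(1)
MyData <- c('Create company email', 'email for business', 'free trial',
            'corporate pricing', 'email cost')
res <- cluster_by_sample(MyData, k = 3)
```

Note that k cannot exceed the sample size, so a cluster count of 90% of the unique strings, as in the question, would not fit this approach without a much larger sample.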

Upvotes: 1
