Reputation: 91
Suppose we have the following functions: euclid calculates the Euclidean distance, and k_means implements the full k-means algorithm.
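euclid <- function(points1, points2) {
  # One row per point in points1, one column per point in points2
  distanceMatrix <- matrix(NA, nrow=nrow(points1), ncol=nrow(points2))
  for(i in seq_len(nrow(points2))) {
    # Distance from every row of points1 to the i-th row of points2
    distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
  }
  distanceMatrix
}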
k_means <- function(x, centers, distFun, nItter) {
  clusterHistory <- vector(nItter, mode="list")
  centerHistory <- vector(nItter, mode="list")
  for(i in 1:nItter) {
    distsToCenters <- distFun(x, centers)
    clusters <- apply(distsToCenters, 1, which.min)
    centers <- apply(x, 2, tapply, clusters, mean)
    # Saving history
    clusterHistory[[i]] <- clusters
    centerHistory[[i]] <- centers
  }
  list(clusters=clusterHistory, centers=centerHistory)
}
test=data # A data.frame
ktest=as.matrix(test) # Turn into a matrix
centers <- ktest[sample(nrow(ktest), 4),] # Sample some centers, 4 for example
result <- k_means(ktest, centers, euclid, 4) # 4 iterations
print(result)
When tested with a matrix of data, the output looks something like:
$clusters
$clusters[[1]]
[1] 1 3 3 1 1 1 1 3 3 2 3 1 1 1 1 1 3 3 1 3 1 1 1 2 1 1 1 1 2 1 1 3 1 1 3 3 1 2 2 1 1 1 2 2 3 2 2 2
[49] 2 2 1 3 1 3 1 3 2 3 1 3 3 2 3 2 1 2 3 1 3 1 1 2 3 1 3 1 3 2 1 3 1 3 2 1 1 2 2 1 1 1 1 1 2 1 3 3
$clusters[[2]]
[1] 1 3 3 1 1 3 1 3 3 2 3 1 1 1 1 1 3 3 1 3 1 3 1 2 1 1 1 1 2 1 1 3 1 1 3 3 1 1 2 1 1 1 3 2 3 2 2 2
[49] 3 2 3 3 1 3 1 3 2 3 1 3 3 2 3 2 3 2 3 1 3 3 1 1 3 1 3 1 3 2 1 3 3 3 3 1 1 2 2 1 3 1 1 1 2 1 3 3
$clusters[[3]]
[1] 1 3 3 1 1 3 1 3 3 2 3 1 1 1 1 1 3 3 1 3 1 3 1 2 1 1 1 1 2 1 1 3 1 1 3 3 1 1 2 1 1 1 3 2 3 2 2 2
[49] 3 2 3 3 1 3 1 3 2 3 1 3 3 2 3 2 3 2 3 1 3 3 1 1 3 1 3 1 3 2 1 3 3 3 3 1 1 2 2 1 3 1 1 1 2 1 3 3
$clusters[[4]]
[1] 1 3 3 1 1 3 1 3 3 2 3 1 1 1 1 1 3 3 1 3 1 3 1 2 1 1 1 1 2 1 1 3 1 1 3 3 1 1 2 1 1 1 3 2 3 2 2 2
[49] 3 2 3 3 1 3 1 3 2 3 1 3 3 2 3 2 3 2 3 1 3 3 1 1 3 1 3 1 3 2 1 3 3 3 3 1 1 2 2 1 3 1 1 1 2 1 3 3
This continues for the number of iterations specified (4 in this case).
However, I'd like to edit the k_means function so that it stops as soon as two consecutive iterations produce the same output. You can see here that this occurs at $clusters[[3]], which is the same as $clusters[[2]]. However, $clusters[[4]] is still unnecessarily printed. Can anyone advise where specifically to edit this, please?
Upvotes: 0
Views: 40
Reputation: 457
Include a break statement as follows:
k_means <- function(x, centers, distFun, nItter) {
  clusterHistory <- vector(nItter, mode="list")
  centerHistory <- vector(nItter, mode="list")
  for(i in 1:nItter) {
    distsToCenters <- distFun(x, centers)
    clusters <- apply(distsToCenters, 1, which.min)
    centers <- apply(x, 2, tapply, clusters, mean)
    # Saving history
    clusterHistory[[i]] <- clusters
    centerHistory[[i]] <- centers
    if(i > 1) {
      # Stop if duplicated result
      if(identical(clusterHistory[[i]], clusterHistory[[i-1]])) {
        break
      }
    }
  }
  list(clusters=clusterHistory, centers=centerHistory)
}
You can extend it to also compare centerHistory if needed.
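One thing to watch with break: the history lists are pre-allocated with nItter slots, so the iterations that never ran would still show up as NULL entries in the result. The following is a rough sketch of one way to handle that, not a definitive version. The name k_means_stop and the nDone counter are my own additions for illustration, and euclid is repeated (lightly tidied) only so the snippet runs on its own.

```r
# Euclidean distance from every row of points1 to every row of points2
euclid <- function(points1, points2) {
  distanceMatrix <- matrix(NA, nrow = nrow(points1), ncol = nrow(points2))
  for (i in seq_len(nrow(points2))) {
    distanceMatrix[, i] <- sqrt(rowSums(t(t(points1) - points2[i, ])^2))
  }
  distanceMatrix
}

# Hypothetical variant: stops early and drops the unused history slots
k_means_stop <- function(x, centers, distFun, nItter) {
  clusterHistory <- vector("list", nItter)
  centerHistory <- vector("list", nItter)
  nDone <- 0  # number of iterations actually run
  for (i in seq_len(nItter)) {
    distsToCenters <- distFun(x, centers)
    clusters <- apply(distsToCenters, 1, which.min)
    centers <- apply(x, 2, tapply, clusters, mean)
    clusterHistory[[i]] <- clusters
    centerHistory[[i]] <- centers
    nDone <- i
    # Stop when assignments and centers match the previous iteration
    if (i > 1 &&
        identical(clusterHistory[[i]], clusterHistory[[i - 1]]) &&
        isTRUE(all.equal(centerHistory[[i]], centerHistory[[i - 1]]))) {
      break
    }
  }
  # Keep only the iterations that were run, so no NULL entries are returned
  keep <- seq_len(nDone)
  list(clusters = clusterHistory[keep], centers = centerHistory[keep])
}
```

Since the centers are computed directly from the cluster assignments, identical assignments already imply identical centers, so the all.equal check is mostly there to show where such a comparison would go. The last stored iteration is the repeated one; use seq_len(nDone - 1) instead if you would rather drop it when the loop stopped early.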
Upvotes: 1