goingdeep

Reputation: 121

Looping through a list of dataframes to return a matrix of k-means clusters with fixed centroids in R

This is my second post, and logically it comes before the first one, which I'll link here:

creating a matrix/dataframe with two for loops in R

I won't repeat the newbie mistake I made there so here you go with a copy of the data:

 > dput(head(dfn,1))
structure(c(-0.936707666207839, 0.684585833497428, -1.15671769161442, 
-0.325882814790034, 0.334512025995239, 0.335054315282587, 0.0671142954097706, 
-0.544867778136127, -0.958378799317135, 1.26734044843021, -0.483611966400142, 
-0.0781514731365092, -0.671994127070641, 0.332218249471269, 0.942550991112822, 
0.15534532610427, 0.192944412985922, 0.206169118270958, 0.424191119850985, 
-0.193936625653784, -0.574273356856365, -0.176553706556564, 0.696013509222779, 
0.118827262744793, 0.0649996884597108, 0.470171960447926, -0.570575475596488, 
0.336490371668436, 0.475005575251838, 0.010357165551236, 0.284525279467858, 
0.523668394513643, -0.0290958105736766, 0.62018540798656, 1.37452329937098, 
0.456726128895017), .Dim = c(1L, 36L), .Dimnames = list(NULL, 
    c("2015-01-30", "2015-02-27", "2015-03-31", "2015-04-30", 
    "2015-05-29", "2015-06-30", "2015-07-31", "2015-08-31", "2015-09-30", 
    "2015-10-30", "2015-11-30", "2015-12-31", "2016-01-29", "2016-02-29", 
    "2016-03-31", "2016-04-29", "2016-05-31", "2016-06-30", "2016-07-29", 
    "2016-08-31", "2016-09-30", "2016-10-31", "2016-11-30", "2016-12-30", 
    "2017-01-31", "2017-02-28", "2017-03-31", "2017-04-28", "2017-05-31", 
    "2017-06-30", "2017-07-31", "2017-08-31", "2017-09-29", "2017-10-31", 
    "2017-11-30", "2017-12-29")))

It is a time series dataset of 417 rows with 36 time points (one per month for the last 3 years).

Here's the code I used to create a list of dataframes:

ProgrSubset <- function(x,i) { x[,i:sum(i,11)] }
dfList <- lapply(1:25, function(x) ProgrSubset(dfn, x) )

dfList is then a list of 25 dataframes, subsetted from the original one by a rolling window of 12 months.

Now I want to run a k-means algorithm on each dataframe of the list and store the cluster numbers for each iteration in a matrix called it_mat.

But here's the catch: I want the centroids to be the ones from the previous run (having them fixed from the first run would also be great).

I have no problem doing it "by hand":

it_mat <- cbind(ref_data$sec_id)
k = 18
cl <- kmeans(dfList[[1]], centers = k, nstart = 10)
it_mat <- cbind(it_mat, cl$cluster)
head(it_mat) #first iteration

colnames(cl$centers) <- colnames(dfn[,2:13])
k <- cl$centers
cl <- kmeans(dfList[[2]], centers = k, nstart = 10)
it_mat <- cbind(it_mat, cl$cluster)
head(it_mat) #second iteration

It should then be straightforward to loop this over the list of dataframes, but it doesn't work: the for loop I devised only returns a matrix with just the first iteration:

it_mat <- cbind(ref_data$sec_id)
for(i in 1:25){
    if(i == 1){
        k = 18
        cl <- kmeans(dfList[[i]], centers = k, nstart = 10)
        it_mat <- cbind(it_mat, cl$cluster)
    }else{
        colnames(cl$centers) <- colnames(dfn[,i:i+11])
        k = cl$centers
        cl <- kmeans(dfList[[i]], centers = k, nstart = 10)
        it_mat <- cbind(it_mat, cl$cluster)
    }
}

Maybe it stops after this error: Error: empty cluster: try a better set of initial centers?

But I don't care if a cluster is empty.

I've also tried to loop just the subsequent iterations after the first one, to make it simpler without the if and the else:

for(i in 2:25){
    colnames(cl$centers) <- colnames(dfn[,2:13])
    k <- cl$centers
    cl <- kmeans(dfList[[i]], centers = k, nstart = 10)
    it_mat <- cbind(it_mat, cl$cluster)
}

Still the same result: a matrix with just the first iteration.

I've also tried to use it_mat[ ,i] <- cl$cluster instead of it_mat <- cbind(it_mat, cl$cluster) but it's the same.

I'd appreciate any kind of help, comment or suggestion: I'm probably making some very stupid mistake like in my previous question, or I chose a very difficult path that complicates my job.

My main goal is to understand how cluster composition varies over time in certain time series.

Thanks for your time, everybody.

Upvotes: 1

Views: 1063

Answers (1)

r2evans

Reputation: 160607

Here's a method, though I cannot get it to work with your small dataset and k. Perhaps it'll work better with your actual data. If you don't want to know why/how this works, skip to TL;DR.

Use of Reduce

The trick I'm using is Reduce, whose first argument is a function with two arguments. A trivial demonstration of it is:

Reduce(function(a,b) 2*a+b, 1:4)

This is equivalent to 2*1+2, then 2*(2*1+2)+3, etc. Perhaps uninspiring in its current form. Let's put in some printing, and "accumulate" the data:

Reduce(function(a,b) {
  cat(paste(c(a,b), collapse=","), "\n")
  return(2*a+b)
}, 1:4, accumulate=TRUE)
# 1,2 
# 4,3 
# 11,4 
# [1]  1  4 11 26

So, the first call of the function takes the first element of the vector 1 and the second element 2 and calls the function. Then it takes that returned value (2*1+2 is 4) and the third element of the vector 3 and does its magic. And so on.
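
For readers more comfortable with loops, the accumulate behaviour described above can be sketched as an explicit for loop. This only illustrates the call sequence, it is not how Reduce is implemented internally, and the names f, x and acc are placeholders introduced here.

```r
# Step function from the example above: combine the running value with the next element.
f <- function(a, b) 2 * a + b
x <- 1:4

# acc[i] holds what Reduce(..., accumulate = TRUE) reports at position i.
acc <- numeric(length(x))
acc[1] <- x[1]                      # no init: the first element seeds the running value
for (i in seq_along(x)[-1]) {
  acc[i] <- f(acc[i - 1], x[i])     # previous result goes in as `a`, next element as `b`
}

acc
# [1]  1  4 11 26
all.equal(acc, Reduce(f, x, accumulate = TRUE))
# [1] TRUE
```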

One "assumption" typically made when dealing with Reduce is that the two values must be the same "type" of object. That is not actually required, so I'll bend the rule a little.

Another thing to note is that it is starting on the first two elements of the list, which is also not a strict requirement. If we set init, we can control what a is on the first call.

Reduce(function(a,b) {
  cat(paste(c(a,b), collapse=","), "\n")
  return(2*a+b)
}, 1:4, init=99, accumulate=TRUE)
# 99,1 
# 199,2 
# 400,3 
# 803,4 
# [1]   99  199  400  803 1610

Notice how each element in the list was used in only one function call?
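
Both points, a seeded first call and an accumulator of a different type than the elements, can be seen in one small sketch. The running-total list and the names step and states are made up for this illustration.

```r
# The accumulator is a list, while the elements being consumed are plain numbers.
step <- function(state, value) {
  list(total = state$total + value, n = state$n + 1)
}

states <- Reduce(step, c(10, 20, 30),
                 init = list(total = 0, n = 0),  # seeds `state` on the first call
                 accumulate = TRUE)

length(states)                      # one longer than the input, because init is kept
# [1] 4
sapply(states, function(s) s$total)
# [1]  0 10 30 60
```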

Adding kmeans

So my technique is to think about what we want on the nth call of the function: we want the previous cluster object from n-1 and the nth data. Realize that "previous cluster object" looks a lot like the 199, 400, and 803 from that last example. We'll write a function that assumes the previous cluster object is the first argument, and the data is the second.

my_cascade_kmeans <- function(prevclust, dat) {
  kmeans(dat, centers = prevclust$centers, nstart = 10)
}
Reduce(my_cascade_kmeans, dfList, accumulate = TRUE)

(BTW: I'm collecting the entire cluster output instead of just the centers, since ultimately we want to end up with a list of cluster objects.)

The problem, as you'll quickly find out (and recall from the Reduce examples), is that the first time the function is called, it gets the first two elements of dfList, so prevclust is a data frame rather than a cluster object. So instead, we want to declare the initial value. Two ways to handle that:

  1. Reduce(my_cascade_kmeans, dfList, init=list(centers=5), accumulate=TRUE)

    This is using the convenience that both the cluster object from kmeans and a static list(centers=5) can be indexed with $centers, and they return what I think we need.

  2. Reduce(my_cascade_kmeans, dfList, init=NULL, accumulate=TRUE)

    For this to work, we'd need to modify our function to expect NULL in prevclust and deal with it accordingly. There are times when this may be better.
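
For the second option, here is a minimal sketch of what that modified function might look like. The name my_cascade_kmeans_null and its default_k argument are hypothetical, and the toy data only stands in for dfList.

```r
# Hypothetical variant of my_cascade_kmeans that tolerates a NULL starting value.
my_cascade_kmeans_null <- function(prevclust, dat, default_k = 2) {
  # First call: no previous result yet, so fall back to a plain number of clusters.
  centers <- if (is.null(prevclust)) default_k else prevclust$centers
  kmeans(dat, centers = centers, nstart = 10)
}

# Toy stand-in for dfList: three windows, each with two well-separated groups.
set.seed(1)
make_window <- function() {
  rbind(matrix(rnorm(20, mean = 0, sd = 0.1), ncol = 2),
        matrix(rnorm(20, mean = 5, sd = 0.1), ncol = 2))
}
toy_list <- list(make_window(), make_window(), make_window())

res <- Reduce(my_cascade_kmeans_null, toy_list, init = NULL, accumulate = TRUE)
res <- Filter(Negate(is.null), res)   # drop the NULL placeholder left by init
length(res)
# [1] 3
```

The trade-off is the one described above: the default k now lives inside the function signature instead of in the Reduce call.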

I prefer option 1, because it places the "default k value" in the original Reduce call and not necessarily buried in the function code. But you may prefer it there, over to you.

For this answer, I'm reducing the initial clusters from 18 to 4 ... anything higher and it fails with Error: empty cluster: try a better set of initial centers, which I'm guessing is due to a truncated sample dataset.
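
If the empty-cluster error also shows up on the full data, one possible workaround is to catch the error and carry the previous result forward, so that one bad window does not abort the whole run. This is an untested assumption, not something verified on the real dfn, and the name safe_cascade_kmeans is hypothetical.

```r
# Hypothetical wrapper: on any kmeans error, keep the previous cluster object instead of stopping.
safe_cascade_kmeans <- function(prevclust, dat) {
  tryCatch(
    kmeans(dat, centers = prevclust$centers, nstart = 10),
    error = function(e) {
      warning("kmeans failed, reusing previous result: ", conditionMessage(e))
      prevclust   # the next window then starts from the last centers that worked
    }
  )
}

# Demonstration with a deliberately impossible request: 3 centers for only 2 points.
prev <- list(centers = matrix(c(0, 1, 2), ncol = 1))
dat  <- matrix(c(0, 1), ncol = 1)
out  <- suppressWarnings(safe_cascade_kmeans(prev, dat))
identical(out, prev)
# [1] TRUE
```

A repeated entry in the accumulated result then marks a window that failed, so that window is better treated as missing than as a genuine clustering.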

TL;DR

my_cascade_kmeans <- function(prevclust, dat) {
  kmeans(dat, centers = prevclust$centers, nstart = 10)
}
clusters <- Reduce(my_cascade_kmeans, dfList, init = list(centers=4), accumulate = TRUE)

length(clusters)
# [1] 26

You might balk at this, but this is what we told it to do: "initialize the vector by prepending list(centers=4) to the beginning, and then accumulate the results", so we should not be surprised that it is one-longer than what we started with.

clusters[[1]]
# $centers
# [1] 4

That confirms it. Clean it up with

clusters <- clusters[-1]

Now each element of clusters is the return value from kmeans(...), started from the previous run's centers:

clusters[[1]]
# K-means clustering with 4 clusters of sizes 2, 4, 3, 3
# Cluster means:
#         [,1]
# 1  0.9759631
# 2  0.1646323
# 3 -0.4514542
# 4 -1.0172681
# Clustering vector:
# 2015-01-30 2015-02-27 2015-03-31 2015-04-30 2015-05-29 2015-06-30 2015-07-31 2015-08-31 2015-09-30 2015-10-30 2015-11-30 
#          4          1          4          3          2          2          2          3          4          1          3 
# 2015-12-31 
#          2 
# Within cluster sum of squares by cluster:
# [1] 0.16980147 0.12635651 0.02552839 0.02940412
#  (between_SS / total_SS =  94.0 %)
# Available components:
# [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"        
# [9] "ifault"      

Icing on the cake: this works just as well with 2 or 2000 datasets.
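
To get from the list of cluster objects to the matrix the question asks for, the cluster vectors can be bound column-wise. This is a minimal sketch on made-up data, where toy, toy_list and toy_clusters are stand-ins for dfn, dfList and clusters; on the real data an identifier column such as ref_data$sec_id could be added with cbind afterwards.

```r
# Made-up data: 20 rows in two well-separated groups, 4 "monthly" columns.
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.1), ncol = 4),
             matrix(rnorm(40, mean = 5, sd = 0.1), ncol = 4))

# Rolling 2-column windows; drop = FALSE keeps each window a matrix.
toy_list <- lapply(1:3, function(i) toy[, i:(i + 1), drop = FALSE])

cascade <- function(prevclust, dat) kmeans(dat, centers = prevclust$centers)
toy_clusters <- Reduce(cascade, toy_list, init = list(centers = 2),
                       accumulate = TRUE)[-1]   # [-1] drops the init placeholder

# One column of cluster numbers per window, one row per observation.
it_mat <- sapply(toy_clusters, function(cl) cl$cluster)
dim(it_mat)
# [1] 20  3
```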

Upvotes: 1
