broccoli

Reputation: 4836

R data.table and kmeans clustering

I'm not even sure if this is possible with data.table. I have a data set that looks like the following. It's a data frame, but I later convert it to a data.table called x.

id xcord ycord
a  2 3
a  3 4
a  3 3
a  9 10
a  8 9
b  1 3
b  1 2
b  8 19
b  7 21

I want to identify two clusters per id, and that is proving to be difficult. I tried the following:

x = x[, list(x1 = kmeans(xcord, centers=2)$centers, y1 = kmeans(ycord, centers=2)$centers), by = id]

but it gave the following error message:

All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.
Calls: [ -> [.data.table
Execution halted

I'm expecting a data table with entries that can be "treated" as a list of centers. Is this even possible?

Upvotes: 1

Views: 2315

Answers (1)

mnel

Reputation: 115382

The centers element is a matrix (it will contain as many columns as there are columns in the x argument to kmeans).
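A minimal sketch to check this, using only base R. The vector below is just the xcord values for id "a" from the question, and the seed is an assumption added for repeatability (kmeans picks random starting centers).

```r
# Toy input: the xcord values for id "a" in the question
xcord <- c(2, 3, 3, 9, 8)

set.seed(1)  # assumption: fixed seed so repeated runs behave the same
km <- kmeans(xcord, centers = 2)

# centers is a matrix even for a single input column:
# one row per cluster, one column per column of the input
is.matrix(km$centers)
dim(km$centers)
```

Even with one input column, each kmeans call hands back a 2 x 1 matrix rather than a plain vector.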

If you want to find the clusters considering xcord and ycord together in the same clustering run, you will need to pass a matrix to kmeans. You will then have to coerce the result back to a data.table afterwards; this keeps the column names sensible.

# e.g. cluster on both coordinates at once, separately for each id
fx <- x[, data.table(kmeans(cbind(xcord, ycord), centers = 2)$centers), by = id]
fx
#    id    xcord     ycord
# 1:  a 2.666667  3.333333
# 2:  a 8.500000  9.500000
# 3:  b 7.500000 20.000000
# 4:  b 1.000000  2.500000
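For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds x from the data in the question. The set.seed call and the final ordering step are my additions, not part of the original answer: kmeans starts from random centers, so the order of the two clusters within each id can differ between runs.

```r
library(data.table)

# Rebuild the example data from the question
x <- data.table(
  id    = c("a", "a", "a", "a", "a", "b", "b", "b", "b"),
  xcord = c(2, 3, 3, 9, 8, 1, 1, 8, 7),
  ycord = c(3, 4, 3, 10, 9, 3, 2, 19, 21)
)

set.seed(42)  # assumption: fixed seed, only for repeatability

# One kmeans run per id on the two-column matrix of coordinates;
# the centers matrix is coerced back to a data.table, keeping column names
fx <- x[, data.table(kmeans(cbind(xcord, ycord), centers = 2)$centers), by = id]

# Optional: sort so the cluster order within each id is stable across runs
fx <- fx[order(id, xcord)]
fx
```

The result has one row per cluster per id, so it can be filtered or joined on id like any other data.table.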

Upvotes: 4
