Reputation: 49
I am implementing k-means. These are my main data structures:
dt1 is a data.table with the columns {Filename, featureVector, GroupItBelongsTo}:
dt1<- data.table(Filename=files[1:limit],Vector=list(),G=-1)
setkey(dt1,Filename)
featureVector is a list that maps each word to its number of occurrences. I add the occurrence count for each word using this line:
featureVector[[item]] <- emaildt[email==item]$N
A typical excerpt from my console when I print dt1 is:
Filename Vector G
1: 000057219a473629b53d33cfedef590f.txt 1,1,1,1,1,1, 3
2: 00007afb5a5e467a39e517ae87e7fad5.txt 0,0,0,0,0,0, 3
3: 000946d248fdb1d5d05c59a91b00e8f2.txt 0,0,0,0,0,0, 3
4: 000bea8dc6f716a2cac6f25bdbe09073.txt 0,0,0,0,0,0, 3
I now want to compute a new centroid for each group number. That is, I want to sum the first components of all vectors in a group, then the second components, and so on to the end, and then divide each sum by the number of vectors to get the average.
Example: for v1 = [1, 1, 1] and v2 = [2, 2, 2], I would expect the centroid to be c1 = [1.5, 1.5, 1.5].
I tried sapply(dt1[tt]$Vector, mean) (and also sum), but that sums or averages within each vector (row-wise), not across vectors at each n-th component (column-wise) as I would like.
How can I do this?
====Update, answering a question in comments====
> head(dt1)
Filename Vector G
1: 000057219a473629b53d33cfedef590f.txt 1,1,1,1,1,1, 1
2: 00007afb5a5e467a39e517ae87e7fad5.txt 0,0,0,0,0,0, 1
3: 000946d248fdb1d5d05c59a91b00e8f2.txt 0,0,0,0,0,0, 3
4: 000bea8dc6f716a2cac6f25bdbe09073.txt 0,0,0,0,0,0, 4
5: 000fcfac9e0a468a27b5e2ad0f78d842.txt 0,0,0,0,0,0, 1
6: 00166a4964d6c939f8f62280b85e706d.txt 0,0,0,1,0,0, 1
> class(dt1)
[1] "data.table" "data.frame"
>
Typing dt1$Vector gives the following (I only copied a small sample; it has many more words, but they all look the same):
[[1]]
homosexuality articles church people interest
1 1 1 1 1
thread email send warning worth
1 1 1 1 1
And here is the class() output:
> class(dt1$Vector)
[1] "list"
Screenshots when typing:
A<-as.matrix(t(as.data.frame(dt1$Vector)))
Result of class(dt1$Vector[[1]]):
[1] "numeric"
Reputation: 15163
First (the obligatory note): you might consider using the R function kmeans to do your k-means clustering. If you prefer to roll your own, you can easily compute the centroids of a data table as follows. To start, I'll build some random data that looks like yours:
> set.seed(123)
> dt<-data.table(name=LETTERS[1:20],replicate(5,sample(0:4,20,T)),G=sample(3,20,T))
> head(dt)
name V1 V2 V3 V4 V5 G
1: A 1 4 0 3 1 2
2: B 3 3 2 0 3 1
3: C 2 3 2 1 2 2
4: D 4 4 1 1 3 3
5: E 4 3 0 4 0 2
6: F 0 3 0 2 2 3
The centroids can be computed in one line:
> dt[,lapply(.SD[,-1],mean),by=G]
G V1 V2 V3 V4 V5
1: 2 2.375000 2.250000 1.25 2.125000 2.250000
2: 1 2.800000 2.400000 2.40 1.800000 1.400000
3: 3 1.714286 2.428571 1.00 2.142857 1.857143
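As a quick sanity check, here is a minimal sketch that applies the same per-group column mean to the two example vectors from the question. The table name `toy` and its column names are made up for illustration, and the numeric columns are named explicitly so the result does not depend on column positions:

```r
library(data.table)

# Hypothetical toy table: the question's v1 = [1,1,1] and v2 = [2,2,2],
# one component per column, both assigned to group 1
toy <- data.table(name = c("v1", "v2"),
                  V1 = c(1, 2), V2 = c(1, 2), V3 = c(1, 2),
                  G = c(1L, 1L))

# Component-wise mean per group, restricted to the numeric feature columns
centroids <- toy[, lapply(.SD, mean), by = G, .SDcols = c("V1", "V2", "V3")]
print(centroids)
```

Each component of the single centroid comes out as 1.5, which matches what the question expects.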
If you're going to do this, you might want to drop the names from the data table (temporarily), in which case you can just do:
> dt2<-copy(dt)
> dt2$name<-NULL
> dt2[,lapply(.SD,mean),by=G]
G V1 V2 V3 V4 V5
1: 2 2.375000 2.250000 1.25 2.125000 2.250000
2: 1 2.800000 2.400000 2.40 1.800000 1.400000
3: 3 1.714286 2.428571 1.00 2.142857 1.857143
Edit: a better way to do this, suggested by @Roland, is to use .SDcols:
dt[,lapply(.SD,mean),by=G,.SDcols=2:6]
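If you would rather keep the feature vectors in a list column, as in the question, one possible approach is to stack each group's vectors into a matrix and take column means. This is only a sketch: the data below is invented, and it assumes every vector has the same length with the words in the same order.

```r
library(data.table)

# Hypothetical data shaped like the question's dt1: a list column of
# equal-length numeric vectors plus a group id
dt1 <- data.table(
  Filename = c("a.txt", "b.txt", "c.txt", "d.txt"),
  Vector   = list(c(1, 1, 1), c(2, 2, 2), c(0, 4, 2), c(2, 0, 0)),
  G        = c(1L, 1L, 2L, 2L)
)

# Within each group, bind the vectors as matrix rows and average each column;
# as.list() turns the resulting means into one output column per component
centroids <- dt1[, as.list(colMeans(do.call(rbind, Vector))), by = G]
print(centroids)
```

With this toy data, group 1 gets the centroid (1.5, 1.5, 1.5) and group 2 gets (1, 2, 1). If the vectors carry word names, as in the question, those names should carry over as column names.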