Reputation: 143
I need to calculate the Jaccard distance between every pair of rows in a data frame. The result needs to be a matrix/data frame that represents the distances.
Like this:
1 2 3 ..
1 0 0.2 1
2 0.2 0 0.4
3 1 0.4 0
.
.
My data:
dput(items[1:10,])
structure(list(Drama = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L), Comedy = c(0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), Crime = c(0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), SciFi = c(1L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L), Kids = c(1L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L), Classic = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L,
0L), Foreign = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L), Thriller = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Action = c(0L, 0L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), Adventure = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), Animation = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), Adult = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), History = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Documentry = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), Nature = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), Horror = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L), Show = c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L), Series = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), BlackWhite = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("Drama", "Comedy", "Crime",
"SciFi", "Kids", "Classic", "Foreign", "Thriller", "Action",
"Adventure", "Animation", "Adult", "History", "Documentry", "Nature",
"Horror", "Show", "Series", "BlackWhite"), row.names = c(NA,
10L), class = "data.frame")
My code:
Jaccard_dist <- dist(items, items, method = "Jaccard")
write.csv(Jaccard_dist,'Jaccard_dist.csv')
Do you know of a way to do this without using two for-loops?
Upvotes: 5
Views: 11173
Reputation: 63
It seems that the "binary" method of R's native dist() function does in fact compute the Jaccard distance, without naming it as such. The description fits ("The vectors are regarded as binary bits, so non-zero elements are ‘on’ and zero elements are ‘off’. The distance is the proportion of bits in which only one is on amongst those in which at least one is on."), and so does the output, which is exactly the same as in the accepted answer:
> dist(data, method = "binary")
1 2 3 4 5 6 7 8 9
2 1.0000000
3 1.0000000 0.6666667
4 0.8000000 0.8000000 1.0000000
5 1.0000000 0.8000000 0.6666667 0.8000000
6 1.0000000 1.0000000 1.0000000 0.6666667 0.6666667
7 1.0000000 1.0000000 1.0000000 0.7500000 0.7500000 0.5000000
8 0.5000000 1.0000000 1.0000000 0.5000000 0.8000000 0.6666667 0.7500000
9 1.0000000 1.0000000 1.0000000 0.6666667 0.6666667 0.0000000 0.5000000 0.6666667
10 1.0000000 1.0000000 1.0000000 0.7500000 0.7500000 0.5000000 0.6666667 0.7500000 0.5000000
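To check this claim by hand, and to get the full square matrix the question asks for, something like the following should work. It is only a sketch: the small `items` data frame below is a made-up toy subset (not the asker's data), and `jaccard_pair` is a hypothetical helper written for this comparison, not a function from any package.

```r
# Toy 0/1 data frame (illustrative only, not the data from the question)
items <- data.frame(
  Comedy  = c(0L, 1L, 0L, 1L),
  Kids    = c(1L, 0L, 0L, 0L),
  Classic = c(1L, 0L, 0L, 1L),
  Action  = c(0L, 0L, 0L, 1L),
  Show    = c(0L, 1L, 1L, 0L)
)

# Base R: the "binary" method treats non-zero entries as "on"
d <- dist(items, method = "binary")

# Hypothetical helper: Jaccard distance between two 0/1 vectors,
# 1 - |intersection| / |union|
jaccard_pair <- function(a, b) {
  a <- a != 0
  b <- b != 0
  1 - sum(a & b) / sum(a | b)
}

x <- as.matrix(items)
by_hand <- jaccard_pair(x[1, ], x[4, ])

# dist() returns only the lower triangle; as.matrix() expands it to
# the full symmetric matrix with zeros on the diagonal
m <- as.matrix(d)
print(m)

# The full matrix can then be written out, e.g.:
# write.csv(m, "Jaccard_dist.csv")
```

Rows 1 and 4 share one "on" column out of four that are "on" in either row, so both `by_hand` and `m[1, 4]` come out as 0.75. One caveat: as far as I know, `dist()` reports 0 for two all-zero rows, whereas the hand-written formula gives NaN in that case.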
Upvotes: 5
Reputation: 2496
Not sure why you need two for loops.
You can try the proxy package, which adds a "Jaccard" method to dist(). With your data frame in dft, use:
proxy::dist(dft, by_rows = TRUE, method = "Jaccard")
This returns:
#  1 2 3 4 5 6 7 8 9
#2 1.0000000
#3 1.0000000 0.6666667
#4 0.8000000 0.8000000 1.0000000
#5 1.0000000 0.8000000 0.6666667 0.8000000
#6 1.0000000 1.0000000 1.0000000 0.6666667 0.6666667
#7 1.0000000 1.0000000 1.0000000 0.7500000 0.7500000 0.5000000
#8 0.5000000 1.0000000 1.0000000 0.5000000 0.8000000 0.6666667 0.7500000
#9 1.0000000 1.0000000 1.0000000 0.6666667 0.6666667 0.0000000 0.5000000 0.6666667
#10 1.0000000 1.0000000 1.0000000 0.7500000 0.7500000 0.5000000 0.6666667 0.7500000 0.5000000
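If installing a package is not an option, the same distances can also be computed without any explicit loops using matrix algebra in base R. This is a rough sketch rather than what proxy does internally: the `items` data frame is a made-up toy example, and the variable names (`inter`, `sizes`, `uni`, `jd`) are just illustrative.

```r
# Toy 0/1 data frame (illustrative only, not the data from the question)
items <- data.frame(
  Comedy  = c(0L, 1L, 0L, 1L),
  Kids    = c(1L, 0L, 0L, 0L),
  Classic = c(1L, 0L, 0L, 1L),
  Action  = c(0L, 0L, 0L, 1L),
  Show    = c(0L, 1L, 1L, 0L)
)

# Numeric 0/1 matrix; any non-zero entry counts as "on"
x <- (as.matrix(items) != 0) * 1

# inter[i, j] = number of columns that are "on" in both row i and row j
inter <- tcrossprod(x)

# |A union B| = |A| + |B| - |A intersect B|
sizes <- rowSums(x)
uni <- outer(sizes, sizes, "+") - inter

# Jaccard distance = 1 - intersection / union
# (a pair of all-zero rows would give 0/0 = NaN here)
jd <- 1 - inter / uni

print(jd)
```

The result is already a full symmetric matrix, so it can be passed straight to `write.csv()`. On this toy data it agrees with `as.matrix(dist(items, method = "binary"))`.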
Upvotes: 5