Reputation: 25
Reproducible data:
Data <- data.frame(
X = sample(c(0,1), 10, replace = TRUE),
Y = sample(c(0,1), 10, replace = TRUE),
Z = sample(c(0,1), 10, replace = TRUE)
)
Matrix_from_Data <- data.matrix(Data)
str(Matrix_from_Data)
 num [1:10, 1:3] 1 0 0 1 0 1 0 1 1 1 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:3] "X" "Y" "Z"
The question: I have a data frame of binary, symmetric variables (larger than the example), and I'd like to do some hierarchical clustering, which I've never tried before. There are no missing or NA values.
I convert the data frame into a matrix before running the daisy function from the 'cluster' package to get the dissimilarity matrix. I'd like to explore the options for calculating different dissimilarity metrics, but I'm running into a warning (not an error):
library(cluster)
Dissim_Euc_Matrix_from_Data <- daisy(Matrix_from_Data, metric = "euclidean", type = list(symm =c(1:ncol(Matrix_from_Data))))
Warning message:
In daisy(Matrix_from_Data, metric = "euclidean", type = list(symm = c(1:ncol(Matrix_from_Data)))) :
  with mixed variables, metric "gower" is used automatically
...which seems weird to me, since "Matrix_from_Data" is all numeric variables, not mixed variables. Gower might be a fine metric, but I'd like to see how the others impact the clustering. What am I missing?
Upvotes: 0
Views: 1304
Reputation: 338
Great question.
First, that message is a Warning and not an Error. I'm not personally familiar with daisy, but my uninformed guess is that the function emits this warning whenever it is called this way, without checking whether the warning is actually relevant to your data.
Regardless of why that warning appears, one simple way to compare the clusterings produced by several different distance measures in hierarchical clustering is to plot the dendrograms. For simplicity, let's compare the "euclidean" and "binary" distance metrics built into dist. You can use ?dist to read up on what the "binary" distance means here.
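As a quick aside before the comparison, here is a minimal sketch of what the two metrics return for a single pair of rows. The vectors `a` and `b` are made up for illustration. Per ?dist, the "binary" distance is the proportion of positions where only one of the two vectors is "on", among the positions where at least one is "on".

```r
# Two hypothetical binary observations (not taken from the data above)
a <- c(1, 0, 1)
b <- c(1, 1, 0)
pair <- rbind(a, b)

# "binary": at least one vector is on in all 3 positions,
# and exactly one is on in positions 2 and 3, so the distance is 2/3
d_bin <- dist(pair, method = "binary")

# "euclidean": sqrt((1-1)^2 + (0-1)^2 + (1-0)^2) = sqrt(2)
d_euc <- dist(pair, method = "euclidean")

print(d_bin)
print(d_euc)
```

Note that the "binary" metric ignores positions where both vectors are 0, so it treats the variables as asymmetric, which may or may not be what you want for symmetric binary variables.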
# When generating random data, always set a seed if you want your data to be reproducible
set.seed(1)
Data <- data.frame(
X = sample(c(0,1), 10, replace = TRUE),
Y = sample(c(0,1), 10, replace = TRUE),
Z = sample(c(0,1), 10, replace = TRUE)
)
# Create distance matrices
mat_euc <- dist(Data, method="euclidean")
mat_bin <- dist(Data, method="binary")
# Plot the dendrograms side by side
par(mfrow=c(1,2))
plot(hclust(mat_euc))
plot(hclust(mat_bin))
I generally read dendrograms from the bottom up, since points that merge lower on the vertical axis are more similar (i.e. less distant) to one another than points that merge higher on the vertical axis.
We can pick up a few things from these plots:
Also remember that there are different methods of hierarchical clustering (e.g. complete linkage and single linkage), and you can use this same approach to compare the differences between methods as well. See ?hclust for a complete list of methods provided by hclust.
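A rough sketch of that linkage comparison, reusing the same toy data. The choice of three linkage methods and of k = 2 groups is arbitrary and only for illustration.

```r
# Same toy data as above; the seed only makes the example reproducible
set.seed(1)
Data <- data.frame(
  X = sample(c(0, 1), 10, replace = TRUE),
  Y = sample(c(0, 1), 10, replace = TRUE),
  Z = sample(c(0, 1), 10, replace = TRUE)
)

d_bin <- dist(Data, method = "binary")

# Fit one tree per linkage method (all are valid `method` values for hclust)
linkages <- c("complete", "single", "average")
fits <- lapply(linkages, function(m) hclust(d_bin, method = m))
names(fits) <- linkages

# Plot the dendrograms side by side, one panel per linkage
par(mfrow = c(1, length(fits)))
for (m in linkages) plot(fits[[m]], main = m)

# Cut each tree into k groups and cross-tabulate two of the results
k <- 2
groups <- sapply(fits, cutree, k = k)
print(table(complete = groups[, "complete"], single = groups[, "single"]))
```

The cross-tabulation gives a quick numeric check alongside the plots: if the two linkages agree, each row of the table has a single non-zero cell.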
Hope that helps!
Upvotes: 2