Reputation: 743
I've got a mixed data set (categorical and continuous variables) and I'd like to do hierarchical clustering using Gower distance.
I based my code on an example from https://www.r-bloggers.com/hierarchical-clustering-in-r-2/, which uses base R dist() for Euclidean distance. Since dist() doesn't compute Gower distance, I tried using philentropy::distance() to compute it, but it doesn't work.
Thanks for any help!
# Data
data("mtcars")
mtcars$cyl <- as.factor(mtcars$cyl)
# Hierarchical clustering with Euclidean distance - works
clusters <- hclust(dist(mtcars[, 1:2]))
plot(clusters)
# Hierarchical clustering with Gower distance - doesn't work
library(philentropy)
clusters <- hclust(distance(mtcars[, 1:2], method = "gower"))
plot(clusters)
Upvotes: 0
Views: 5261
Reputation: 1
Many thanks for this great question and thanks to all of you who provided excellent answers.
Just to resolve the issue for future readers:
# import example data
data("mtcars")
# store example subset with correct data type
mtcars_subset <- tibble::tibble(mpg  = as.numeric(as.vector(mtcars$mpg)),
                                cyl  = as.numeric(as.vector(mtcars$cyl)),
                                disp = as.numeric(as.vector(mtcars$disp)))
# transpose the data.frame to conform to the philentropy input format
mtcars_subset <- t(mtcars_subset)
# cluster
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower")))
plot(clusters)
# When using the development version from GitHub, you can also specify 'use.row.names = TRUE'
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower",
                                                 use.row.names = TRUE)))
plot(clusters)
As you can see, clustering works perfectly fine now.
The problem is that in the example dataset the column cyl stores factor values and not double values, as required by the philentropy::distance() function. Since the underlying code is written in Rcpp, nonconforming data types will cause problems. As Esther correctly noted, I will implement a better way to check type safety in future versions of the package.
head(tibble::as_tibble(mtcars))
# A tibble: 6 x 11
    mpg cyl    disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21   6       160   110  3.9   2.62  16.5     0     1     4     4
2  21   6       160   110  3.9   2.88  17.0     0     1     4     4
3  22.8 4       108    93  3.85  2.32  18.6     1     1     4     1
4  21.4 6       258   110  3.08  3.22  19.4     1     0     3     1
5  18.7 8       360   175  3.15  3.44  17.0     0     0     3     2
6  18.1 6       225   105  2.76  3.46  20.2     1     0     3     1
To overcome this limitation, I stored the columns of interest from the mtcars dataset in a separate data.frame/tibble and converted each column to double values with calls such as as.numeric(as.vector(mtcars$mpg)).
The resulting subset data.frame now stores only double values, as required.
mtcars_subset
# A tibble: 32 x 3
     mpg   cyl  disp
   <dbl> <dbl> <dbl>
 1  21       6  160
 2  21       6  160
 3  22.8     4  108
 4  21.4     6  258
 5  18.7     8  360
 6  18.1     6  225
 7  14.3     8  360
 8  24.4     4  147.
 9  22.8     4  141.
10  19.2     6  168.
# … with 22 more rows
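The as.vector() step in the conversion matters because calling as.numeric() directly on a factor returns the internal level codes rather than the labels. A minimal base R sketch of the difference (the toy vector cyl_factor is illustrative and not part of the original answer):

```r
# Toy factor mimicking mtcars$cyl after as.factor() (hypothetical example data)
cyl_factor <- factor(c(6, 6, 4, 8))

# Direct conversion returns the integer level codes, not the cylinder counts
codes <- as.numeric(cyl_factor)

# Converting to a plain vector first recovers the original values
values <- as.numeric(as.vector(cyl_factor))

print(codes)   # 2 2 1 3
print(values)  # 6 6 4 8
```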
Please also note that if you provide the philentropy::distance() function with only 2 input vectors, only one distance value will be returned, and the hclust() function won't be able to compute any clusters from a single value. Hence, I added a third column, disp, to enable visualization of the clusters.
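To see why a single distance value is not enough, here is a small base R sketch. It assumes, as described above, that the distance step returned just one number (the value 0.5 and the 3 x 3 matrix are made up for illustration). Wrapping that number with as.dist() yields an object of size 1, which hclust() rejects.

```r
# A single made-up distance value, standing in for the result of comparing
# only two input vectors
single_value <- 0.5

# as.dist() treats it as a 1 x 1 matrix, i.e. one object and no pairs
d_single <- as.dist(matrix(single_value))
size_single <- attr(d_single, "Size")

# hclust() needs at least two objects, so this call is expected to fail;
# tryCatch() captures the error instead of stopping the script
result <- tryCatch(hclust(d_single), error = function(e) e)

# With three objects there are three pairwise distances and clustering works
d_three <- as.dist(matrix(c(0.0, 0.2, 0.7,
                            0.2, 0.0, 0.4,
                            0.7, 0.4, 0.0), nrow = 3))
clusters_three <- hclust(d_three)
```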
I hope this helps.
Upvotes: 0
Reputation:
You can do it pretty efficiently with the gower package:
library(gower)
d <- sapply(seq_len(nrow(mtcars)), function(i) gower_dist(mtcars[i, ], mtcars))
d <- as.dist(d)
h <- hclust(d)
plot(h)
Upvotes: 0
Reputation: 13
Sorry, my English is not good, so I can't explain much. This is my attempt, but the code works ;-)
library(philentropy)
clusters <- hclust(
  as.dist(
    distance(mtcars[, 1:2], method = "gower")))
plot(clusters)
Good luck
Upvotes: 0
Reputation: 1115
The error is in the distance function itself.
I don't know if it's intentional or not, but the current implementation of philentropy::distance with the "gower" method cannot handle mixed data types: its first operation is to transpose the data.frame, which produces a character matrix that then throws a type error when passed to the DistMatrixWithoutUnit function.
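The coercion caused by the transpose is easy to reproduce in base R, without philentropy. The sketch below only illustrates that step on two mtcars columns; it does not call the package's internal code.

```r
# Two columns of mtcars, with cyl stored as a factor as in the question
x <- mtcars[, c("mpg", "cyl")]
x$cyl <- as.factor(x$cyl)

# Transposing a data frame goes through as.matrix(); with a factor column
# present, every cell is converted to character
tx <- t(x)
print(typeof(tx))

# With numeric-only columns the transposed matrix stays numeric
tx_numeric <- t(mtcars[, c("mpg", "cyl")])
print(typeof(tx_numeric))
```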
You might try using the daisy function from cluster instead.
library(cluster)
x <- mtcars[,1:2]
x$cyl <- as.factor(x$cyl)
dist <- daisy(x, metric = "gower")
cls <- hclust(dist)
plot(cls)
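For intuition about what daisy() computes here, the sketch below rebuilds the Gower dissimilarity by hand for the same two columns. The numeric column contributes the absolute difference divided by its range, the factor column contributes 0 for a match and 1 for a mismatch, and the two contributions are averaged with equal weights. The names mpg_part, cyl_part and manual are illustrative, and the comparison assumes daisy()'s default equal weights and no missing values.

```r
library(cluster)

x <- mtcars[, 1:2]
x$cyl <- as.factor(x$cyl)

# Numeric part: absolute differences scaled by the range of mpg
mpg_part <- abs(outer(x$mpg, x$mpg, "-")) / diff(range(x$mpg))

# Categorical part: 0 if two cars have the same cyl level, 1 otherwise
cyl_chr <- as.character(x$cyl)
cyl_part <- outer(cyl_chr, cyl_chr, "!=") * 1

# Gower dissimilarity: average of the per-variable contributions
manual <- (mpg_part + cyl_part) / 2

# Compare with daisy(); the two should agree up to floating-point tolerance
from_daisy <- as.matrix(daisy(x, metric = "gower"))
print(all.equal(manual, from_daisy, check.attributes = FALSE))
```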
EDIT: For future reference, it seems that philentropy will be updated to include better type handling in the next version. From the vignette:
In future versions of philentropy I will optimize the distance() function so that internal checks for data type correctness and correct input data will take less termination time than the base dist() function.
Upvotes: 2