LLL

Reputation: 743

Hierarchical clustering with Gower distance - hclust() and philentropy::distance()

I've got a mixed data set (categorical and continuous variables) and I'd like to do hierarchical clustering using Gower distance.

I based my code on an example from https://www.r-bloggers.com/hierarchical-clustering-in-r-2/, which uses base R dist() for Euclidean distance. Since dist() doesn't compute Gower distance, I tried philentropy::distance() instead, but it doesn't work.

Thanks for any help!

# Data
data("mtcars")
mtcars$cyl <- as.factor(mtcars$cyl)

# Hierarchical clustering with Euclidean distance - works 
clusters <- hclust(dist(mtcars[, 1:2]))
plot(clusters)

# Hierarchical clustering with Gower distance - doesn't work
library(philentropy)
clusters <- hclust(distance(mtcars[, 1:2], method = "gower"))
plot(clusters)

Upvotes: 0

Views: 5261

Answers (4)

HajkD

Reputation: 1

Many thanks for this great question and thanks to all of you who provided excellent answers.

Just to resolve the issue for future readers:

# import example data
data("mtcars")
# store example subset with correct data type 
mtcars_subset <- tibble::tibble(mpg = as.numeric(as.vector(mtcars$mpg)), 
                                cyl = as.numeric(as.vector(mtcars$cyl)), 
                                disp = as.numeric(as.vector(mtcars$disp)))

# transpose the data.frame to conform to the philentropy input format
mtcars_subset <- t(mtcars_subset)

# cluster
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower")))
plot(clusters)

# With the development version on GitHub you can also specify 'use.row.names = TRUE'
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower",
                                                 use.row.names = TRUE)))
plot(clusters)

As you can see, clustering works perfectly fine now.

The problem is that in the example dataset the column cyl stores factor values rather than the double values that philentropy::distance() requires. Since the underlying code is written in Rcpp, non-conforming data types will cause problems. As Esther correctly noted, I will implement a better type-safety check in future versions of the package.

head(tibble::as_tibble(mtcars))

# A tibble: 6 x 11
    mpg cyl    disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21   6       160   110  3.9   2.62  16.5     0     1     4     4
2  21   6       160   110  3.9   2.88  17.0     0     1     4     4
3  22.8 4       108    93  3.85  2.32  18.6     1     1     4     1
4  21.4 6       258   110  3.08  3.22  19.4     1     0     3     1
5  18.7 8       360   175  3.15  3.44  17.0     0     0     3     2
6  18.1 6       225   105  2.76  3.46  20.2     1     0     3     1

To overcome this limitation, I stored the columns of interest from the mtcars dataset in a separate data.frame/tibble and converted each column to double values, e.g. via as.numeric(as.vector(mtcars$mpg)).

The resulting subset data.frame now stores only double values as required.

mtcars_subset

# A tibble: 32 x 3
     mpg   cyl  disp
   <dbl> <dbl> <dbl>
 1  21       6  160 
 2  21       6  160 
 3  22.8     4  108 
 4  21.4     6  258 
 5  18.7     8  360 
 6  18.1     6  225 
 7  14.3     8  360 
 8  24.4     4  147.
 9  22.8     4  141.
10  19.2     6  168.
# … with 22 more rows
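
The factor-to-double conversion used above can be checked in base R. The following is a minimal sketch (the names cars, codes and labels are only illustrative) showing why the factor is first turned into a plain vector: calling as.numeric() directly on a factor returns the internal level codes, not the cylinder counts.

```r
data("mtcars")
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# Which of the first three columns are stored as doubles?
# A factor is stored as integer codes, so cyl is not.
col_is_double <- sapply(cars[, 1:3], is.double)

# as.numeric() on a factor yields the level codes (1, 2, 3) ...
codes <- as.numeric(cars$cyl)

# ... whereas going through character first recovers the labels (4, 6, 8)
labels <- as.numeric(as.character(cars$cyl))
```

Note that after this conversion cyl is treated as a continuous variable, so the resulting distance is no longer a mixed-type Gower distance in the sense of the original question.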

Please also note that if you pass only 2 input vectors to philentropy::distance(), only a single distance value is returned, and hclust() cannot compute any clusters from one value. Hence, I added a third column, disp, to enable visualization of the clusters.

I hope this helps.

Upvotes: 0

user4117783

Reputation:

You can do it pretty efficiently with the gower package:

library(gower)

d <- sapply(seq_len(nrow(mtcars)), function(i) gower_dist(mtcars[i, ], mtcars))
d <- as.dist(d)
h <- hclust(d)
plot(h)

Upvotes: 0

Juan Abasolo

Reputation: 13

LLL, sorry, my English is not good enough to explain this properly, so this is only an attempt. But the code is good ;-)

library(philentropy)
clusters <- hclust(as.dist(distance(mtcars[, 1:2], method = "gower")))
plot(clusters)
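
The difference from the code in the question is the as.dist() wrapper. As a minimal base R sketch of why it matters (the matrix m below merely stands in for any function that returns a plain distance matrix, and is not produced by philentropy): hclust() expects an object of class "dist" and rejects an ordinary matrix.

```r
# A small symmetric distance matrix standing in for the output of a
# function that returns a plain matrix rather than a "dist" object
m <- as.matrix(dist(mtcars[1:5, c("mpg", "disp")]))

# hclust() looks for the "Size" attribute of a "dist" object,
# so passing a plain matrix raises an error
res <- try(hclust(m), silent = TRUE)
failed <- inherits(res, "try-error")

# as.dist() keeps the lower triangle and adds the class hclust() needs
hc <- hclust(as.dist(m))
```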

Good luck

Upvotes: 0

Esther

Reputation: 1115

The error is in the distance function itself.

I don't know if it's intentional or not, but the current implementation of philentropy::distance with the "gower" method cannot handle mixed data types. Its first operation is to transpose the data.frame, which produces a character matrix, and that matrix then throws the typing error when it is passed to the DistMatrixWithoutUnit function.

You might try using the daisy function from cluster instead.

library(cluster)

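# keep mpg (numeric) and cyl (categorical)
x <- mtcars[, 1:2]
x$cyl <- as.factor(x$cyl)

# daisy() returns a dissimilarity object that hclust() accepts directly;
# named d to avoid masking base dist()
d <- daisy(x, metric = "gower")
cls <- hclust(d)
plot(cls)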
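
To make clear what the Gower distance does with mixed columns, here is a hand-rolled base R sketch. The helper gower_matrix() is hypothetical and not part of any package. It assumes complete data (no NAs), scales numeric differences by the column range, and scores factor columns as 0 for a match and 1 otherwise. Under these assumptions it should agree with daisy(x, metric = "gower"), but treat it as an illustration rather than a replacement.

```r
# Hypothetical helper: Gower distance for numeric and factor columns
gower_matrix <- function(df) {
  n <- nrow(df)
  total <- matrix(0, n, n)
  for (col in names(df)) {
    x <- df[[col]]
    if (is.numeric(x)) {
      # numeric: absolute difference scaled by the column range
      rng <- diff(range(x))
      part <- if (rng > 0) abs(outer(x, x, "-")) / rng else matrix(0, n, n)
    } else {
      # categorical: 0 if the two values match, 1 otherwise
      part <- outer(as.character(x), as.character(x), "!=") * 1
    }
    total <- total + part
  }
  # average the per-column contributions
  d <- total / ncol(df)
  dimnames(d) <- list(rownames(df), rownames(df))
  as.dist(d)
}

x <- mtcars[, 1:2]
x$cyl <- as.factor(x$cyl)

d <- gower_matrix(x)
cls <- hclust(d)
```

Calling plot(cls) then draws the dendrogram, as in the daisy() example.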

EDIT: For future reference, it seems that philentropy will be updated to include better type handling in the next version. From the vignette:

In future versions of philentropy I will optimize the distance() function so that internal checks for data type correctness and correct input data will take less termination time than the base dist() function.

Upvotes: 2
