René Nyffenegger

Reputation: 40499

sorting the output of dist()

I have a matrix m

m <- matrix ( 
  c( 2, 1, 8, 5,
     7, 6, 3, 4,
     9, 3, 2, 8,
     1, 3, 7, 4),
  nrow  = 4,
  ncol  = 4,
  byrow = TRUE)

rownames(m) <- c('A', 'B', 'C', 'D')

Now I'd like to order the pairs of rows of m by their pairwise distance, so I use dist()

dist_m <- dist(m)

When printed, dist_m is

          A         B         C
B  8.717798
C  9.899495  5.477226
D  2.645751  7.810250 10.246951

Since I want it ordered, I try sort(dist_m) which prints

[1]  2.645751  5.477226  7.810250  8.717798  9.899495 10.246951

This is almost what I want, but I'd be happier if it also printed the names of the two rows that each distance belongs to, something like

 2.645751  A  D
 5.477226  B  C
 7.810250  B  D
 8.717798  A  B
 9.899495  A  C
10.246951  C  D

This is certainly possible, but I have no idea how to achieve it.

Upvotes: 4

Views: 1861

Answers (3)

Marco Pessoa

Reputation: 11

If you do have distance values = 0 in your dist object

I started with the solution posted by akrun to sort the output of a dist object, but in my case some of the distances are 0. To avoid discarding these in the subset step, I first set the upper triangle to NA and then the diagonal to NA as well, using diag (I had actually obtained a symmetric matrix from another program). Finally, instead of subset, I used melt, na.omit, and order:

library(reshape2)

#create matrix
 m <- matrix ( 
 c( 2, 1, 8, 5,
    2, 1, 8, 5,
    9, 3, 2, 8,
    1, 3, 7, 4),
    nrow  = 4,
    ncol  = 4,
    byrow = TRUE)

rownames(m) <- c('A', 'B', 'C', 'D')

# use dist
dist_m <- dist(m)
dist_m 

# A and B are identical
             A         B         C
B  0.000000                    
C  9.899495  9.899495          
D  2.645751  2.645751 10.246951

m1 <- as.matrix(dist_m)
m1[upper.tri(m1)] <- NA
diag(m1) <- NA
m2 <- melt(m1)
na.omit(m2[order(m2$value),3:1])

As a result, the pairwise distance value between A and B is preserved:

       value Var2 Var1
2   0.000000    A    B
4   2.645751    A    D
8   2.645751    B    D
3   9.899495    A    C
7   9.899495    B    C
12 10.246951    C    D
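
The same idea can be sketched in base R without reshape2, assuming the row names of the distance matrix are the labels you want. Selecting positions with lower.tri() instead of testing the values means zero distances are kept. The names idx and res are only illustrative:

```r
# Same example matrix as above: rows A and B are identical
m <- matrix(c(2, 1, 8, 5,
              2, 1, 8, 5,
              9, 3, 2, 8,
              1, 3, 7, 4),
            nrow = 4, byrow = TRUE)
rownames(m) <- c('A', 'B', 'C', 'D')

m1 <- as.matrix(dist(m))

# Positions (row, col) of the strict lower triangle, chosen by position,
# not by value, so a distance of 0 is not dropped
idx <- which(lower.tri(m1), arr.ind = TRUE)

res <- data.frame(value = m1[idx],
                  Var2  = colnames(m1)[idx[, "col"]],
                  Var1  = rownames(m1)[idx[, "row"]])

# Sort the pairs by distance
res <- res[order(res$value), ]
res
```

Because the filter is positional, this should behave the same whether or not any distances are 0.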

Upvotes: 1

ARobertson

Reputation: 2897

Using base R:

dm <- as.matrix(dist_m)
df <- data.frame(data = c(dm),
                 column = c(col(dm)),
                 row = c(row(dm)))

# get only one triangle
df <- df[df$row > df$column, ]

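# replace the numeric indices with the row names
# (do this before ordering so the labels appear in the sorted output)
df$row <- rownames(dm)[df$row]
df$column <- rownames(dm)[df$column]

# put in order
df[order(df$data), ]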

Upvotes: 0

akrun

Reputation: 887088

One option would be to convert the dist object to a matrix, set the upper-triangle values to 0, melt, keep only the non-zero values with subset, and then order by the 'value' column.

m1 <- as.matrix(dist_m)
m1[upper.tri(m1)] <- 0
library(reshape2)
m2 <- subset(melt(m1), value!=0)
m2[order(m2$value),3:1]
#         value Var2 Var1
#4   2.645751    A    D
#7   5.477226    B    C
#8   7.810250    B    D
#2   8.717798    A    B
#3   9.899495    A    C
#12 10.246951    C    D

Or a base R option suggested by @David Arenburg, once 'm1' has been created as above (it returns the row and column indices rather than the names):

 m2 <- cbind(which(m1!=0, arr.ind=TRUE), value= m1[m1!=0])
 m2[order(m2[,'value']),]
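
As a further sketch, not part of the suggestions above, the matrix conversion can be skipped altogether. This relies on a dist object storing the lower triangle column by column, which is the same order in which combn() enumerates pairs of labels. The names pairs and res are illustrative, and the approach assumes the dist object carries a "Labels" attribute (it does when the input matrix has row names):

```r
# Matrix from the question
m <- matrix(c(2, 1, 8, 5,
              7, 6, 3, 4,
              9, 3, 2, 8,
              1, 3, 7, 4),
            nrow = 4, byrow = TRUE)
rownames(m) <- c('A', 'B', 'C', 'D')

dist_m <- dist(m)

# All label pairs, in the same order as the values stored in dist_m:
# (A,B), (A,C), (A,D), (B,C), (B,D), (C,D)
pairs <- t(combn(attr(dist_m, "Labels"), 2))

res <- data.frame(value = as.vector(dist_m),
                  row1  = pairs[, 1],
                  row2  = pairs[, 2])

# Sort the pairs by distance
res <- res[order(res$value), ]
res
```

Since nothing is filtered by value here, pairs with a distance of 0 would be kept as well.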

Upvotes: 4
