Create somthing like distance matrix

Question

I have a 3-column matrix which contains IDs and score between them.

> df1
           ID_1       ID_2    Score
  1: 1000020760 1000534822 1.70e-01
  2: 1000020760 1000535109 1.10e+00
  3: 1000020760 1000532510 3.20e+00
  4: 1000020760 1000535228 3.70e+00
  5: 1000035849 1000532512 2.49e-60
 ---                               
600: 1000773219 1000483302 2.40e+00
601: 1000773219 1000734829 8.50e+00
602: 1000773219 1000535109 1.00e+01
603: 1000773304 1000730133 5.80e+00
604: 1000773304 1000752994 9.20e+00

Each value from the first column has several corresponding values from the second (including the lack of values). The value from the third column does not commute over the first two. For example:

> df1[4,]
         ID_1       ID_2 Score
1: 1000020760 1000535228   3.7

> df1[294,]
         ID_1       ID_2 Score
1: 1000535228 1000020760   3.9

Identical IDs, but standing in different places, have different scores.

I would like to create a matrix m * m (m is the max of numbers of unique IDs in df1$ID_1 and df1$ID_2) which elements are the corresponding values from the third column of the input table. The rows and columns of the output matrix must consist of IDs, and at the intersection -- the corresponding values from the Score column of the input matrix. The values on the main diagonal must equal to 0, and the missing values must equal toNA.

For exmaple, if i have input like this

> tmp1
         ID_1       ID_2 Score
1: 1000020760 1000534822  0.17
2: 1000020760 1000535109  1.10
3: 1000020760 1000532510  3.20
4: 1000020760 1000535228  3.70
5: 1000535228 1000483302  2.50
6: 1000535228 1000020760  3.90
7: 1000535228 1000414853  5.10
8: 1000534822 1000020760  0.06

i'd like to see output like this

> tmp_mat
           1000534822 1000535109 1000532510 1000535228 1000483302 1000020760 1000414853
1000534822       0.00         NA         NA         NA         NA       0.06         NA
1000535109         NA        0.0         NA         NA         NA         NA         NA
1000532510         NA         NA        0.0         NA         NA         NA         NA
1000535228         NA         NA         NA        0.0        2.5       3.90        5.1
1000483302         NA         NA         NA         NA        0.0         NA         NA
1000020760       0.17        1.1        3.2        3.7         NA       0.00         NA
1000414853         NA         NA         NA         NA         NA         NA        0.0

WaltS · Accepted Answer

A vector containing the unique values of ID_1 and ID_2 is expanded to get all combinations. Scores are then included where available and the diagonal elements zeroed. Finally the results can be reformatted as a wide tibble. Using the tidyverse library, the code would be

library(tidyverse)

ID_table <- unique(x = c(df1$ID_1, df1$ID_2)) %>% sort() %>% expand_grid(ID_1 = ., ID_2 = .) %>%
            left_join(df1, by = c("ID_1", "ID_2")) %>% mutate(Score = ifelse(ID_1 == ID_2, 0, Score)) %>%
            pivot_wider(names_from = ID_2, values_from = Score)

with the result

# A tibble: 7 x 8
        ID_1 `1000020760` `1000414853` `1000483302` `1000532510` `1000534822` `1000535109` `1000535228`
                                                               
1 1000020760         0            NA           NA            3.2         0.17          1.1          3.7
2 1000414853        NA             0           NA           NA          NA            NA           NA  
3 1000483302        NA            NA            0           NA          NA            NA           NA  
4 1000532510        NA            NA           NA            0          NA            NA           NA  
5 1000534822         0.06         NA           NA           NA           0            NA           NA  
6 1000535109        NA            NA           NA           NA          NA             0           NA  
7 1000535228         3.9           5.1          2.5         NA          NA            NA            0

Create somthing like distance matrix

Answers (1)

Related Questions