Reputation: 65
I have a 3-column matrix which contains IDs and score between them.
> df1
ID_1 ID_2 Score
1: 1000020760 1000534822 1.70e-01
2: 1000020760 1000535109 1.10e+00
3: 1000020760 1000532510 3.20e+00
4: 1000020760 1000535228 3.70e+00
5: 1000035849 1000532512 2.49e-60
---
600: 1000773219 1000483302 2.40e+00
601: 1000773219 1000734829 8.50e+00
602: 1000773219 1000535109 1.00e+01
603: 1000773304 1000730133 5.80e+00
604: 1000773304 1000752994 9.20e+00
Each value from the first column has several corresponding values from the second (including the lack of values). The value from the third column does not commute over the first two. For example:
> df1[4,]
ID_1 ID_2 Score
1: 1000020760 1000535228 3.7
> df1[294,]
ID_1 ID_2 Score
1: 1000535228 1000020760 3.9
Identical IDs, but standing in different places, have different scores.
I would like to create a matrix m * m (m is the max of numbers of unique IDs in df1$ID_1
and df1$ID_2
) which elements are the corresponding values from the third column of the input table. The rows and columns of the output matrix must consist of IDs, and at the intersection -- the corresponding values from the Score
column of the input matrix.
The values on the main diagonal must equal to 0, and the missing values must equal toNA
.
For exmaple, if i have input like this
> tmp1
ID_1 ID_2 Score
1: 1000020760 1000534822 0.17
2: 1000020760 1000535109 1.10
3: 1000020760 1000532510 3.20
4: 1000020760 1000535228 3.70
5: 1000535228 1000483302 2.50
6: 1000535228 1000020760 3.90
7: 1000535228 1000414853 5.10
8: 1000534822 1000020760 0.06
i'd like to see output like this
> tmp_mat
1000534822 1000535109 1000532510 1000535228 1000483302 1000020760 1000414853
1000534822 0.00 NA NA NA NA 0.06 NA
1000535109 NA 0.0 NA NA NA NA NA
1000532510 NA NA 0.0 NA NA NA NA
1000535228 NA NA NA 0.0 2.5 3.90 5.1
1000483302 NA NA NA NA 0.0 NA NA
1000020760 0.17 1.1 3.2 3.7 NA 0.00 NA
1000414853 NA NA NA NA NA NA 0.0
Upvotes: 0
Views: 151
Reputation: 5530
A vector containing the unique values of ID_1
and ID_2
is expanded to get all combinations. Scores
are then included where available and the diagonal elements zeroed. Finally the results can be reformatted as a wide tibble
. Using the tidyverse
library, the code would be
library(tidyverse)
ID_table <- unique(x = c(df1$ID_1, df1$ID_2)) %>% sort() %>% expand_grid(ID_1 = ., ID_2 = .) %>%
left_join(df1, by = c("ID_1", "ID_2")) %>% mutate(Score = ifelse(ID_1 == ID_2, 0, Score)) %>%
pivot_wider(names_from = ID_2, values_from = Score)
with the result
# A tibble: 7 x 8
ID_1 `1000020760` `1000414853` `1000483302` `1000532510` `1000534822` `1000535109` `1000535228`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1000020760 0 NA NA 3.2 0.17 1.1 3.7
2 1000414853 NA 0 NA NA NA NA NA
3 1000483302 NA NA 0 NA NA NA NA
4 1000532510 NA NA NA 0 NA NA NA
5 1000534822 0.06 NA NA NA 0 NA NA
6 1000535109 NA NA NA NA NA 0 NA
7 1000535228 3.9 5.1 2.5 NA NA NA 0
Upvotes: 1