Maya_Cent
Maya_Cent

Reputation: 481

Grouping datatable based on one column and transposed fields

In a datatable as follow:

CHR      POS          ID         indv1  indv2  indv3  
1   31439_T_A   ENSG00000232325     1    1    2
1   31504_G_A   ENSG00000242474     2    1    1
1   34713_A_C   ENSG00000242474     1    0    1
1   34918_C_T   ENSG00000242474     1    0    1
1   35119_G_A   ENSG00000239715     0    0    1

For each group of unique IDs in column ID, I want to build a dataframe in which the rows are the indv1, indv2, ... and the other columns are the corresponding values of all variants, column POS.

UPDATE: The rows in output will be indvs and columns will be POSs. The output should be prepared separately for each group of IDs in ID column since the number of variations in POS could be different for each IDs so it might be a list.

Any hints or suggestion would be appreciated.

Upvotes: 0

Views: 60

Answers (1)

TarJae
TarJae

Reputation: 79164

Update due to OP comment: Maybe this:

library(tidyr)
library(dplyr)

df %>% 
  group_by(ID) %>% 
  pivot_longer(
    cols = starts_with("indv")
  ) %>% 
  pivot_wider(
    names_from = POS,
    values_from = value
  )
    CHR ID              name  `31439_T_A` `31504_G_A` `34713_A_C` `34918_C_T` `35119_G_A`
  <int> <chr>           <chr>       <int>       <int>       <int>       <int>       <int>
1     1 ENSG00000232325 indv1           1          NA          NA          NA          NA
2     1 ENSG00000232325 indv2           1          NA          NA          NA          NA
3     1 ENSG00000232325 indv3           2          NA          NA          NA          NA
4     1 ENSG00000242474 indv1          NA           2           1           1          NA
5     1 ENSG00000242474 indv2          NA           1           0           0          NA
6     1 ENSG00000242474 indv3          NA           1           1           1          NA
7     1 ENSG00000239715 indv1          NA          NA          NA          NA           0
8     1 ENSG00000239715 indv2          NA          NA          NA          NA           0
9     1 ENSG00000239715 indv3          NA          NA          NA          NA           1

First answer: We could use pivot_longer:

library(tidyr)
library(dplyr)

df %>% 
  group_by(ID) %>% 
  pivot_longer(
    cols = starts_with("indv")
  )
     CHR POS       ID              name  value
   <dbl> <chr>     <chr>           <chr> <dbl>
 1     1 31439_T_A ENSG00000232325 indv1     1
 2     1 31439_T_A ENSG00000232325 indv2     1
 3     1 31439_T_A ENSG00000232325 indv3     2
 4     1 31504_G_A ENSG00000242474 indv1     2
 5     1 31504_G_A ENSG00000242474 indv2     1
 6     1 31504_G_A ENSG00000242474 indv3     1
 7     1 34713_A_C ENSG00000242474 indv1     1
 8     1 34713_A_C ENSG00000242474 indv2     0
 9     1 34713_A_C ENSG00000242474 indv3     1
10     1 34918_C_T ENSG00000242474 indv1     1
11     1 34918_C_T ENSG00000242474 indv2     0
12     1 34918_C_T ENSG00000242474 indv3     1
13     1 35119_G_A ENSG00000239715 indv1     0
14     1 35119_G_A ENSG00000239715 indv2     0
15     1 35119_G_A ENSG00000239715 indv3     1

Upvotes: 2

Related Questions