autumnrustle
autumnrustle

Reputation: 663

How to create summary table in R

I have data next structure

>   head(all_data[, c('person_id', 'gene', 'genotype', 'group')], 20)
   person_id           gene genotype group
1         58 ACE_AluIns/Del      I/D     1
2         58     eNOS_4a/4b    4b/4b     1
3         58     eNOS_G894T      G/T     1
4         58   F1_Thr312Ala  Thr/Thr     1
5         58   F13_Val34Leu  Val/Val     1
6         58     F2_G20210А      G/G     1
7         58   F5_Arg506Gln      G/G     1
8         58   MTHFR_А1298С      A/C     1
9         58    MTHFR_С677Т      C/T     1
10        58    PAI-1_4G/5G    5G/5G     1
11       110 ACE_AluIns/Del      I/D     1
12       110     eNOS_4a/4b    4b/4b     1
13       110     eNOS_G894T      T/T     1
14       110   F1_Thr312Ala  Thr/Thr     1
15       110   F13_Val34Leu  Val/Leu     1
16       110     F2_G20210А      G/G     1
17       110   F5_Arg506Gln      G/G     1
18       110   MTHFR_А1298С      C/C     1
19       110    MTHFR_С677Т      C/C     1
20       58    PAI-1_4G/5G    5G/5G      1

In total ~50 000 rows in test dataset. Each row represent results of genetic alalisys of some persons. The rows belong to the same person_id - results of one person.

I need to create summary table:

Example

> head(summarytable[,1:5])
  person_id ACE_AluIns/Del eNOS_4a/4b eNOS_G894T F1_Thr312Ala
1        58            ---        ---        ---          ---
2       110            I/D      4b/4b        T/T      Thr/Thr
3       117            I/D      4b/4a        G/G      Thr/Ala
4       143            ---        ---        ---          ---
5       148            ---        ---        ---          ---
6       156           <NA>       <NA>       <NA>         <NA>

Some important notes:

My attempt to solve the problem is presented below.

#create test dataframe
data = data.frame(person_id = rep(1:10, each=10),
                      gene = paste('gene_', rep(1:10, 10)),
                      genotype = paste('genotype_', rep(1:10, 10)),
                      group = 1
                      )
data[11,1] =  1 # change one value (must be --- in future)
data = data[-c(15, 22, 50), ] #remove some rows (must be NA in future)
  

#create new dataframe for summary table
sunnary_table = data.frame(person_id = unique(data$person_id))


#try to create summary table
for (index in 1:nrow(data)) {
  row = unlist(data[index, ])
  
  if(is.null(sunnary_table[sunnary_table$person_id == row['person_id'], row['gene']])) {
    sunnary_table[sunnary_table$person_id == row['person_id'], row['gene']] = NA
  }
  
  if(is.na(sunnary_table[sunnary_table$person_id == row['person_id'], row['gene']])) {
    sunnary_table[sunnary_table$person_id == row['person_id'], row['gene']] = row['genotype']
  } else {
    sunnary_table[sunnary_table$person_id == row['person_id'], row['gene']] = '---'
  }
  
}

But it is very slow event when try handle 50000 rows. How can I solve my problem more efficiently?

Upvotes: 1

Views: 84

Answers (1)

dy_by
dy_by

Reputation: 1241

using data.table you shoud be more efficient.

library(dplyr)
library(data.table)

## convert to data table
all_data <- all_data %>% as.data.table() ## conv to data table

## recognaize multiple results
all_data$dupl <- duplicated(all_data , fromLast=TRUE) 

## get rid of multiple results
all_data <- all_data %>%  
  distinct(across(c('person_id', 'gene', 'group', 'genotype')), .keep_all = TRUE) %>%  
  mutate(genotype = ifelse(dupl, "----", genotype),
         dupl = NULL)

## generate desired output
dcast(person_id + group ~ gene, data = all_data , value.var = 'genotype')

Upvotes: 1

Related Questions