Reputation: 663
I have data next structure
> head(all_data[, c('person_id', 'gene', 'genotype', 'group')], 20)
person_id gene genotype group
1 58 ACE_AluIns/Del I/D 1
2 58 eNOS_4a/4b 4b/4b 1
3 58 eNOS_G894T G/T 1
4 58 F1_Thr312Ala Thr/Thr 1
5 58 F13_Val34Leu Val/Val 1
6 58 F2_G20210А G/G 1
7 58 F5_Arg506Gln G/G 1
8 58 MTHFR_А1298С A/C 1
9 58 MTHFR_С677Т C/T 1
10 58 PAI-1_4G/5G 5G/5G 1
11 110 ACE_AluIns/Del I/D 1
12 110 eNOS_4a/4b 4b/4b 1
13 110 eNOS_G894T T/T 1
14 110 F1_Thr312Ala Thr/Thr 1
15 110 F13_Val34Leu Val/Leu 1
16 110 F2_G20210А G/G 1
17 110 F5_Arg506Gln G/G 1
18 110 MTHFR_А1298С C/C 1
19 110 MTHFR_С677Т C/C 1
20 58 PAI-1_4G/5G 5G/5G 1
In total ~50 000 rows in test dataset. Each row represent results of genetic alalisys of some persons. The rows belong to the same person_id
- results of one person.
I need to create summary table:
person_id
and group
gene
genotype
Example
> head(summarytable[,1:5])
person_id ACE_AluIns/Del eNOS_4a/4b eNOS_G894T F1_Thr312Ala
1 58 --- --- --- ---
2 110 I/D 4b/4b T/T Thr/Thr
3 117 I/D 4b/4a G/G Thr/Ala
4 143 --- --- --- ---
5 148 --- --- --- ---
6 156 <NA> <NA> <NA> <NA>
Some important notes:
person_id
) may have two genetic analysis results (genotype
) for the same gene
. In first example you can see person_id==58 & gene=='PAI-1_4G/5G'
have 2 results (rows 10 and 20). in this case need to set '---'
as value in summary_table
My attempt to solve the problem is presented below.
#create test dataframe
data = data.frame(person_id = rep(1:10, each=10),
gene = paste('gene_', rep(1:10, 10)),
genotype = paste('genotype_', rep(1:10, 10)),
group = 1
)
data[11,1] = 1 # change one value (must be --- in future)
data = data[-c(15, 22, 50), ] #remove some rows (must be NA in future)
#create new dataframe for summary table
sunnary_table = data.frame(person_id = unique(data$person_id))
#try to create summary table
for (index in 1:nrow(data)) {
row = unlist(data[index, ])
if(is.null(sunnary_table[sunnary_table$person_id == row['person_id'], row['gene']])) {
sunnary_table[sunnary_table$person_id == row['person_id'], row['gene']] = NA
}
if(is.na(sunnary_table[sunnary_table$person_id == row['person_id'], row['gene']])) {
sunnary_table[sunnary_table$person_id == row['person_id'], row['gene']] = row['genotype']
} else {
sunnary_table[sunnary_table$person_id == row['person_id'], row['gene']] = '---'
}
}
But it is very slow event when try handle 50000 rows. How can I solve my problem more efficiently?
Upvotes: 1
Views: 84
Reputation: 1241
using data.table
you shoud be more efficient.
library(dplyr)
library(data.table)
## convert to data table
all_data <- all_data %>% as.data.table() ## conv to data table
## recognaize multiple results
all_data$dupl <- duplicated(all_data , fromLast=TRUE)
## get rid of multiple results
all_data <- all_data %>%
distinct(across(c('person_id', 'gene', 'group', 'genotype')), .keep_all = TRUE) %>%
mutate(genotype = ifelse(dupl, "----", genotype),
dupl = NULL)
## generate desired output
dcast(person_id + group ~ gene, data = all_data , value.var = 'genotype')
Upvotes: 1