Reputation: 11
Thank you in advance for the help.
I am trying to recode a genetic database that contains genotypes in VCF format. For context, each VCF entry looks like this: '0|0:0,0:0:1,0,0'. The part I am interested in is the first two characters (three if you count the |): 0|0:0,0:0:1,0,0. If these are 0|0, the person has two dominant alleles. If these are 1|1, two recessive alleles. 1|0 and 0|1 are a mix of the two.
I am working on a data frame called "gg" that contains approx 120 columns (one for each SNP) and 1500 rows (one for each subject in the study).
I am trying to recode the SNP from its current format to a more easily analysable format:
I have attempted several approaches. The latest thing I have attempted has got close-ish. I tried the following:
gg[grep("0|0", gg)] <- "0"
Weirdly this makes all the values for the WHOLE database 0's. I think this is because it is interpreting the 0|0 as 'if the value contains a zero or a zero, recode as zero' (and all values contain at least one zero).
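The behaviour described above can be reproduced on a small scale. The sketch below uses a hypothetical vector `x` and a hypothetical data frame `df` (neither is the real `gg`) to compare the unescaped pattern with an escaped, anchored pattern and with a literal prefix test. It also illustrates a second likely contributor: when `grep()` is given a data frame, it coerces each column to a single string, so the result indexes whole columns rather than individual cells.

```r
# Hypothetical example values (not the real data)
x <- c("0|0:0,0:0:1,0,0", "1|1:0,0:0:1,0,0", "0|1:0,0:0:1,0,0")

# Unescaped: "|" is regex alternation, so the pattern means "0 or 0"
unescaped <- grepl("0|0", x)

# Escaped and anchored: a literal "0|0" at the start of the string
escaped <- grepl("^0\\|0", x)

# startsWith() compares a literal prefix and uses no regex at all
literal <- startsWith(x, "0|0")

# On a data frame, grep() sees one coerced string per column,
# so it returns column positions, not cell positions
df <- data.frame(a = x, b = rev(x), stringsAsFactors = FALSE)
cols <- grep("0|0", df)
```

Here `unescaped` is `TRUE` for every element because every value contains a zero somewhere, while `escaped` and `literal` are `TRUE` only for the first element. `cols` contains both column positions, which is why an assignment such as `df[cols] <- "0"` overwrites entire columns.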
What I want to convey is: recode as 0 if the value starts with the EXACT characters 0|0, recode as 1 if it starts with the EXACT characters 0|1 or 1|0, and recode as 2 if it starts with the EXACT characters 1|1.
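One way to express this rule for every column of a data frame is a lookup on the first three characters. This is only a sketch under assumptions: `gg_demo` is a hypothetical two-SNP stand-in for `gg`, `recode_gt` is a helper name made up for illustration, and every genotype is assumed to be written with the `|` separator as in the question.

```r
# Hypothetical two-SNP data frame standing in for `gg`
gg_demo <- data.frame(
  snp1 = c("0|0:0,0:0:1,0,0", "1|0:0,0:0:1,0,0", "1|1:0,0:0:1,0,0"),
  snp2 = c("0|1:0,0:0:1,0,0", "0|0:0,0:0:1,0,0", "1|0:0,0:0:1,0,0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map the genotype prefix to a count of "1" alleles
recode_gt <- function(x) {
  lookup <- c("0|0" = 0L, "0|1" = 1L, "1|0" = 1L, "1|1" = 2L)
  gt <- substr(x, 1, 3)   # first three characters, e.g. "0|1"
  unname(lookup[gt])      # prefixes not in the lookup become NA
}

# Apply to every column while keeping the data frame structure
gg_demo[] <- lapply(gg_demo, recode_gt)
```

After this runs, `snp1` holds 0, 1, 2 and `snp2` holds 1, 0, 1. Any value whose prefix is not in the lookup (for example a missing genotype) comes back as `NA` rather than being silently recoded, which makes unexpected codes easy to spot.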
Upvotes: 1
Views: 184
Reputation: 887078
A slightly modified option is
rowSums(read.csv(text = sub("^(\\d)\\|?(\\d).*", "\\1,\\2", gg),
                 header = FALSE) == 1)
#[1] 0 1 1 2
Data
gg <- c('0|0:0,0:0:1,0,0','10:0,0:0:1,0,0','0|1:0,0:0:1,0,0','11:0,0:0:1,0,0')
Upvotes: 0
Reputation: 101317
Try the code below
colSums(list2DF(strsplit(substr(gsub("\\|","",gg),1,2),""))=="1")
which gives
0 1 1 2
Dummy Data
gg <- c('0|0:0,0:0:1,0,0','10:0,0:0:1,0,0','0|1:0,0:0:1,0,0','11:0,0:0:1,0,0')
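`list2DF()` was added in R 4.0.0, so the one-liner above will not run on older installations. A minimal sketch of the same count using `vapply()` instead, with `alleles` and `counts` as illustrative names only:

```r
gg <- c('0|0:0,0:0:1,0,0','10:0,0:0:1,0,0','0|1:0,0:0:1,0,0','11:0,0:0:1,0,0')

# Drop the literal "|" (fixed = TRUE avoids regex escaping), keep the two alleles
alleles <- substr(gsub("|", "", gg, fixed = TRUE), 1, 2)

# Split each pair into single characters and count how many are "1"
counts <- vapply(strsplit(alleles, ""), function(a) sum(a == "1"), integer(1))
```

This gives the same values as the `colSums()` version, as a plain integer vector.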
Upvotes: 0