user15169542

Reputation: 11

R: How to recode values based on first characters 0|0 vs 1|0 vs 0|1 vs 1|1

Thank you in advance for the help.

I am trying to recode a genetic database that contains genotypes in VCF format. For context, each value looks like this: '0|0:0,0:0:1,0,0'. The part I am interested in is the first two characters (three if you count the |), i.e. the 0|0 in 0|0:0,0:0:1,0,0. If these are 0|0, the person has two dominant alleles. If these are 1|1, two recessive alleles. 1|0 and 0|1 are a mix of the two.

I am working on a data frame called "gg" that contains approx 120 columns (one for each SNP) and 1500 rows (one for each subject in the study).

I am trying to recode each SNP from its current format into a more easily analysable one.

I have attempted several approaches. The latest thing I have attempted has got close-ish. I tried the following:

gg[grep("0|0", gg)] <- "0"

Weirdly this makes all the values for the WHOLE database 0's. I think this is because it is interpreting the 0|0 as 'if the value contains a zero or a zero, recode as zero' (and all values contain at least one zero).
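A small sketch of why the pattern matches everything, using a hypothetical vector x rather than the real gg. In a regular expression an unescaped | means alternation, so "0|0" matches any string that contains a 0 anywhere. Escaping the pipe and anchoring with ^, or using startsWith() (which does no regex interpretation), restricts the match to the leading characters.

```r
# Hypothetical example values in the same shape as the genotype strings
x <- c("0|0:0,0:0:1,0,0", "1|1:0,0:0:1,0,0", "0|1:0,0:0:1,0,0")

# Unescaped "|" is regex alternation: "0 or 0", so any string with a 0 matches
any_zero <- grepl("0|0", x)

# Escape the pipe and anchor at the start to match the literal prefix "0|0"
prefix_regex <- grepl("^0\\|0", x)

# startsWith() compares literal prefixes, no escaping needed
prefix_literal <- startsWith(x, "0|0")
```

When gg is a data frame, grep() also appears to treat each column as a single string, so the returned indices refer to columns rather than cells; that is probably why whole columns were overwritten.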

What I want is to recode as 0 if the value starts with the exact characters 0|0, as 1 if it starts with the exact characters 0|1 or 1|0, and as 2 if it starts with the exact characters 1|1.

Upvotes: 1

Views: 184

Answers (2)

akrun

Reputation: 887078

A slightly modified option is

rowSums(read.csv(text = sub("^(\\d)\\|?(\\d).*", "\\1,\\2", gg), 
         header = FALSE) == 1)
#[1] 0 1 1 2

data

gg <- c('0|0:0,0:0:1,0,0','10:0,0:0:1,0,0','0|1:0,0:0:1,0,0','11:0,0:0:1,0,0')
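The question describes a data frame with one column per SNP rather than a single vector. A minimal sketch of applying the same idea column-wise is below; the helper count_alt, the data frame gg_df, and its column names snp1 and snp2 are hypothetical names made up for illustration. It assumes the first two allele codes are 0 or 1, optionally separated by |, and returns NA for anything else (for example a missing genotype).

```r
# Hypothetical helper: count the "1" alleles at the start of each genotype string
count_alt <- function(x) {
  x <- as.character(x)
  # TRUE where the string starts with two 0/1 allele codes, pipe optional
  ok <- grepl("^[01]\\|?[01]", x)
  # Reduce e.g. "0|1:0,0:0:1,0,0" to "01"
  key <- sub("^([01])\\|?([01]).*", "\\1\\2", x)
  out <- (substr(key, 1, 1) == "1") + (substr(key, 2, 2) == "1")
  out[!ok] <- NA  # anything that does not fit the assumed pattern
  out
}

# Hypothetical two-SNP data frame standing in for the real 120-column one
gg_df <- data.frame(
  snp1 = c("0|0:0,0:0:1,0,0", "1|0:0,0:0:1,0,0", "0|1:0,0:0:1,0,0", "1|1:0,0:0:1,0,0"),
  snp2 = c("1|1:0,0:0:1,0,0", "0|0:0,0:0:1,0,0", "./.:0,0:0:1,0,0", "0|1:0,0:0:1,0,0"),
  stringsAsFactors = FALSE
)

# Recode every column while keeping the data frame shape and column names
recoded <- gg_df
recoded[] <- lapply(gg_df, count_alt)
```

Assigning into recoded[] keeps the result a data frame; the original strings stay available in gg_df in case the remaining VCF fields are needed later.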

Upvotes: 0

ThomasIsCoding

Reputation: 101317

Try the code below (note that list2DF requires R 4.0.0 or later)

colSums(list2DF(strsplit(substr(gsub("\\|","",gg),1,2),""))=="1")

which gives

0 1 1 2

Dummy Data

gg <- c('0|0:0,0:0:1,0,0','10:0,0:0:1,0,0','0|1:0,0:0:1,0,0','11:0,0:0:1,0,0')

Upvotes: 0
