Reputation: 55
I would like to extract the characters around a symbol using R and sub. I have tried many regular expressions but I'm not getting what I want.
My vector:
c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")
I only need one character before and one character after the >.
My best try was:
sub("(.*?)>", ">", aa, perl = TRUE)
Upvotes: 2
Views: 508
Reputation: 3525
It looks like you are trying to get the reference and alternate alleles? Only looking for one character suggests you are only interested in SNPs? You could use strsplit to generate a data frame of ref and alt alleles.
test <- c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")
# Split each variant at ">" and bind the pieces into a two-column matrix
parts <- do.call(rbind, strsplit(test, split = ">", fixed = TRUE))
Alleles <- data.frame(Ref = parts[, 1], Alt = parts[, 2], stringsAsFactors = FALSE)
# Total number of bases in ref + alt; a SNP has exactly one of each
Alleles$bases <- nchar(Alleles$Ref) + nchar(Alleles$Alt)
SNPs <- Alleles[Alleles$bases == 2, ]
Just taking a single base on either side of the > is going to give you the wrong genetic information. The variant "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C" would get reduced to "A>C". That looks like a simple SNP, but the variant is really a deletion of the last 37 bases, "CGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>-".
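To make that concrete, here is a small sketch that puts the naive one-base-each-side result next to a rough variant type. The labels ("SNP", "deletion", "insertion", "other") and the length-based rule are my own simplification for illustration, not a proper variant classification.

```r
variants <- c("T>A", "CT>C", "G>GA",
              "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C")

# Split into reference and alternate alleles
parts <- strsplit(variants, ">", fixed = TRUE)
ref <- vapply(parts, `[`, character(1), 1)
alt <- vapply(parts, `[`, character(1), 2)

# Rough classification by allele length only (an assumption, for illustration)
type <- ifelse(nchar(ref) == 1 & nchar(alt) == 1, "SNP",
        ifelse(nchar(ref) > nchar(alt), "deletion",
        ifelse(nchar(ref) < nchar(alt), "insertion", "other")))

# What you get if you keep only one base on each side of ">"
naive <- sub(".*(.)>(.).*", "\\1>\\2", variants)

data.frame(variant = variants, naive = naive, type = type)
```

The long variant shows up as "A>C" in the naive column even though it is labelled a deletion, which is the problem described above.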
Is this what you were after?
Upvotes: 0
Reputation: 46856
Provide a reproducible example
> x = c("A>G", "AT>GC")
Find the index of the symbol you're interested in (use fixed=TRUE because you're not actually looking for a regular expression).
> i = regexpr(">", x, fixed=TRUE)
Then extract the preceding and / or following character
> substr(x, i-1, i-1)
[1] "A" "T"
> substr(x, i+1, i+1)
[1] "G" "G"
or get the sequence
> substr(x, i-1, i+1)
[1] "A>G" "T>G"
Maybe your reproducible example includes edge cases
> x = c("A>G", "AT>GC", "", ">G", "A>", ">", NA)
and then more processing is needed?
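If so, one possible approach (a sketch only) is to wrap the steps above in a helper and turn empty results into NA. The name flank_chars is hypothetical, and returning NA when the symbol or a neighbouring character is missing is an assumption about what you would want.

```r
# Hypothetical helper: the character before and after the first `symbol`
flank_chars <- function(x, symbol = ">") {
  i <- regexpr(symbol, x, fixed = TRUE)   # -1 if absent, NA if x is NA
  before <- substr(x, i - 1, i - 1)
  after  <- substr(x, i + 1, i + 1)
  # substr() returns "" when the position is out of range; report that as NA
  before[!nzchar(before)] <- NA_character_
  after[!nzchar(after)]   <- NA_character_
  data.frame(before = before, after = after, stringsAsFactors = FALSE)
}

x <- c("A>G", "AT>GC", "", ">G", "A>", ">", NA)
res <- flank_chars(x)
res
```

Rows where either column is NA can then be dropped or inspected separately, for example with res[complete.cases(res), ].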
Upvotes: 5
Reputation: 66834
You need to use capture groups in your regex:
vec <- c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")
> sub(".*(.)>(.).*","\\1\\2",vec)
[1] "GG" "TA" "GA" "GA" "AT" "TC" "TC" "TC" "AT" "TC" "TA" "AG" "AC" "CT" "TA"
[16] "TC" "TG" "GC" "TG" "TA" "GA"
In words, the regex matches anything zero or more times with .*, then captures the next character with (.), then matches the greater-than sign >, then captures the next character with (.), and finally matches anything zero or more times at the end with .* again. All of this is replaced with the two captured characters, \\1\\2.
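If you would rather keep the > in the result, a small variation should do it. This is a sketch on a shortened vector (short_vec is just an illustrative name); the regmatches alternative extracts the match directly instead of replacing around it.

```r
short_vec <- c("G>GA", "CT>C", "T>A")

# Same capture groups, but put ">" back between them in the replacement
with_symbol <- sub(".*(.)>(.).*", "\\1>\\2", short_vec)

# Alternative: extract the first "any character, >, any character" match
extracted <- regmatches(short_vec, regexpr(".>.", short_vec))

with_symbol
extracted
```

Note that regmatches silently drops elements that have no match, whereas sub returns them unchanged, so the two can differ in length on messier input.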
Upvotes: 9