Reputation: 55

Extract the character around a symbol in R

I would like to extract the characters around a symbol using R and sub. I have tried many regular expressions, but I'm not getting what I want.

My vector:

aa <- c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")

I only need one character before and after the >.

My best try was:

sub("(.*?)>", ">", aa, perl = TRUE)

Upvotes: 2

Answers (3)

JeremyS

Reputation: 3525

It looks like you are trying to get the reference and alternate alleles? Only looking for one character suggests you are only interested in SNPs? You could use strsplit to generate a data frame of ref and alt alleles.

test <- c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")
# Split each variant on the literal ">" and bind the pieces into two columns
Alleles <- data.frame(do.call(rbind, strsplit(test, split = ">", fixed = TRUE)), stringsAsFactors = FALSE)
colnames(Alleles) <- c("Ref", "Alt")
# Total number of bases across both alleles; exactly 2 means one base on each side
Alleles$bases <- nchar(Alleles$Ref) + nchar(Alleles$Alt)
SNPs <- Alleles[Alleles$bases == 2, ]

Just taking a single base on either side of the replacement symbol (>) is going to give you wrong genetic information. The variant "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C" would get reduced to "A>C": it looks like a simple SNP, but it is actually a deletion of the last 37 bases, "CGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>-".
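As a rough illustration of that point, the allele lengths can be used to label each variant before deciding which ones are safe to treat as SNPs. This is only a sketch: the names variants, ref, alt and type, and the label strings themselves, are made up for illustration and are not a standard classification.

```r
# Hypothetical example input: an insertion, a SNP and a deletion
variants <- c("G>GA", "T>A", "CT>C")

# Split on the literal ">" and pull out the two sides
parts <- strsplit(variants, ">", fixed = TRUE)
ref <- vapply(parts, `[`, character(1), 1)
alt <- vapply(parts, `[`, character(1), 2)

# Label by allele length (illustrative labels only)
type <- ifelse(nchar(ref) == 1 & nchar(alt) == 1, "SNP",
        ifelse(nchar(ref) > nchar(alt), "deletion",
        ifelse(nchar(ref) < nchar(alt), "insertion", "other")))

data.frame(ref, alt, type, stringsAsFactors = FALSE)
```

Only the rows labelled "SNP" here would survive the bases == 2 filter; the others would be misrepresented if cut down to one character per side.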

Is this what you were after?

Upvotes: 0

Martin Morgan

Reputation: 46856

Provide a reproducible example

> x = c("A>G", "AT>GC")

Find the index of the symbol you're interested in (use fixed=TRUE because you're not actually looking for a regular expression).

> i = regexpr(">", x, fixed=TRUE)

Then extract the preceding and / or following character

> substr(x, i-1, i-1)
[1] "A" "T"
> substr(x, i+1, i+1)
[1] "G" "G"

or get the sequence

> substr(x, i-1, i+1)
[1] "A>G" "T>G"

Maybe your reproducible example includes edge cases

> x = c("A>G", "AT>GC", "", ">G", "A>", ">", NA)

and then more processing is needed?
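One possible way to handle those edge cases is sketched below. The helper name around_symbol and the choice to return NA when the symbol is absent are assumptions made for illustration, not part of the approach above; a missing neighbour (for example in ">G") is left as an empty string.

```r
# Hypothetical helper: one character before and after the first `sym`
around_symbol <- function(x, sym = ">") {
  # Position of the first match: -1 if absent, NA for NA input.
  # as.integer() drops the attributes that regexpr() attaches.
  i <- as.integer(regexpr(sym, x, fixed = TRUE))
  found <- !is.na(i) & i > 0
  data.frame(
    before = ifelse(found, substr(x, i - 1, i - 1), NA_character_),
    after  = ifelse(found, substr(x, i + 1, i + 1), NA_character_),
    stringsAsFactors = FALSE
  )
}

x <- c("A>G", "AT>GC", "", ">G", "A>", ">", NA)
res <- around_symbol(x)
res$before  # "A" "T" NA  ""  "A" ""  NA
res$after   # "G" "G" NA  "G" ""  ""  NA
```

Whether an absent symbol should give NA or an empty string depends on what the downstream code expects, so treat that choice as adjustable.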

Upvotes: 5

James

Reputation: 66834

You need to use capture groups in your regex:

> vec <- c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")
> sub(".*(.)>(.).*","\\1\\2",vec)
 [1] "GG" "TA" "GA" "GA" "AT" "TC" "TC" "TC" "AT" "TC" "TA" "AG" "AC" "CT" "TA"
[16] "TC" "TG" "GC" "TG" "TA" "GA"

In words: the regex matches anything zero or more times with .*, captures the next character with (.), matches the greater-than sign >, captures the following character with (.), and finally matches anything zero or more times up to the end with .*. The whole match is then replaced with the two captured characters, \\1\\2.
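If the > itself should stay in the result, the same capture groups can be reused with the symbol written back into the replacement. A minimal sketch, using a shortened made-up vector (short_vec) rather than the full data:

```r
# Shortened example vector, for illustration only
short_vec <- c("G>GA", "CT>C", "A>T")

# Same pattern as above, but the replacement puts ">" back between the groups
out <- sub(".*(.)>(.).*", "\\1>\\2", short_vec)
out  # "G>G" "T>C" "A>T"
```

Note that sub returns any element the pattern does not match unchanged, so strings with no character on one side of > (or no > at all) would pass through as they are.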

Upvotes: 9
