Komal Rathi

Reputation: 4274

How to get pattern between first occurrence of two characters in R?

I am trying to match a pattern: anything that is between VD= and the first occurrence of | from a character string, say tmp, like this:

tmp <- "PC=I;RS=128850544;RE=128850566;LEN=6;S1=36;S2=499.417;REP=2;VT=Ins;VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627;VC=intronic;VW=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"

gene <- sub("^.*VD=([A-Za-z0-9]+)[|].*", "\\1", tmp)
gene
# [1] "SMO"

But when there is no VD= or | in the string, it grabs the whole string:

tmp <- "PC=D;RS=72450731;RE=72450735;LEN=1;S1=72;S2=802.939;REP=3;VT=Del"

gene <- sub("^.*VD=([A-Za-z0-9]+)[|].*", "\\1", tmp)
gene
# [1] "PC=D;RS=72450731;RE=72450735;LEN=1;S1=72;S2=802.939;REP=3;VT=Del"

I don't understand why it returns the whole string instead of NA when neither VD= nor | is present. Is there a way to extract the text between the first occurrence of two characters, and get NA if the pattern is not found?

Any help would be much appreciated.

Thanks!

Upvotes: 2

Views: 86

Answers (2)

bgoldst

Reputation: 35314

It looks to me like you're effectively trying to parse a multilevel delimited string. I recommend not trying to use a single regex to extract the information you want, but rather using a more rigorous stepwise breakdown of the elements of the syntax.

First, you can split on semicolon to get the top-level pieces that look like variable assignments:

tmp <- 'PC=I;RS=128850544;RE=128850566;LEN=6;S1=36;S2=499.417;REP=2;VT=Ins;VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627;VC=intronic;VW=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627';
specs <- strsplit(fixed=T,tmp,';')[[1L]];
specs;
##  [1] "PC=I"
##  [2] "RS=128850544"
##  [3] "RE=128850566"
##  [4] "LEN=6"
##  [5] "S1=36"
##  [6] "S2=499.417"
##  [7] "REP=2"
##  [8] "VT=Ins"
##  [9] "VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
## [10] "VC=intronic"
## [11] "VW=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"

Next you can search for the LHS of interest, extracting just the first occurrence (in case there are multiple matches):

vdspec <- grep(perl=T,value=T,'^VD=',specs)[1L];
vdspec;
## [1] "VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"

You can drill down into the RHS and then split that into the pipe-delimited fields:

vd <- sub(perl=T,'^VD=','',vdspec);
vd;
## [1] "SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
vdfields <- strsplit(fixed=T,vd,'|')[[1L]];
vdfields;
## [1] "SMO"
## [2] "CCDS5811.1"
## [3] "r.?"
## [4] "-"
## [5] "-"
## [6] "protein_coding:CDS:intron:insertion:intron_variant"
## [7] "SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"

Now you can easily get the value you're looking for:

vdfields[1L];
## [1] "SMO"

If your target LHS does not match, you'll get NA from the grep()[1L] call:

xxspec <- grep(perl=T,value=T,'^XX=',specs)[1L];
xxspec;
## [1] NA

Thus you can branch on the result of the grep()[1L] call to handle the case of a missing LHS.
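
The steps above can be collected into one function that branches on the NA result. This is only a sketch: the name extract_vd_gene and the shortened sample strings are illustrative assumptions, not part of the original answer. If the VD= value contains no pipe, it returns the whole value.

```r
# Hypothetical helper: returns the first pipe-delimited field of the VD= entry,
# or NA_character_ when the string has no VD= entry.
extract_vd_gene <- function(x) {
  # Split into the top-level KEY=VALUE pieces
  specs <- strsplit(x, ";", fixed = TRUE)[[1L]]
  # First piece starting with VD=; indexing an empty result with [1L] gives NA
  vdspec <- grep("^VD=", specs, value = TRUE)[1L]
  if (is.na(vdspec)) return(NA_character_)
  # Drop the VD= prefix, then take the first pipe-delimited field
  vd <- sub("^VD=", "", vdspec)
  strsplit(vd, "|", fixed = TRUE)[[1L]][1L]
}

# Shortened sample inputs (assumed for illustration)
with_vd <- "PC=I;VT=Ins;VD=SMO|CCDS5811.1|r.?;VC=intronic"
without_vd <- "PC=D;RS=72450731;VT=Del"

extract_vd_gene(with_vd)     # "SMO"
extract_vd_gene(without_vd)  # NA
```

For a character vector of such strings, something like vapply(x, extract_vd_gene, character(1), USE.NAMES = FALSE) should apply it element by element.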

Upvotes: 1

user2705585

Your regex seems more complicated than the task requires. A simple regex like VD=([^|]+) would be sufficient; use \\1 to back-reference the captured group.

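Explanation: ([^|]+) captures everything after VD= up to, but not including, the first |. In the gsub() call below, the second alternative . matches every other character; group 1 is empty for those matches, so they are replaced with nothing. A string with no VD= therefore yields "" rather than NA.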


tmp <- c("PC=D;RS=72450731;RE=72450735;LEN=1;S1=72;S2=802.939;REP=3;VT=Del", "PC=I;RS=128850544;RE=128850566;LEN=6;S1=36;S2=499.417;REP=2;VT=Ins;VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627;VC=intronic;VW=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627")
gsub('VD=([^|]+)|.', '\\1', tmp)
# [1] ""    "SMO"
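
If NA is wanted instead of "" for strings without VD=, one possible variation is to locate the match with regexpr() and fill in only the elements that matched. This also relates to the behaviour in the question: sub() returns its input unchanged when the pattern does not match, so detecting the non-match has to be done separately. The shortened sample strings below are assumptions for illustration.

```r
# Shortened sample inputs (assumed for illustration)
tmp <- c("PC=D;RS=72450731;VT=Del",
         "PC=I;VT=Ins;VD=SMO|CCDS5811.1|r.?;VC=intronic")

# regexpr() returns -1 for elements where the pattern does not match
m <- regexpr("VD=[^|]+", tmp)

# Start with NA everywhere, then fill in the elements that matched;
# regmatches() returns matches only for those elements
gene <- rep(NA_character_, length(tmp))
gene[m != -1] <- sub("^VD=", "", regmatches(tmp, m))

gene  # NA for the first element, "SMO" for the second
```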

Upvotes: 4
