Reputation: 4274
I am trying to match a pattern: anything between VD= and the first occurrence of | in a character string, say tmp, like this:
tmp <- "PC=I;RS=128850544;RE=128850566;LEN=6;S1=36;S2=499.417;REP=2;VT=Ins;VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627;VC=intronic;VW=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
gene <- sub("^.*VD=([A-Za-z0-9]+)[|].*", "\\1", tmp)
gene
# [1] "SMO"
But when there is no VD= or | in the string, it grabs the whole string:
tmp <- "PC=D;RS=72450731;RE=72450735;LEN=1;S1=72;S2=802.939;REP=3;VT=Del"
gene <- sub("^.*VD=([A-Za-z0-9]+)[|].*", "\\1", tmp)
gene
# [1] "PC=D;RS=72450731;RE=72450735;LEN=1;S1=72;S2=802.939;REP=3;VT=Del"
I don't understand why it grabs the whole string instead of returning NA even when no VD= or | characters are present. Is there a way to grab the pattern between the first occurrences of two delimiters and print it, or print NA if the pattern is not found?
Any help would be much appreciated.
Thanks!
Upvotes: 2
Views: 86
Reputation: 35314
It looks to me like you're effectively trying to parse a multilevel delimited string. I recommend not trying to use a single regex to extract the information you want, but rather using a more rigorous stepwise breakdown of the elements of the syntax.
First, you can split on semicolon to get the top-level pieces that look like variable assignments:
tmp <- 'PC=I;RS=128850544;RE=128850566;LEN=6;S1=36;S2=499.417;REP=2;VT=Ins;VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627;VC=intronic;VW=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627';
specs <- strsplit(fixed=T,tmp,';')[[1L]];
specs;
## [1] "PC=I"
## [2] "RS=128850544"
## [3] "RE=128850566"
## [4] "LEN=6"
## [5] "S1=36"
## [6] "S2=499.417"
## [7] "REP=2"
## [8] "VT=Ins"
## [9] "VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
## [10] "VC=intronic"
## [11] "VW=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
Next you can search for the LHS of interest, extracting just the first occurrence (in case there are multiple matches):
vdspec <- grep(perl=T,value=T,'^VD=',specs)[1L];
vdspec;
## [1] "VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
You can drill down into the RHS and then split that into the pipe-delimited fields:
vd <- sub(perl=T,'^VD=','',vdspec);
vd;
## [1] "SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
vdfields <- strsplit(fixed=T,vd,'|')[[1L]];
vdfields;
## [1] "SMO"
## [2] "CCDS5811.1"
## [3] "r.?"
## [4] "-"
## [5] "-"
## [6] "protein_coding:CDS:intron:insertion:intron_variant"
## [7] "SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
Now you can easily get the value you're looking for:
vdfields[1L];
## [1] "SMO"
If your target LHS does not match, you'll get NA from the grep()[1L] call:
xxspec <- grep(perl=T,value=T,'^XX=',specs)[1L];
xxspec;
## [1] NA
Thus you can branch on the result of the grep()[1L] call to handle the case of a missing LHS.
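If you need this for a whole vector of strings, the steps above could be wrapped up roughly like this. It's only a sketch: the function name extract_first_field and its key argument are made up for illustration, and it assumes fields are separated by ; and subfields by | exactly as in your sample.

```r
# Hypothetical helper: first pipe-delimited field of a given key, or NA.
extract_first_field <- function(x, key = "VD") {
  vapply(strsplit(x, ";", fixed = TRUE), function(specs) {
    # Keep only the first "KEY=..." piece; indexing with [1L] yields NA if none.
    spec <- specs[startsWith(specs, paste0(key, "="))][1L]
    if (is.na(spec)) return(NA_character_)
    # Drop the "KEY=" prefix, then take the first pipe-delimited field.
    rhs <- substring(spec, nchar(key) + 2L)
    strsplit(rhs, "|", fixed = TRUE)[[1L]][1L]
  }, character(1L), USE.NAMES = FALSE)
}

# Shortened versions of the strings from the question.
extract_first_field(c("PC=I;VT=Ins;VD=SMO|CCDS5811.1|r.?;VC=intronic",
                      "PC=D;VT=Del"))
## [1] "SMO" NA
```

The branch on is.na() is the same check as on the grep()[1L] result above, just done once per input string.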
Upvotes: 1
Reputation:
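Your regex seems more complicated than the task requires. A simple regex like this would be sufficient:
Regex: VD=([^|]+)
Use \\1 to back-reference the captured group.
Explanation: ([^|]+) matches everything after VD= until the first | is encountered. In the gsub() call below, the extra |. alternative matches every other character and replaces it with nothing, so only the captured text remains.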
tmp <- c("PC=D;RS=72450731;RE=72450735;LEN=1;S1=72;S2=802.939;REP=3;VT=Del", "PC=I;RS=128850544;RE=128850566;LEN=6;S1=36;S2=499.417;REP=2;VT=Ins;VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627;VC=intronic;VW=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627")
gsub('VD=([^|]+)|.', '\\1', tmp)
# [1] "" "SMO"
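Note that this gives "" rather than NA when VD= is absent. As for the original question, sub() and gsub() return the input unchanged when the pattern does not match, which is why the whole string came back. If you want a real NA, one possible sketch is to test for a match first with grepl(). The name extract_vd is just for illustration, and it keeps the same [^|]+ pattern as above.

```r
# Hypothetical helper: text between "VD=" and the first "|", or NA.
extract_vd <- function(x) {
  # grepl() is FALSE for strings without a match (and for NA input).
  has_vd <- grepl("VD=[^|]+", x)
  out <- rep(NA_character_, length(x))
  # Only rewrite strings that actually contain the pattern; the lazy .*?
  # (hence perl = TRUE) stops at the first "VD=".
  out[has_vd] <- sub(".*?VD=([^|]+).*", "\\1", x[has_vd], perl = TRUE)
  out
}

# Shortened versions of the strings from the question.
extract_vd(c("PC=D;VT=Del", "VT=Ins;VD=SMO|CCDS5811.1|r.?;VC=intronic"))
## [1] NA    "SMO"
```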
Upvotes: 4