Reputation: 3555
I want to keep a substring from inside a complex string. I think I can use a regex to keep the part I need. Basically, I want to keep only the information between the \" and \" in Function=\"SMAD5\". I also want to keep the empty strings: Function=\"\"
df=structure(1:6, .Label = c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
"ID=Gfo_R000002;Source=ENSTGUT00000017468;Function=\"CENPA\";",
"ID=Gfo_R000003;Source=ENSGALT00000028134;Function=\"C1QL4\";",
"ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";", "ID=Gfo_R000005;Source=ENSTGUT00000019268;Function=\"\";",
"ID=Gfo_R000006;Source=ENSTGUT00000019035;Function=\"\";"), class = "factor")
This should look like this:
"SMAD5"
"CENPA"
"C1QL4"
NA
NA
NA
So far, this is what I was able to do:
gsub('.*Function=\"',"",df)
[1] "SMAD5\";" "CENPA\";" "C1QL4\";" "\";" "\";" "\";"
But I'm stuck with a trailing \"; on every element. How can I remove it in one line?
I tried this:
gsub('.*Function=\"' & '.\"*',"",test)
But it's giving me this error:
Error in ".*Function=\"" & ".\"*" :
operations are possible only for numeric, logical or complex types
Upvotes: 3
Views: 1080
Reputation: 121127
The regular expression can be constructed more readably using rebus.
library(rebus)
rx <- 'Function="' %R%
  capture(zero_or_more(negated_char_class('"')))
Then matching is as mentioned by Wiktor and sandipan.
library(stringr)
library(stringi)
str_match(df, rx)               # stringr: column 2 holds the captured group
stri_match_first_regex(df, rx)  # stringi equivalent
gsub(any_char(0, Inf) %R% rx %R% any_char(0, Inf), REF1, df)  # base R, using rebus's REF1 backreference
Upvotes: 0
Reputation: 23109
With stringr we can capture groups too:
library(stringr)
matches <- str_match(df, ".*\"(.*)\".*")[,2]
ifelse(matches=='', NA, matches)
# [1] "SMAD5" "CENPA" "C1QL4" NA NA NA
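If stringr is not available, the same capture-group idea can be sketched in base R with regexec() and regmatches(). This is only an illustration: the vector x is a shortened copy of the question's data, and the pattern is anchored on Function= (an assumption on my part, so that other quoted fields would not interfere).

```r
# Shortened sample of the question's data, as a plain character vector
x <- c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
       "ID=Gfo_R000002;Source=ENSTGUT00000017468;Function=\"CENPA\";",
       "ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";")

# regexec() records the full match plus each capture group;
# regmatches() turns those positions into strings (one vector per input)
m <- regmatches(x, regexec("Function=\"([^\"]*)\"", x))

# Element 1 is the full match, element 2 is capture group 1
matches <- vapply(m, function(g) g[2], character(1))

# Empty Function values come back as "", so map them to NA
matches[matches == ""] <- NA
matches
```

An input with no Function field at all also ends up as NA here, because indexing an empty match vector with g[2] yields NA.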
Upvotes: 1
Reputation: 627082
You may use
gsub(".*Function=\"([^\"]*).*","\\1",df)
See the regex demo
Details:
.* - any 0+ chars, as many as possible, up to the last...
Function=\" - a Function=" substring
([^\"]*) - capturing group 1, matching 0+ chars other than a "
.* - and the rest of the string.
The \1 is the backreference restoring the contents of Group 1 in the result.
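As a small worked sketch of the pattern above (x is a shortened copy of the question's data, not the full df): gsub() returns "" for the empty Function values, so an extra step is needed to get the NA values the question asks for.

```r
# Shortened sample of the question's data, as a plain character vector
x <- c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
       "ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";")

# Replace the whole string with the contents of capture group 1
res <- gsub(".*Function=\"([^\"]*).*", "\\1", x)

# Empty captures are returned as "", so convert them to NA explicitly
res[res == ""] <- NA
res
```

Note that the pattern is written as a single string. In R, & is a logical operator and does not concatenate strings, which is why the attempt in the question fails with "operations are possible only for numeric, logical or complex types".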
Upvotes: 2