Reputation: 1686
I have a vector of strings that look like this:
G30(H).G3(M).G0(L).Replicate(1)
Iterating over c("H", "M", "L"), I would like to extract G30 (for "H"), G3 (for "M") and G0 (for "L").
My various attempts have me confused - the regex101.com debugger, e.g., indicates that (\w*)\(M\) works just fine, but transferring that to R fails ...
Upvotes: 0
Views: 203
Reputation: 5273
Using the stringi package and the outer() function:
library(stringi)
strings <- c(
"G30(H).G3(M).G0(L).Replicate(1)",
"G5(M).G11(L).G6(H).Replicate(9)",
"G10(M).G6(H).G8(M).Replicate(200)" # No "L", repeated "M"
)
targets <- c("H", "M", "L")
patterns <- paste0("\\w+(?=\\(", targets, "\\))")
matches <- outer(strings, patterns, FUN = stri_extract_first_regex)
colnames(matches) <- targets
matches
#      H     M     L
# [1,] "G30" "G3"  "G0"
# [2,] "G6"  "G5"  "G11"
# [3,] "G6"  "G10" NA
This ignores any instances of a target letter past the first, gives you an NA when the target is not found, and returns everything in a simple matrix. The regular expressions stored in patterns match the XX part of substrings like XX(Y), where Y is the target letter and XX is one or more word characters; the lookahead (?=...) checks for (Y) without including it in the match.
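For readers without stringi, roughly the same lookahead can be sketched in base R. This is only a sketch under assumptions, not part of the answer above: extract_code is a hypothetical helper name, and perl = TRUE is used because lookaheads need the PCRE engine.

```r
strings <- c(
  "G30(H).G3(M).G0(L).Replicate(1)",
  "G10(M).G6(H).G8(M).Replicate(200)"  # no "L", repeated "M"
)

# Hypothetical helper: first code preceding "(target)" in each string, else NA
extract_code <- function(x, target) {
  pattern <- paste0("\\w+(?=\\(", target, "\\))")
  m <- regexpr(pattern, x, perl = TRUE)   # -1 where there is no match
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)        # regmatches() drops non-matches
  out
}

# One column per target letter, one row per input string
res <- sapply(c("H", "M", "L"), extract_code, x = strings)
res
```

Note that every backslash of the regex is doubled inside the R string literal; a pattern copied verbatim from a regex tester will not parse as an R string.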
Upvotes: 2
Reputation: 11762
I am pretty sure there are better solutions, but this works...
jnk <- 'G30(H).G3(M).G0(L).Replicate(1)'
pattern <- '([^\\(]+)\\(H\\)\\.([^\\(]+)\\(M\\)\\.([^\\(]+)\\(L\\)\\.Replicate\\(\\d+\\)'
H <- sub(pattern, '\\1', jnk)
M <- sub(pattern, '\\2', jnk)
L <- sub(pattern, '\\3', jnk)
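One caveat with sub(): when the pattern does not match, it returns the input string unchanged rather than signalling a failure. A possible way around that, sketched here with base R's regexec() and regmatches() (to_codes is a hypothetical helper name, and the second input string is made up purely to show the non-matching case):

```r
jnk <- c("G30(H).G3(M).G0(L).Replicate(1)", "no match here")
pattern <- "([^\\(]+)\\(H\\)\\.([^\\(]+)\\(M\\)\\.([^\\(]+)\\(L\\)\\.Replicate\\(\\d+\\)"

# regexec() records the capture groups; regmatches() returns, per string,
# the full match followed by the three groups, or character(0) on no match
groups <- regmatches(jnk, regexec(pattern, jnk))

# Hypothetical helper: turn one regmatches() element into a named vector
to_codes <- function(g) {
  tags <- c("H", "M", "L")
  if (length(g) == 4) setNames(g[2:4], tags) else setNames(rep(NA_character_, 3), tags)
}

codes <- t(sapply(groups, to_codes))  # one row per input string
codes
```

All three codes come out of a single matching pass, and strings that do not fit the pattern yield a row of NA.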
EDIT:
Actually, I once found a very nice function, parse.one, which makes it possible to use named capture groups, much as you would with Python regular expressions.
Have a look at this:
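parse.one <- function(res, result) {
  # res: input strings; result: output of regexpr(..., perl = TRUE)
  m <- do.call(rbind, lapply(seq_along(res), function(i) {
    if (result[i] == -1) return("")  # no match for this string
    st <- attr(result, "capture.start")[i, ]
    substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
  }))
  # Column names come from the named groups in the pattern
  colnames(m) <- attr(result, "capture.names")
  m
}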
jnk <- 'G30(H).G3(M).G0(L).Replicate(1)'
pattern <- '(?<H>[^\\(]+)\\(H\\)\\.(?<M>[^\\(]+)\\(M\\)\\.(?<L>[^\\(]+)\\(L\\)\\.Replicate\\(\\d+\\)'
parse.one(jnk, regexpr(pattern, jnk, perl=TRUE))
Result looks like this:
> parse.one(jnk, regexpr(pattern, jnk, perl=TRUE))
     H     M    L
[1,] "G30" "G3" "G0"
Upvotes: 1
Reputation: 1975
If the codes (e.g., 'G30') preceding the tags (e.g., '(H).') can vary in letters or length, or the order of the tags in the string can change, you may want to try a more flexible solution based on regexpr().
aa <- paste("G30(H).G3(M).G0(L).Replicate(", 1:10, ")", sep = "")
my.tags <- c("H", "M", "L")
extr.data <- lapply(my.tags, function(tag) {
  pat <- paste("\\(", tag, "\\)\\.", sep = "")
  # Match an optional leading dot, the code, then "(tag)."
  pos <- regexpr(paste("(^|\\.)([[:alnum:]])*", pat, sep = ""), aa)
  # Drop the trailing "(tag)."; nchar(tag) also copes with multi-character tags
  out <- substr(aa, pos, pos + attr(pos, "match.length") - 4 - nchar(tag))
  gsub("(^\\.)", "", out)
})
names(extr.data) <- my.tags
extr.data
Upvotes: 1
Reputation: 4537
I'm going to assume that both the functions (G...) and their inputs are variable. This does assume that your functions start with a G and that each input is always a single uppercase letter.
parse = function(arb){
  tmp = stringi::stri_extract_all_regex(arb,"G.*?\\([A-Z]\\)")[[1]]
  unlist(lapply(lapply(tmp,strsplit,"\\)|\\("),function(x){
    output = x[[1]][1]
    names(output) = x[[1]][2]
    return(output)
  }))
}
This first parses out all the G functions with their inputs. Then, each of those is split into its function part and its input part. The result is then put into a character vector of functions named for their inputs.
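The same two steps can also be sketched in base R, without stringi. This is a sketch under the same assumptions as above (codes start with G, tags are a single uppercase letter); extract_named is a hypothetical name, chosen so that base::parse is not masked.

```r
# Hypothetical base-R variant: returns the codes, named by their tags
extract_named <- function(x) {
  # Step 1: pull out every "G...(X)" piece from a single string
  pairs <- regmatches(x, gregexpr("G[^(]*\\([A-Z]\\)", x))[[1]]
  # Step 2: split each piece into the code and the tag
  codes <- sub("\\(.*$", "", pairs)
  tags  <- sub("^.*\\(([A-Z])\\)$", "\\1", pairs)
  setNames(codes, tags)
}

res <- extract_named("G30(H).G3(M).G0(L).Replicate(1)")
res[c("H", "M", "L")]  # look codes up by tag, in whatever order is needed
```

Because the result is a named vector, iterating over c("H", "M", "L") as in the question reduces to indexing by name.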
parse("G30(H).G3(M).G0(L).Replicate(1)")
> H M L
"G30" "G3" "G0"
Or
parse("G35(L).G31(P).G02(K).Replicate(1)")
> L P K
"G35" "G31" "G02"
Upvotes: 1
Reputation: 11514
If the order is always the same, an alternative might be to split the strings. For instance:
library(stringr)  # provides str_split()
string <- "G30(H).G3(M).G0(L).Replicate(1)"
tmp <- str_split(string, "\\.")[[1]]
lapply(tmp[1:3], function(x) str_split(x, "\\(")[[1]][1])
[[1]]
[1] "G30"
[[2]]
[1] "G3"
[[3]]
[1] "G0"
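For a whole vector of strings, the same splitting idea might look like the following base-R sketch. The second input string is only an illustrative assumption, included to show that the result is positional: columns follow the order of the pieces in each string, not the tag letters.

```r
strings <- c(
  "G30(H).G3(M).G0(L).Replicate(1)",
  "G5(M).G11(L).G6(H).Replicate(9)"  # different tag order
)

# Split on literal dots, then keep the text before "(" in the first three pieces
parts <- strsplit(strings, ".", fixed = TRUE)
codes <- t(vapply(parts, function(p) sub("\\(.*$", "", p[1:3]), character(3)))
codes  # one row per string, one column per position
```

The second row comes back in M, L, H order, which is why this approach only suits inputs whose tag order is fixed.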
Upvotes: 1