balin

Reputation: 1686

R: Regex madness (stringi)

I have a vector of strings that look like this:

G30(H).G3(M).G0(L).Replicate(1)

Iterating over c("H", "M", "L"), I would like to extract G30 (for "H"), G3 (for "M") and G0 (for "L").

My various attempts have left me confused. The regex101.com debugger, for example, indicates that (\w*)\(M\) works just fine, but the same pattern fails when I transfer it to R ...

Upvotes: 0

Views: 203

Answers (5)

Nathan Werth

Reputation: 5273

Using the stringi package and the outer() function:

library(stringi)

strings <- c(
  "G30(H).G3(M).G0(L).Replicate(1)",
  "G5(M).G11(L).G6(H).Replicate(9)",
  "G10(M).G6(H).G8(M).Replicate(200)"  # No "L", repeated "M"
)
targets  <- c("H", "M", "L")
patterns <- paste0("\\w+(?=\\(", targets, "\\))")
matches  <- outer(strings, patterns, FUN = stri_extract_first_regex)
colnames(matches) <- targets
matches
#      H     M     L
# [1,] "G30" "G3"  "G0"
# [2,] "G6"  "G5"  "G11"
# [3,] "G6"  "G10" NA

This ignores any instances of a target letter past the first, gives you an NA when the target's not found, and returns everything in a simple matrix. The regular expressions stored in patterns match the XX part of substrings like XX(Y), where Y is the target letter and XX is one or more word characters. The (Y) part sits in a lookahead, so it must be present but is not included in the extracted text.
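
For readers without stringi, the same lookahead idea can be sketched in base R. This is only an illustration under my own assumptions, not part of the answer above: extract_code is a hypothetical helper name, and the sample strings are borrowed from the example. Base R needs perl = TRUE for the (?=...) lookahead to work.

```r
strings <- c(
  "G30(H).G3(M).G0(L).Replicate(1)",
  "G10(M).G6(H).G8(M).Replicate(200)"  # No "L", repeated "M"
)
targets <- c("H", "M", "L")

# Hypothetical helper: first code that precedes "(target)" in each string.
extract_code <- function(x, target) {
  pattern <- paste0("\\w+(?=\\(", target, "\\))")
  hit <- regexpr(pattern, x, perl = TRUE)  # lookahead requires perl = TRUE
  out <- rep(NA_character_, length(x))     # NA where the target is absent
  out[hit > 0] <- regmatches(x, hit)       # regmatches() returns matches only
  out
}

# One column per target, one row per input string.
result <- sapply(targets, extract_code, x = strings)
result
# row 1: "G30" "G3"  "G0"
# row 2: "G6"  "G10" NA
```

As with outer(), the result is a character matrix with one column per target letter.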

Upvotes: 2

drmariod

Reputation: 11762

I am pretty sure there are better solutions, but this works...

jnk <- 'G30(H).G3(M).G0(L).Replicate(1)'
# One capture group per tag; the whole string must follow the H, M, L order
pattern <- '([^\\(]+)\\(H\\)\\.([^\\(]+)\\(M\\)\\.([^\\(]+)\\(L\\)\\.Replicate\\(\\d+\\)'
H <- sub(pattern, '\\1', jnk)
M <- sub(pattern, '\\2', jnk)
L <- sub(pattern, '\\3', jnk)

EDIT:

Actually, I once came across a very nice function, parse.one, which lets you work with named capture groups, much like regular expressions in Python...

Have a look at this:

parse.one <- function(res, result) {
  m <- do.call(rbind, lapply(seq_along(res), function(i) {
    if(result[i] == -1) return("")
    st <- attr(result, "capture.start")[i, ]
    substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
  }))
  colnames(m) <- attr(result, "capture.names")
  m
}
jnk <- 'G30(H).G3(M).G0(L).Replicate(1)'
pattern <- '(?<H>[^\\(]+)\\(H\\)\\.(?<M>[^\\(]+)\\(M\\)\\.(?<L>[^\\(]+)\\(L\\)\\.Replicate\\(\\d+\\)'
parse.one(jnk, regexpr(pattern, jnk, perl=TRUE))

Result looks like this:

> parse.one(jnk, regexpr(pattern, jnk, perl=TRUE))
     H     M    L   
[1,] "G30" "G3" "G0"

Upvotes: 1

Damiano Fantini

Reputation: 1975

If the codes (e.g., 'G30') preceding the tags (e.g., '(H).') can vary in letters or length, or if the order of the tags in the string can change, you may want to try a more flexible solution based on regexpr().

aa <-paste("G30(H).G3(M).G0(L).Replicate(",1:10,")", sep="")
my.tags <- c("H","M", "L")

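extr.data <- lapply(my.tags, function(tag) {
  # Literal "(tag)." that follows each code
  pat <- paste("\\(", tag, "\\)\\.", sep = "")
  pos <- regexpr(paste("(^|\\.)([[:alnum:]])*", pat, sep = ""), aa)
  # Drop the trailing "(tag)." from the match: 3 + nchar(tag) characters
  out <- substr(aa, pos, pos + attributes(pos)$match.length - 4 - nchar(tag))
  gsub("(^\\.)", "", out)  # strip the leading "." when the match is not at the start
})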
names(extr.data) <- my.tags
extr.data

Upvotes: 1

Mark

Reputation: 4537

I'm going to assume that both the functions (G...) and their inputs can vary. The only fixed assumptions are that each function starts with a G and that each input is a single uppercase letter.

parse = function(arb){
  tmp = stringi::stri_extract_all_regex(arb,"G.*?\\([A-Z]\\)")[[1]]
  unlist(lapply(lapply(tmp,strsplit,"\\)|\\("),function(x){
    output = x[[1]][1]
    names(output) = x[[1]][2]
    return(output)
  }))
}

This first parses out all the G functions with their inputs. Then each of those is split into its function part and its input part. These are then put into a character vector of functions named for their inputs.

parse("G30(H).G3(M).G0(L).Replicate(1)")
>     H     M     L 
  "G30"  "G3"  "G0"

Or

parse("G35(L).G31(P).G02(K).Replicate(1)")
>     L     P     K 
  "G35" "G31" "G02" 

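A similar result can be sketched in base R alone. This is a variation under my own assumptions rather than the code above: parse_codes is a hypothetical name (picked so that base::parse() is not masked), and it accepts any run of word characters before the parenthesis instead of requiring a leading G.

```r
# Hypothetical helper: named character vector of codes, named by their tags.
parse_codes <- function(x) {
  # All "code(TAG)" pieces where TAG is a single uppercase letter;
  # "Replicate(1)" is skipped because "1" is not a letter.
  pieces <- regmatches(x, gregexpr("\\w+\\([A-Z]\\)", x))[[1]]
  codes <- sub("\\(.*$", "", pieces)                # text before "("
  tags  <- sub("^.*\\(([A-Z])\\)$", "\\1", pieces)  # letter inside "()"
  setNames(codes, tags)
}

parse_codes("G35(L).G31(P).G02(K).Replicate(1)")
# named vector: L = "G35", P = "G31", K = "G02"
```

Like the stringi version, this handles one string at a time; wrap it in lapply() for a vector of strings.
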
Upvotes: 1

coffeinjunky

Reputation: 11514

If the order is always the same, an alternative might be to split the strings. For instance:

library(stringr)  # provides str_split()

string <- "G30(H).G3(M).G0(L).Replicate(1)"
tmp <- str_split(string, "\\.")[[1]]
lapply(tmp[1:3], function(x) str_split(x, "\\(")[[1]][1])
[[1]]
[1] "G30"

[[2]]
[1] "G3"

[[3]]
[1] "G0"
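
The same splitting idea can also be sketched in base R for a whole vector at once. This is a rough sketch under the same assumption as above (every string has at least three dot-separated fields in a fixed order); the variable names fields and codes are only illustrative. Note that it returns codes by position, not by tag letter.

```r
strings <- c(
  "G30(H).G3(M).G0(L).Replicate(1)",
  "G5(M).G11(L).G6(H).Replicate(9)"
)

# Split each string on literal dots.
fields <- strsplit(strings, ".", fixed = TRUE)

# Keep the first three fields of each string and drop everything from "(" onward.
codes <- t(sapply(fields, function(f) sub("\\(.*$", "", f[1:3])))
codes
# row 1: "G30" "G3"  "G0"
# row 2: "G5"  "G11" "G6"
```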

Upvotes: 1
