Regex exercise in R

Question

I want to parse a string using R, and I'd like to get out a list of objects. Brackets, spaces and commas in the string dictate the structure of the final list:

1) each pair of brackets is separated from the next by a space, and the words in each pair of brackets have to form a new object of the list;
2) words in brackets are separated by commas and should form different elements of that listed object;
3) the same structure can also be found nested within a pair of brackets.

Here is an example of the string:

x <- "(K01596,K01610) (K01689) (K01834,K15633,K15634,K15635) (K00927) (K00134,K00150) (K01803) ((K01623,K01624,K11645) (K03841,K02446,K11532,K01086,K04041),K01622)"

The desired output should look like this:

list(c("K01596","K01610"), "K01689", c("K01834","K15633","K15634","K15635"), "K00927", c("K00134","K00150"), "K01803", list(list(c("K01623","K01624","K11645"), c("K03841","K02446","K11532","K01086","K04041")), "K01622"))

I managed to work out the parsing for case 1):

match <- gregexpr("\\((?>[^()]|(?R))*\\)", x, perl = TRUE)
x2 <- as.list(substring(x, match[[1]], match[[1]] + attr(match[[1]], "match.length") - 1))
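For readers unfamiliar with the recursive pattern, here is a minimal sketch of what the match in case 1) returns. It uses a shortened, hypothetical input and regmatches() in place of the substring() call; the names x_small, m and groups are introduced only for this illustration.

```r
# Shortened, hypothetical input with one nested group at the end
x_small <- "(K01596,K01610) (K01689) ((K01623,K01624) (K03841),K01622)"

# (?R) recurses into the whole pattern, so each match is one balanced
# top-level pair of brackets, including anything nested inside it
m <- gregexpr("\\((?>[^()]|(?R))*\\)", x_small, perl = TRUE)

# regmatches() extracts the matched substrings directly
groups <- regmatches(x_small, m)[[1]]
groups
# "(K01596,K01610)"  "(K01689)"  "((K01623,K01624) (K03841),K01622)"
```

The nested group comes back as a single string, which is why it still needs a second parsing step.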

Case 2) is also easy: I can just remove the brackets with gsub and split the words using strsplit. The problem is how to parse case 3), when I have a nested level like:

((K01623,K01624,K11645) (K03841,K02446,K11532,K01086,K04041),K01622)

and I have to get out a list element that is itself a list:

list(list(c("K01623","K01624","K11645"), c("K03841","K02446","K11532","K01086","K04041")), "K01622")
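For reference, the gsub and strsplit approach mentioned for case 2) might look roughly like the sketch below. It assumes a single flat group with no nesting; group, inner and words are hypothetical names used only here.

```r
# One flat (non-nested) group, taken from the example string
group <- "(K01834,K15633,K15634,K15635)"

# Drop the enclosing brackets
inner <- gsub("[()]", "", group)

# Split on commas; strsplit() returns a list, so take its first element
words <- strsplit(inner, ",", fixed = TRUE)[[1]]
words
# "K01834" "K15633" "K15634" "K15635"
```

Applied to the nested group, this would strip every bracket and lose the inner structure, which is the difficulty with case 3).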

Eric Watt · Accepted Answer

You can convert to JSON, and then use jsonlite to convert to a list. Once you have this, you can simplify, collapse, or reorganize your list however you like.

library(jsonlite)
library(stringr)

add_paren <- function(x){
  x <- str_sub(x, end = -2) #remove comma
  paste0("(", x, "), ") #add enclosing paren and return comma
} 
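# wrap the span from "((" through ")," in an extra pair of parens,
# so the nested groups end up in a list of their own
x <- str_replace_all(x, "\\(\\(.*\\)\\,", add_paren)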

# turn parens into JSON array brackets, then put commas between sibling arrays
x <- gsub("\\(", "\\[", x)
x <- gsub("\\)", "\\]", x)
x <- gsub("\\] \\[", "\\], \\[", x)

add_quote <- function(x) paste0('"', x, '"')

x <- str_replace_all(x, "K[0-9]*", add_quote)
x <- paste0("[", x, "]")

x2 <- fromJSON(x)

Resulting in:

dput(x2)

list(c("K01596", "K01610"), "K01689", c("K01834", "K15633", "K15634", 
"K15635"), "K00927", c("K00134", "K00150"), "K01803", list(list(
    c("K01623", "K01624", "K11645"), c("K03841", "K02446", "K11532", 
    "K01086", "K04041")), "K01622"))

str(x2)

List of 7
 $ : chr [1:2] "K01596" "K01610"
 $ : chr "K01689"
 $ : chr [1:4] "K01834" "K15633" "K15634" "K15635"
 $ : chr "K00927"
 $ : chr [1:2] "K00134" "K00150"
 $ : chr "K01803"
 $ :List of 2
  ..$ :List of 2
  .. ..$ : chr [1:3] "K01623" "K01624" "K11645"
  .. ..$ : chr [1:5] "K03841" "K02446" "K11532" "K01086" ...
  ..$ : chr "K01622"
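To make the conversion easier to follow, here is a rough base-R trace of the intermediate JSON string on a shortened, hypothetical input. It is a restatement of the steps above using sub() and gsub() with backreferences, not the exact code from the answer, and the names s and json are introduced only for this sketch.

```r
# Shortened, hypothetical input: one flat group and one nested group
s <- "(K01689) ((K01623,K01624) (K03841),K01622)"

# Step 1: wrap the nested "((...) (...))," span in an extra pair of parens
s <- sub("(\\(\\(.*\\)),", "(\\1), ", s)
# "(K01689) (((K01623,K01624) (K03841)), K01622)"

# Step 2: parens become JSON array brackets
s <- gsub("\\(", "[", s)
s <- gsub("\\)", "]", s)

# Step 3: sibling arrays separated by a space get a comma between them
s <- gsub("] [", "], [", s, fixed = TRUE)

# Step 4: quote every identifier, then wrap everything in an outer array
s <- gsub("(K[0-9]+)", "\"\\1\"", s)
json <- paste0("[", s, "]")
cat(json, "\n")
# [["K01689"], [[["K01623","K01624"], ["K03841"]], "K01622"]]
```

The extra parens added in step 1 are what make the two inner groups a list of their own, next to "K01622", once the string is read as JSON.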
