Luca Zoccarato

Reputation: 43

Regex exercise in R

I want to parse a string using R, and I'd like to get out a list of objects. Brackets, spaces and commas in the string dictate the structure of the final list:

  1. each pair of brackets is separated by a space, and the words in each pair of brackets have to form a new object of the list;

  2. words in brackets are separated by commas and should form different elements in each listed object;

  3. the mentioned structure can also be found nested within a pair of brackets.

Here is an example of the string:

x <- "(K01596,K01610) (K01689) (K01834,K15633,K15634,K15635) (K00927) (K00134,K00150) (K01803) ((K01623,K01624,K11645) (K03841,K02446,K11532,K01086,K04041),K01622)"

The desired output should look like this:

list(c("K01596","K01610"), "K01689", c("K01834","K15633","K15634","K15635"), "K00927", c("K00134","K00150"), "K01803", list(list(c("K01623","K01624","K11645"), c("K03841","K02446","K11532","K01086","K04041")), "K01622"))

I managed to work out the parsing for case 1):

match <- gregexpr("\\((?>[^()]|(?R))*\\)", x, perl = TRUE)
x2 <- as.list(substring(x, match[[1]], match[[1]] + attr(match[[1]], "match.length") - 1))

and case 2) is also easy, I can just remove the brackets with gsub and split the words using strsplit. The problem is how to parse case 3), when I have a nested level like:

((K01623,K01624,K11645) (K03841,K02446,K11532,K01086,K04041),K01622)

and I have to get out a listed object that is a list itself:

list(list(c("K01623","K01624","K11645"), c("K03841","K02446","K11532","K01086","K04041")), "K01622")
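
For reference, case 2) as described above can be sketched roughly as follows. This is only a minimal illustration on one flat group; the variable names `group`, `inner` and `words` are made up for the example.

```r
# One flat group, as extracted by the case 1) regex
group <- "(K01834,K15633,K15634,K15635)"

# Remove the enclosing brackets, then split the words on commas
inner <- gsub("[()]", "", group)
words <- strsplit(inner, ",", fixed = TRUE)[[1]]
```

This only works when the group contains no nested brackets, which is exactly why case 3) needs something more.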

Upvotes: 2

Views: 191

Answers (2)

Eric Watt

Reputation: 3240

You can convert to JSON, and then use jsonlite to convert to a list. Once you have this, you can simplify, collapse, or reorganize your list however you like.

library(jsonlite)
library(stringr)

add_paren <- function(x){
  x <- str_sub(x, end = -2) #remove comma
  paste0("(", x, "), ") #add enclosing paren and return comma
} 
x <- str_replace_all(x, "\\(\\(.*\\)\\,", add_paren)

x <- gsub("\\(", "\\[", x)
x <- gsub("\\)", "\\]", x)
x <- gsub("\\] \\[", "\\], \\[", x)

add_quote <- function(x) paste0('"', x, '"')

x <- str_replace_all(x, "K[0-9]*", add_quote)
x <- paste0("[", x, "]")

x2 <- fromJSON(x)

Resulting in:

dput(x2)

list(c("K01596", "K01610"), "K01689", c("K01834", "K15633", "K15634", 
"K15635"), "K00927", c("K00134", "K00150"), "K01803", list(list(
    c("K01623", "K01624", "K11645"), c("K03841", "K02446", "K11532", 
    "K01086", "K04041")), "K01622"))

str(x2)

List of 7
 $ : chr [1:2] "K01596" "K01610"
 $ : chr "K01689"
 $ : chr [1:4] "K01834" "K15633" "K15634" "K15635"
 $ : chr "K00927"
 $ : chr [1:2] "K00134" "K00150"
 $ : chr "K01803"
 $ :List of 2
  ..$ :List of 2
  .. ..$ : chr [1:3] "K01623" "K01624" "K11645"
  .. ..$ : chr [1:5] "K03841" "K02446" "K11532" "K01086" ...
  ..$ : chr "K01622"
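
As a rough illustration of "simplify, collapse, or reorganize", the sketch below flattens a nested list of the same shape with base R. The list `nested` is a shortened, hand-written stand-in for `x2`, and the names `all_ids` and `n_per_element` are only illustrative.

```r
# Shortened stand-in with the same kind of nesting as x2
nested <- list(
  c("K01596", "K01610"),
  "K01689",
  list(list(c("K01623", "K01624"), "K03841"), "K01622")
)

# Collapse every level into one flat character vector
all_ids <- unlist(nested)

# Count how many identifiers sit under each top-level element
n_per_element <- vapply(nested, function(el) length(unlist(el)), integer(1))
```

Whether flattening is appropriate depends on whether the nesting carries meaning for the later analysis.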

Upvotes: 1

AEF

Reputation: 5670

I suggest you apply the regex you already found for case 1) recursively to the input. That is, call your recursive function for each match found.

If no match is found you are in case 2) and can just use strsplit on the input. I have put together an example function below:

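constructList <- function(x) {

  # find every top-level bracketed group; (?R) recurses to match nested brackets
  matches <- gregexpr("\\((?>[^()]|(?R))*\\)", x, perl = TRUE)
  starts <- matches[[1]]
  lens <- attr(starts, "match.length")

  # no brackets left: case 2), so split the words on commas
  if (starts[1] == -1) {
    return(strsplit(x, ",")[[1]])
  }

  # strip the outer brackets of each match and recurse on its content;
  # text outside the matches (e.g. a trailing ",K01622") is not processed
  inner <- lapply(seq_along(starts), function(i)
    substr(x, starts[i] + 1, starts[i] + lens[i] - 2))
  lapply(inner, constructList)

}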

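The output is right for the flat groups. Note, however, that in the nested seventh element the trailing K01622 is dropped, because only the bracketed matches are recursed into: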

constructList(x)
[[1]]
[1] "K01596" "K01610"

[[2]]
[1] "K01689"

[[3]]
[1] "K01834" "K15633" "K15634" "K15635"

[[4]]
[1] "K00927"

[[5]]
[1] "K00134" "K00150"

[[6]]
[1] "K01803"

[[7]]
[[7]][[1]]
[1] "K01623" "K01624" "K11645"

[[7]][[2]]
[1] "K03841" "K02446" "K11532" "K01086" "K04041"
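
The printed result lacks "K01622" in element 7, so it does not quite match the desired output from the question. One possible way to keep words that sit next to nested groups is to split at bracket depth zero instead of relying on the regex matches alone. The sketch below is an alternative approach, not the function above. The helper names `split_top_level`, `parse_group` and `parse_sequence` are hypothetical. It assumes well-formed input with balanced brackets, single spaces between groups, and commas between elements.

```r
# Split `s` on the single character `sep`, ignoring separators inside brackets
split_top_level <- function(s, sep) {
  chars <- strsplit(s, "")[[1]]
  depth <- cumsum(chars == "(") - cumsum(chars == ")")
  cuts <- which(chars == sep & depth == 0)
  substring(s, c(1, cuts + 1), c(cuts - 1, length(chars)))
}

# A group "(a,b,...)": a character vector if flat, otherwise a list
parse_group <- function(g) {
  inner <- substr(g, 2, nchar(g) - 1)
  elements <- split_top_level(inner, ",")
  nested <- grepl("(", elements, fixed = TRUE)
  if (!any(nested)) {
    return(elements)
  }
  lapply(seq_along(elements), function(i) {
    if (nested[i]) parse_sequence(elements[i]) else elements[i]
  })
}

# A sequence of space-separated groups becomes a list of parsed groups
parse_sequence <- function(s) {
  lapply(split_top_level(s, " "), parse_group)
}

x <- "(K01596,K01610) (K01689) (K01834,K15633,K15634,K15635) (K00927) (K00134,K00150) (K01803) ((K01623,K01624,K11645) (K03841,K02446,K11532,K01086,K04041),K01622)"
parsed <- parse_sequence(x)
```

Here a comma inside a group separates elements, and an element that itself contains brackets is treated as a nested sequence, which is how the seventh element ends up as a list of two sub-groups followed by "K01622".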

Upvotes: 0
