Reputation: 43
I want to parse a string in R and get a list of objects out of it. Brackets, spaces and commas in the string dictate the structure of the final list:
1) pairs of brackets are separated by spaces, and the words in each pair of brackets have to form a new object of the list;
2) words within brackets are separated by commas and should form the different elements of that object;
3) the same structure can also be found nested within a pair of brackets.
Here is an example of the string:
x <- "(K01596,K01610) (K01689) (K01834,K15633,K15634,K15635) (K00927) (K00134,K00150) (K01803) ((K01623,K01624,K11645) (K03841,K02446,K11532,K01086,K04041),K01622)"
The desired output should look like this:
list(c("K01596","K01610"), "K01689", c("K01834","K15633","K15634","K15635"), "K00927", c("K00134","K00150"), "K01803", list(list(c("K01623","K01624","K11645"), c("K03841","K02446","K11532","K01086","K04041")), "K01622"))
I managed to work out the parsing for case 1):
match <- gregexpr("\\((?>[^()]|(?R))*\\)", x, perl = TRUE)
x2 <- as.list(substring(x, match[[1]], match[[1]] + attr(match[[1]], "match.length") - 1))
Case 2) is also easy: I can just remove the brackets with gsub and split the words using strsplit. The problem is how to parse case 3), when I have a nested level like:
((K01623,K01624,K11645) (K03841,K02446,K11532,K01086,K04041),K01622)
and I need to get a list element that is itself a list:
list(list(c("K01623","K01624","K11645"), c("K03841","K02446","K11532","K01086","K04041")), "K01622")
Upvotes: 2
Views: 191
Reputation: 3240
You can convert to JSON, and then use jsonlite to convert to a list. Once you have this, you can simplify, collapse, or reorganize your list however you like.
library(jsonlite)
library(stringr)
add_paren <- function(x) {
  x <- str_sub(x, end = -2)  # remove the trailing comma
  paste0("(", x, "), ")      # wrap in an extra pair of parentheses and put the comma back
}
x <- str_replace_all(x, "\\(\\(.*\\)\\,", add_paren)
x <- gsub("\\(", "\\[", x)
x <- gsub("\\)", "\\]", x)
x <- gsub("\\] \\[", "\\], \\[", x)
add_quote <- function(x) paste0('"', x, '"')
x <- str_replace_all(x, "K[0-9]*", add_quote)
x <- paste0("[", x, "]")
x2 <- fromJSON(x)
Resulting in:
dput(x2)
list(c("K01596", "K01610"), "K01689", c("K01834", "K15633", "K15634",
"K15635"), "K00927", c("K00134", "K00150"), "K01803", list(list(
c("K01623", "K01624", "K11645"), c("K03841", "K02446", "K11532",
"K01086", "K04041")), "K01622"))
str(x2)
List of 7
$ : chr [1:2] "K01596" "K01610"
$ : chr "K01689"
$ : chr [1:4] "K01834" "K15633" "K15634" "K15635"
$ : chr "K00927"
$ : chr [1:2] "K00134" "K00150"
$ : chr "K01803"
$ :List of 2
..$ :List of 2
.. ..$ : chr [1:3] "K01623" "K01624" "K11645"
.. ..$ : chr [1:5] "K03841" "K02446" "K11532" "K01086" ...
..$ : chr "K01622"
Upvotes: 1
Reputation: 5670
I suggest you apply the regex you already found for case 1) recursively to the input. That is, call your recursive function for each match found.
If no match is found you are in case 2) and can just use strsplit on the input. I have put together an example function below:
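constructList <- function(x) {
  # Find every outermost bracketed group (recursive PCRE pattern)
  matches <- gregexpr("\\((?>[^()]|(?R))*\\)", x, perl = TRUE)
  if (matches[[1]][1] == -1) {
    # No brackets left: case 2), split the words on commas
    return(strsplit(x, ",")[[1]])
  }
  starts <- matches[[1]]
  lens <- attr(matches[[1]], "match.length")
  # Strip the enclosing brackets of each group and recurse into its content
  lapply(substring(x, starts + 1, starts + lens - 2), constructList)
}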
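The output matches for the flat groups, but note that in the nested group [[7]] the trailing K01622 is dropped, because any text outside the matched brackets is ignored once a match is found: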
constructList(x)
[[1]]
[1] "K01596" "K01610"
[[2]]
[1] "K01689"
[[3]]
[1] "K01834" "K15633" "K15634" "K15635"
[[4]]
[1] "K00927"
[[5]]
[1] "K00134" "K00150"
[[6]]
[1] "K01803"
[[7]]
[[7]][[1]]
[1] "K01623" "K01624" "K11645"
[[7]][[2]]
[1] "K03841" "K02446" "K11532" "K01086" "K04041"
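In the output above, [[7]] holds only the two inner vectors and K01622 is missing. Below is a minimal sketch of a variant that keeps words sitting next to nested groups. It replaces the recursive regex with a bracket-depth scan, so it is a different technique from the function above. The helper names split_top, parse_sequence and parse_group are made up for illustration, and the sketch assumes well-formed input in which groups are separated by single spaces and items by commas.

```r
# Split `s` on the single character `sep`, ignoring separators inside brackets.
split_top <- function(s, sep) {
  chars <- strsplit(s, "")[[1]]
  # Running bracket depth: +1 at "(", -1 at ")"; 0 means outside all brackets
  depth <- cumsum((chars == "(") - (chars == ")"))
  cut <- which(chars == sep & depth == 0)
  substring(s, c(1, cut + 1), c(cut - 1, length(chars)))
}

# A sequence is one or more bracketed groups separated by spaces.
parse_sequence <- function(s) {
  lapply(split_top(s, " "), parse_group)
}

# A group is "(item,item,...)"; an item is a word or a nested sequence.
parse_group <- function(g) {
  inner <- substr(g, 2, nchar(g) - 1)   # strip the enclosing brackets
  items <- split_top(inner, ",")
  nested <- startsWith(items, "(")
  if (!any(nested)) return(items)       # only plain words: a character vector
  # Mixed content: words stay as strings, bracketed items are parsed recursively
  lapply(seq_along(items), function(i) {
    if (nested[i]) parse_sequence(items[i]) else items[i]
  })
}

# Small example with one nested group followed by a word
res <- parse_sequence("(A,B) ((C,D) (E),F)")
str(res)
```

Calling parse_sequence(x) on the string from the question should reproduce the desired list, including list(list(...), "K01622") for the last group. One consequence of this design: a lone nested group, such as the (a,b) in "((a,b),c)", is returned as a one-element list rather than a bare vector.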
Upvotes: 0