Noozer Yame

Reputation: 1

R: parse nested parentheses with specific text

I'm relatively new to R programming, but have a specific problem concerning the extraction of text from a syntactically parsed historical language corpus. The problem should be easy to solve, but I just can't get my head around it. My question is basically a more specific variation of this one: R: parse nested parentheses

I would like to parse nested parentheses in R. Here is an example of some data:

(sometext(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))sometext)

From this string I would like to extract all (potentially nested) substrings that begin with "NP", so the result should be

(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))

(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))

(NPR_REYNARD)

(NP-PRN(D_THE)(N_FOXE))

Any help would be much appreciated!

Upvotes: 0

Views: 167

Answers (1)

MrFlick

Reputation: 206197

This probably isn't the most efficient approach, but here's a function that extracts the "tokens", i.e. the substrings between matched parentheses.

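find_tokens <- function(s) {
  stopifnot(length(s)==1)        # works on a single string only
  mm <- gregexpr("[)()]", s)     # positions of every parenthesis
  stack <- numeric()             # positions of "(" not yet closed
  starts <- numeric()            # start position of each completed token
  stops <- numeric()             # end position of each completed token
  Map(function(i, v) {
    if(v=="(") {
      stack <<- c(stack, i)      # push the opening position
    } else if (v==")") {
      # pair this ")" with the most recent unmatched "(" and pop it
      starts <<- c(starts, tail(stack, 1))
      stops <<- c(stops, i)
      stack <<- stack[-length(stack)]
    }
  }, mm[[1]], regmatches(s, mm)[[1]])
  # tokens are recorded in closing order; reverse so the outermost comes first
  rev(substring(s, starts, stops))
}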

This will extract everything. If you want to keep just the values that start with "(NP", you can grep the result (here the input string is assumed to be stored in s):

grep("^\\(NP", find_tokens(s), value=TRUE)
# [1] "(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))"
# [2] "(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))"                                 
# [3] "(NP-PRN(D_THE)(N_FOXE))"                                                  
# [4] "(NPR_REYNARD)"  
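For comparison, here is a minimal alternative sketch that avoids the explicit stack by tracking nesting depth with cumsum. It is not part of the original answer: the helper name balanced_substrings is made up for illustration, and the code assumes the parentheses in the input are balanced. It also spells out the assignment to s, using the sample string from the question.

```r
# Sample string from the question, stored in `s`.
s <- "(sometext(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))sometext)"

# Hypothetical helper (not from the original answer): returns every balanced
# "(...)" substring of a single string, ordered by opening parenthesis.
# Assumes the parentheses in `s` are balanced.
balanced_substrings <- function(s) {
  stopifnot(length(s) == 1)
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  # Nesting depth after each character: +1 at "(", -1 at ")".
  depth <- cumsum((chars == "(") - (chars == ")"))
  opens <- which(chars == "(")
  closes <- which(chars == ")")
  # The partner of a "(" at depth d is the first later ")" that brings
  # the depth back down to d - 1.
  ends <- vapply(opens, function(i) {
    closes[closes > i & depth[closes] == depth[i] - 1][1]
  }, integer(1))
  substring(s, opens, ends)
}

# Keep only the substrings whose label starts with "NP".
np <- grep("^\\(NP", balanced_substrings(s), value = TRUE)
print(np)
```

Because the substrings come back in order of their opening parenthesis, the filtered result lists (NPR_REYNARD) before (NP-PRN(D_THE)(N_FOXE)), which is the order given in the question.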

Here's another possible implementation of find_tokens that might be more efficient and that better supports multiple strings, returning the results as a list with one element per input string.

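find_tokens <- function(s) {
  mm <- gregexpr("[)()]", s)     # parenthesis positions, one vector per string
  vv <- regmatches(s, mm)        # the matching "(" / ")" characters
  extr <- function(x, mm, vv) {
    open_i <- 0                  # number of "(" seen so far
    shut_i <- 0                  # current height of the stack in `shut`
    open <- numeric(length(vv)/2)   # start position of each token
    shut <- numeric(length(vv)/2)   # stack of indices into `open`
    close <- numeric(length(vv)/2)  # end position of each token
    for(i in seq_along(mm)) {
      if (vv[i]=="(") {
        open_i <- open_i + 1
        shut_i <- shut_i + 1
        open[open_i] <- mm[i]
        shut[shut_i] <- open_i
      } else if (vv[i]==")") {
        # close the most recently opened, still unmatched token
        close[shut[shut_i]] <- mm[i]
        shut_i <- shut_i - 1
      }
    }
    substring(x, open, close)    # tokens in order of their opening "("
  }
  unname(Map(extr, s, mm, vv))
}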

and then you would use

lapply(find_tokens(s), function(x) grep("^\\(NP", x, value=TRUE))
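Note that the pattern "^\\(NP" also keeps word-level tags such as NPR, which is what the question asks for. If only phrase labels like NP, NP-SBJ or NP-PRN were wanted, a narrower pattern could be used instead. The sketch below illustrates the difference on made-up token lists shaped like the output of the list-returning find_tokens; the narrower pattern is an assumption about the tag format (NP optionally followed by a dash and uppercase letters), not something stated in the question.

```r
# Hypothetical token lists, shaped like the output of the list-returning
# find_tokens() for two input strings (values made up for illustration).
tokens <- list(
  c("(NP-SBJ(D_THE)(N_FOXE))", "(D_THE)", "(N_FOXE)"),
  c("(PP(P_OF)(NP(NPR_REYNARD)))", "(P_OF)", "(NP(NPR_REYNARD))", "(NPR_REYNARD)")
)

# Broad filter from the answer: any label starting with "NP", including NPR.
broad <- lapply(tokens, function(x) grep("^\\(NP", x, value = TRUE))

# Narrower filter (assumed tag format): the label is exactly NP, optionally
# followed by a dash tag such as -SBJ or -PRN, then a nested "(".
narrow <- lapply(tokens, function(x) grep("^\\(NP(-[A-Z]+)?\\(", x, value = TRUE))

print(broad)
print(narrow)
```

With the narrower pattern, (NPR_REYNARD) is dropped while (NP(NPR_REYNARD)) and (NP-SBJ(D_THE)(N_FOXE)) are kept.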

Upvotes: 1
