Reputation: 1
I'm relatively new to R programming, but have a specific problem concerning the extraction of text from a syntactically parsed historical language corpus. The problem should be easy to solve, but I just can't get my head around it. My question is basically a more specific variation of this one: R: parse nested parentheses
I would like to parse nested parentheses in R. Here is an example of some data:
(sometext(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))sometext)
From this string I would like to extract all (potentially nested) substrings that begin with "NP", so the result should be
(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))
(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))
(NPR_REYNARD)
(NP-PRN(D_THE)(N_FOXE))
Any help would be much appreciated!
Upvotes: 0
Views: 167
Reputation: 206197
This probably isn't the most efficient, but here's a function that extracts the "tokens", i.e. the strings between matched parentheses.
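    find_tokens <- function(s) {
        stopifnot(length(s) == 1)
        # positions of every parenthesis in s
        mm <- gregexpr("[)()]", s)
        stack <- numeric()   # positions of "(" not yet closed
        starts <- numeric()  # start position of each completed token
        stops <- numeric()   # end position of each completed token
        Map(function(i, v) {
            if (v == "(") {
                stack <<- c(stack, i)
            } else if (v == ")") {
                # pair this ")" with the most recent unclosed "("
                starts <<- c(starts, tail(stack, 1))
                stops <<- c(stops, i)
                stack <<- stack[-length(stack)]
            }
        }, mm[[1]], regmatches(s, mm)[[1]])
        # tokens are recorded in closing order; reverse them so that
        # enclosing tokens precede the tokens they contain
        rev(substring(s, starts, stops))
    }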
This extracts every parenthesized token. To keep just the ones that start with "(NP", filter the result with grep:
    s <- "(sometext(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))sometext)"
    grep("^\\(NP", find_tokens(s), value=TRUE)
    # [1] "(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))"
    # [2] "(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))"
    # [3] "(NP-PRN(D_THE)(N_FOXE))"
    # [4] "(NPR_REYNARD)"
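Because find_tokens pairs parentheses with a stack, it implicitly assumes the input is balanced; a stray ")" or an unclosed "(" can give misaligned tokens. A minimal sketch of a pre-check follows. The name is_balanced is hypothetical and not part of the answer above.

```r
# Hypothetical helper: TRUE if every ")" closes an earlier "(" and
# no "(" is left open at the end of the string.
is_balanced <- function(s) {
  stopifnot(length(s) == 1)
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  # running nesting depth: +1 on "(", -1 on ")", 0 for any other character
  depth <- cumsum((chars == "(") - (chars == ")"))
  # the depth must never go negative and must end at zero
  length(depth) == 0 || (all(depth >= 0) && depth[length(depth)] == 0)
}

is_balanced("(NP(D_THE)(N_FOXE))")   # TRUE
is_balanced("(NP(D_THE)(N_FOXE)")    # FALSE: one "(" is never closed
```

Calling this before find_tokens lets you skip or flag malformed corpus lines instead of getting wrong substrings back.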
Here's another possible implementation of find_tokens that might be more efficient and that better supports multiple strings, returning the results as a list.
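    find_tokens <- function(s) {
        mm <- gregexpr("[)()]", s)
        vv <- regmatches(s, mm)
        # extract the tokens of one string x, given the positions (mm)
        # and characters (vv) of its parentheses
        extr <- function(x, mm, vv) {
            # no parentheses in this string: nothing to extract
            if (length(vv) == 0) return(character(0))
            open_i <- 0                      # number of "(" seen so far
            shut_i <- 0                      # current height of the stack
            open <- numeric(length(vv)/2)    # start position of each token
            shut <- numeric(length(vv)/2)    # stack of indices into open
            close <- numeric(length(vv)/2)   # end position of each token
            for(i in seq_along(mm)) {
                if (vv[i]=="(") {
                    open_i <- open_i + 1
                    shut_i <- shut_i + 1
                    open[open_i] <- mm[i]
                    shut[shut_i] <- open_i
                } else if (vv[i]==")") {
                    # close the most recently opened token
                    close[shut[shut_i]] <- mm[i]
                    shut_i <- shut_i - 1
                }
            }
            substring(x, open, close)
        }
        unname(Map(extr, s, mm, vv))
    }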
and then you would use
    lapply(find_tokens(s), function(x) grep("^\\(NP", x, value=TRUE))
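For comparison, here is a sketch of a different technique, not the method used in this answer: track the nesting depth with cumsum and pair each "(" with the first later position where the depth drops below it. The name extract_by_depth and its prefix argument are made up for illustration, and the sketch assumes balanced parentheses. It returns the tokens in order of their opening parenthesis, which is the order listed in the question.

```r
# Hypothetical helper: every balanced "(...)" substring of the single
# string s whose label starts with prefix. Assumes balanced parentheses.
extract_by_depth <- function(s, prefix = "NP") {
  stopifnot(length(s) == 1)
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  opens <- which(chars == "(")
  if (length(opens) == 0) return(character(0))
  # nesting depth after each character: +1 on "(", -1 on ")"
  depth <- cumsum((chars == "(") - (chars == ")"))
  # the matching ")" is the first later position where the depth
  # falls below the depth reached at the opening parenthesis
  closes <- vapply(opens, function(p) {
    lower <- which(depth < depth[p])
    lower[lower > p][1]
  }, integer(1))
  tokens <- substring(s, opens, closes)
  tokens[startsWith(tokens, paste0("(", prefix))]
}

s <- "(sometext(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))sometext)"
extract_by_depth(s)
# for several strings, apply it to each element:
lapply(c(s, "(x(NP(A)(NPR_B))y)"), extract_by_depth)
```

The search for each closing position rescans the depth vector, so this is quadratic in the number of parentheses; for long strings the stack-based versions above are likely the better choice.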
Upvotes: 1