Reputation: 49
I want to run a regex search using the quanteda and stringr libraries, but I keep getting errors. My goal is to match patterns of the form (VP (V.. ...) using the regex \(VP\h+\(V\w*\h+\w*\). Here is an MWE:
library(quanteda)
library(dplyr)
library(stringr)
text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"
kwic_regex <- kwic(
# define text
text,
# define search pattern
"\(VP\h+\(V\w*\h+\w*\)",
window = 20,
# define valuetype
valuetype = "regex") %>%
# make it a data frame
as.data.frame()
And this is the error message:
Error: '\(' is an unrecognized escape in character string starting ""\("
I find it puzzling because the regex should be correct (cf. https://regex101.com/r/3hbZ0R/1). I've also tried escaping the escapes (e.g., \\() to no avail. I would really appreciate any ideas on how to improve my query.
Upvotes: 1
Views: 74
Reputation: 14902
To get this to work, you have to understand how tokenisation works in quanteda and how pattern works with multi-token sequences.
First, tokenisation (by default) removes the whitespace that you are including in your regex pattern. For your pattern, though, the whitespace is not the important part; the sequence of tokens is. Also, the current default tokeniser will split parentheses from the POS tags and text. So you want to control this by using a different tokeniser that splits on (and removes) whitespace. See ?tokens and ?pattern.
Second, to match sequences of tokens, you need to wrap your multi-token pattern in phrase(), which will split it on whitespace. See ?phrase.
So this will work (and very efficiently):
library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"
toks <- tokens(txt, what = "fasterword", remove_separators = TRUE)
print(toks, -1, -1)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "(ROOT" "(S" "(NP" "(PRP" "It))"
#> [6] "(VP" "(VBZ" "is)" "(RB" "not)"
#> [11] "(VP" "(VBN" "transmitted)" "(PP" "(IN"
#> [16] "from)" "(:" ":)" "(S" "(VP"
#> [21] "(VBG" "giving)" "(NP" "(NP" "(NP"
#> [26] "(NP" "(NML" "(NN" "blood)"
kwic(toks, phrase("\\(VP \\(V \\)"), window = 3, valuetype = "regex")
#> Keyword-in-context with 3 matches.
#> [text1, 6:8] (NP (PRP It)) | (VP (VBZ is) | (RB not) (VP
#> [text1, 11:13] is) (RB not) | (VP (VBN transmitted) | (PP (IN from)
#> [text1, 20:22] (: :) (S | (VP (VBG giving) | (NP (NP (NP
Created on 2023-07-03 with reprex v2.0.2
Note how you do need to double-escape the reserved characters in the regular expression pattern.
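To make the "sequence of tokens" idea concrete, here is a rough base-R sketch of what a three-element phrase pattern amounts to: split the text on whitespace, then look for positions where three consecutive tokens each match their own regex. This is only an illustration of the idea, not how quanteda implements kwic(); the variable names (toks, starts, hits) are made up for the example.

```r
txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

# Split on whitespace only, so parentheses stay attached to the tags and words
toks <- strsplit(txt, "\\s+")[[1]]

# Candidate start positions for a window of three consecutive tokens
starts <- seq_len(length(toks) - 2)

# Keep the starts where token i, i + 1 and i + 2 each match their sub-pattern
hits <- starts[
  grepl("\\(VP", toks[starts]) &
    grepl("\\(V", toks[starts + 1]) &
    grepl("\\)", toks[starts + 2])
]

hits
#> [1]  6 11 20

# Show the matched three-token sequences
vapply(hits, function(i) paste(toks[i:(i + 2)], collapse = " "), character(1))
#> [1] "(VP (VBZ is)"          "(VP (VBN transmitted)" "(VP (VBG giving)"
```

The start positions line up with the 6:8, 11:13 and 20:22 spans reported by kwic() above.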
Upvotes: 1
Reputation: 49
I've identified the problem: apparently, the kwic() function no longer matches patterns containing spaces as multi-word sequences (cf. "kwic in quanteda (R) does not identify more than one word in regex pattern"). I've also tokenised the text with the tokens() function before running the search and wrapped the expression in phrase().
Here is the corrected code:
library(quanteda)
library(dplyr)
library(stringr)
library(tidyverse)
rm(list=ls(all=T))
text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"
text2 <- tokens(text)
kwic_regex <- kwic(
text2,
phrase("\\( VP \\V\\w* \\w* \\w* \\)"),
window = 10,
separator = " ",
case_insensitive = FALSE,
valuetype = "regex") %>%
as.data.frame(); kwic_regex
Output:
docname from to pre keyword
1 text1 12 17 ROOT ( S ( NP ( PRP It ) ) ( VP ( VBZ is )
2 text1 22 27 ( VP ( VBZ is ) ( RB not ) ( VP ( VBN transmitted )
3 text1 40 45 ( IN from ) ( : : ) ( S ( VP ( VBG giving )
post pattern
1 ( RB not ) ( VP ( VBN transmitted ) \\( VP \\V\\w* \\w* \\w* \\)
2 ( PP ( IN from ) ( : : ) \\( VP \\V\\w* \\w* \\w* \\)
3 ( NP ( NP ( NP ( NP ( NML \\( VP \\V\\w* \\w* \\w* \\)
Upvotes: 0
Reputation: 2759
R parses a double-quoted string by first looking for recognised escape sequences and substituting the character each one stands for. Since an escaped backslash (\\) always resolves to a single literal backslash, doubling every backslash means the string resolves, after parsing, to the raw regex that is passed to the function.
So your double-quoted string should be "\\(VP\\h+\\(V\\w*\\h+\\w*\\)", which gets parsed to \(VP\h+\(V\w*\h+\w*\) and is then handed to the stringr function.
library(stringr)
str_match_all(
"(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)",
"\\(VP\\h+\\(V\\w*\\h+\\w*\\)" )
https://www.mycompiler.io/view/BjjkPXQUNpT
Output
[[1]]
[,1]
[1,] "(VP (VBZ is)"
[2,] "(VP (VBN transmitted)"
[3,] "(VP (VBG giving)"
Each language enforces different parsing rules. Some will throw an error if an unknown escape sequence such as \( is encountered; others will simply strip the backslash, leaving (, and not tell you about it.
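As a side note, if doubling every backslash gets hard to read, R 4.0.0 and later also accept raw string literals, where backslashes are taken literally. The following is a small sketch, assuming R >= 4.0.0; it uses base R with perl = TRUE (needed for \h) rather than stringr, and the variable names are only for illustration.

```r
txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

# Ordinary string literal: every regex backslash has to be doubled
pattern_escaped <- "\\(VP\\h+\\(V\\w*\\h+\\w*\\)"

# Raw string literal (R >= 4.0.0): written exactly as the regex reads
pattern_raw <- r"[\(VP\h+\(V\w*\h+\w*\)]"

# Both literals produce the same character string
identical(pattern_escaped, pattern_raw)
#> [1] TRUE

# writeLines() shows the string as the regex engine will receive it
writeLines(pattern_escaped)
#> \(VP\h+\(V\w*\h+\w*\)

# perl = TRUE selects PCRE, which understands \h (horizontal whitespace)
matches <- regmatches(txt, gregexpr(pattern_raw, txt, perl = TRUE))[[1]]
matches
#> [1] "(VP (VBZ is)"          "(VP (VBN transmitted)" "(VP (VBG giving)"
```

The square-bracket delimiter form r"[...]" is used here so that the parentheses inside the regex cannot be confused with the raw string's own delimiters.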
Upvotes: 1