Serendipity

Reputation: 49

Quanteda and stringr in R: (Correct) regex cannot be parsed

I want to run a regex search using the quanteda and stringr libraries, but I keep receiving errors. My goal is to match patterns of the form (VP (V.. ...) using the regex \(VP\h+\(V\w*\h+\w*\). Here is an MWE:

library(quanteda)
library(dplyr)
library(stringr)

text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"


kwic_regex <- kwic(
  # define text
  text, 
  # define search pattern
  "\(VP\h+\(V\w*\h+\w*\)", 
  window = 20, 
  # define valuetype
  valuetype = "regex") %>%
  # make it a data frame
  as.data.frame()

And this is the error message:

Error: '\(' is an unrecognized escape in character string starting ""\("

I find it puzzling because the regex should be correct (cf. https://regex101.com/r/3hbZ0R/1). I've also tried escaping the escapes (e.g., \\() to no avail. I would really appreciate any ideas on how to improve my query.

Upvotes: 1

Views: 74

Answers (3)

Ken Benoit

Reputation: 14902

To get this to work, you have to understand how tokenisation works in quanteda and how pattern works with multi-token sequences.

First, tokenisation (by default) removes the whitespace that you are including in your regex pattern. But for your pattern, this is not the important part; rather, the sequence is the important part. Also, the current default tokeniser will split parentheses from the POS tags and text. So you want to control this by using a different tokeniser that splits on (and removes) whitespace. See ?tokens and ?pattern.

Second, to match sequences of tokens, you need to wrap your multi-token pattern in phrase(), which will split it on whitespace. See ?phrase.

So this will work (and very efficiently):

library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.

txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

toks <- tokens(txt, what = "fasterword", remove_separators = TRUE)
print(toks, -1, -1)
#> Tokens consisting of 1 document.
#> text1 :
#>  [1] "(ROOT"        "(S"           "(NP"          "(PRP"         "It))"        
#>  [6] "(VP"          "(VBZ"         "is)"          "(RB"          "not)"        
#> [11] "(VP"          "(VBN"         "transmitted)" "(PP"          "(IN"         
#> [16] "from)"        "(:"           ":)"           "(S"           "(VP"         
#> [21] "(VBG"         "giving)"      "(NP"          "(NP"          "(NP"         
#> [26] "(NP"          "(NML"         "(NN"          "blood)"

kwic(toks, phrase("\\(VP \\(V \\)"), window = 3, valuetype = "regex")
#> Keyword-in-context with 3 matches.                                                                     
#>    [text1, 6:8] (NP (PRP It)) |     (VP (VBZ is)      | (RB not) (VP 
#>  [text1, 11:13]  is) (RB not) | (VP (VBN transmitted) | (PP (IN from)
#>  [text1, 20:22]       (::) (S |   (VP (VBG giving)    | (NP (NP (NP

Created on 2023-07-03 with reprex v2.0.2

Note how you do need to double-escape the reserved characters in the regular expression pattern.
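As a possible refinement (a sketch, not part of the reprex above): with valuetype = "regex", each element of the phrase matches anywhere inside a token, so "\\(V" would also match the token "(VP". Assuming Penn Treebank tags, where verb tags begin with VB, anchoring each element narrows the match. The anchored pattern and the name kw are illustrative assumptions.

```r
library("quanteda")

txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

# Split on whitespace only, so brackets stay attached to tags and words
toks <- tokens(txt, what = "fasterword", remove_separators = TRUE)

# Anchored variant (an assumption, not the pattern used in the answer):
#   ^\\(VP$       the token is exactly "(VP"
#   ^\\(VB\\w*$   a verb tag such as "(VBZ", but not "(VP"
#   \\)$          any token that ends in ")"
kw <- kwic(toks, phrase("^\\(VP$ ^\\(VB\\w*$ \\)$"),
           window = 3, valuetype = "regex")

# Inspect the matched token sequences
as.data.frame(kw)$keyword
```

On the sample text this should select the same three sequences as the unanchored pattern; the difference only shows up on trees where a (VP is followed directly by another (VP.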


Upvotes: 1

Serendipity

Reputation: 49

I've identified the problem: apparently, kwic() no longer matches a pattern containing spaces unless it is wrapped in phrase() (cf. kwic in quanteda (R) does not identify more than one word in regex pattern). I now tokenise the text with tokens() before running the search and wrap the expression in phrase().

Here is the corrected code:

library(quanteda)
library(dplyr)
library(stringr)
library(tidyverse)

rm(list = ls(all.names = TRUE))

text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

text2 <- tokens(text)


kwic_regex <- kwic(
  text2, 
  phrase("\\( VP \\V\\w* \\w* \\w* \\)"), 
  window = 10, 
  separator = " ",
  case_insensitive = FALSE,
  valuetype = "regex") %>%
  as.data.frame()

kwic_regex

Output:

  docname from to                        pre                  keyword
1   text1   12 17 ROOT ( S ( NP ( PRP It ) )          ( VP ( VBZ is )
2   text1   22 27 ( VP ( VBZ is ) ( RB not ) ( VP ( VBN transmitted )
3   text1   40 45    ( IN from ) ( : : ) ( S      ( VP ( VBG giving )
                                 post                      pattern
1 ( RB not ) ( VP ( VBN transmitted ) \\( VP \\V\\w* \\w* \\w* \\)
2            ( PP ( IN from ) ( : : ) \\( VP \\V\\w* \\w* \\w* \\)
3           ( NP ( NP ( NP ( NP ( NML \\( VP \\V\\w* \\w* \\w* \\)

Upvotes: 0

sln

Reputation: 2759

R parses a double-quoted string literal first, replacing each recognised escape
sequence (such as \n or \t) with the character it stands for.
An escaped backslash (\\) always resolves to a single literal backslash,
so after parsing, what remains is the raw regex string passed to the function.

So your double quoted string should be "\\(VP\\h+\\(V\\w*\\h+\\w*\\)" which gets parsed to \(VP\h+\(V\w*\h+\w*\) which is handed to the stringr function.

 library(stringr)
 str_match_all(
 "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)",
 "\\(VP\\h+\\(V\\w*\\h+\\w*\\)" )

https://www.mycompiler.io/view/BjjkPXQUNpT

Output

 [[1]]
      [,1]                   
 [1,] "(VP (VBZ is)"         
 [2,] "(VP (VBN transmitted)"
 [3,] "(VP (VBG giving)"     

Each language enforces different parsing rules.
Some, like R, throw an error when they encounter an unknown escape sequence
such as \(; others silently strip the backslash, leaving (, and do not tell you about it.
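To see the two parsing stages separately, here is a minimal sketch in base R. It assumes R 4.0.0 or later for the raw string syntax r"[...]", and uses regmatches() with perl = TRUE instead of stringr so that it runs without extra packages; the variable names are illustrative.

```r
# Escaped form: R's string parser turns each \\ into a single backslash
escaped_pattern <- "\\(VP\\h+\\(V\\w*\\h+\\w*\\)"

# Raw string form (R >= 4.0.0): backslashes are kept exactly as typed
raw_pattern <- r"[\(VP\h+\(V\w*\h+\w*\)]"

# Both literals produce the same string, which is what the regex engine sees
identical(escaped_pattern, raw_pattern)
cat(escaped_pattern, "\n")

txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

# perl = TRUE selects PCRE, which understands \h (horizontal whitespace)
matches <- regmatches(txt, gregexpr(escaped_pattern, txt, perl = TRUE))[[1]]
matches
```

The raw string form avoids the double escaping entirely, which makes it easier to paste a pattern straight from a tester such as regex101.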

Upvotes: 1
