How to exclude a few patterns in R regex, some might be 2 or more char

Question

I'm trying to use the following regex to split a string into tokens.

The string is R source code, so I want each [[:punct:]] character to become its own token.

However, I want to keep ' and _ inside a word, since they belong to a single token.

My question is: how can I add more cases such as ==, <=, <- and &&? I tried ['_==<=<-&&], but I don't think this is the right way.

strsplit(str, "(\\s+)|(?!['_])(?=[[:punct:]])", perl = TRUE)

Aurèle · Accepted Answer

It's better to use the R parser itself than to write your own tokenizer, which is a difficult task since you would essentially have to re-implement the parser.

For instance:

x <- parse(text = "x <- c(1, 4)
 x ^ 3 -10 ; outer(1:7, 5:9)
 a <-3 ; a < -3")

str(lapply(as.list(x), as.list))

List of 5
 $ :List of 3
  ..$ : symbol <-
  ..$ : symbol x
  ..$ : language c(1, 4)
 $ :List of 3
  ..$ : symbol -
  ..$ : language x^3
  ..$ : num 10
 $ :List of 3
  ..$ : symbol outer
  ..$ : language 1:7
  ..$ : language 5:9
 $ :List of 3
  ..$ : symbol <-
  ..$ : symbol a
  ..$ : num 3
 $ :List of 3
  ..$ : symbol <
  ..$ : symbol a
  ..$ : language -3

Edit

(per OP's comment)

str <- "x <- c(1, 4)
 x ^ 3 -10 ; outer(1:7, 5:9)
 a <-3 ; a < -3"

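# keep.source = TRUE ensures parse data is recorded, even in non-interactive sessions
Filter(function(x) x != "", getParseData(parse(text = str, keep.source = TRUE))$text)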

#  [1] "x"     "<-"    "c"     "("     "1"     ","     "4"    
#  [8] ")"     "x"     "^"     "3"     "-"     "10"    ";"    
# [15] "outer" "("     "1"     ":"     "7"     ","     "5"    
# [22] ":"     "9"     ")"     "a"     "<-"    "3"     ";"    
# [29] "a"     "<"     "-"     "3"
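
The multi-character operators from the question (==, <=, <-, &&) come out as single tokens with this approach, and an underscore stays inside its identifier. Below is a minimal sketch of that, wrapped in a helper. The name tokenize_r and the sample input are made up for illustration and are not part of the original answer. It selects rows through the terminal column of the parse data instead of filtering out empty strings.

```r
# Hypothetical helper (illustration only): tokenize R source text with the
# R parser and keep only terminal nodes, i.e. the actual tokens.
tokenize_r <- function(code) {
  # keep.source = TRUE is needed so that getParseData() has data to return
  parsed <- parse(text = code, keep.source = TRUE)
  pd <- getParseData(parsed)
  # Non-terminal rows (whole expressions) have empty text; drop them
  pd[pd$terminal, c("token", "text")]
}

# Sample input chosen to exercise ==, &&, <= and <- plus an underscore name
toks <- tokenize_r("if (a == 1 && b <= 2) my_var <- a")
print(toks$text)
```

The token column holds the parser's own classification of each piece, which can help if you later need to tell operators apart from symbols or constants.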
