I'm trying to use the following regex to split a string into tokens.
The string is R source code, so I want each [:punct:] character to become its own token.
However, I want to keep ' and _ inside words, since they belong to a single token.
My question is: how can I add more cases such as ==, <=, <-, and &&, so that each of these is kept as a single token? I tried ['_==<=<-&&], but I don't think this is the right way.
strsplit(str, "(\\s+)|(?!['_])(?=[[:punct:]])", perl = TRUE)
Upvotes: 0
Views: 52
Reputation: 12819
It's better to use R's own parser than to tokenize the code yourself, which is a difficult task because you would essentially have to re-implement the parser.
For instance:
x <- parse(text = "x <- c(1, 4)\n x ^ 3 -10 ; outer(1:7, 5:9)\n a <-3 ; a < -3")
str(lapply(as.list(x), as.list))
List of 5
 $ :List of 3
  ..$ : symbol <-
  ..$ : symbol x
  ..$ : language c(1, 4)
 $ :List of 3
  ..$ : symbol -
  ..$ : language x^3
  ..$ : num 10
 $ :List of 3
  ..$ : symbol outer
  ..$ : language 1:7
  ..$ : language 5:9
 $ :List of 3
  ..$ : symbol <-
  ..$ : symbol a
  ..$ : num 3
 $ :List of 3
  ..$ : symbol <
  ..$ : symbol a
  ..$ : language -3
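The str() output only unpacks one level of each call, so nested pieces such as c(1, 4) still show up as "language". As a rough sketch, a small recursive helper can walk each call down to its leaves. The name leaves is hypothetical and not part of base R, and the input string is made up for illustration. Note that the leaves come out in call order (function or operator first), not in source order, and that parentheses and commas are not represented in the parse tree at all.

```r
# Hypothetical helper (not part of base R): recursively collect the leaves
# (symbols and constants) of a parsed expression as character strings.
leaves <- function(e) {
  if (is.expression(e) || is.call(e)) {
    # A call is a list-like object: function/operator first, then arguments.
    unlist(lapply(as.list(e), leaves))
  } else {
    as.character(e)
  }
}

exprs <- parse(text = "a <- b + 1; a <= 3")
res <- lapply(as.list(exprs), leaves)
print(res)
```

Operators such as <- and <= arrive as single symbols because the parser has already recognized them, which is the part that is hard to reproduce with a regex.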
Edit (per OP's comment):
str <- "x <- c(1, 4)\n x ^ 3 -10 ; outer(1:7, 5:9)\n a <-3 ; a < -3"
Filter(function(x) x != "", getParseData(parse(text = str))$text)
# [1] "x" "<-" "c" "(" "1" "," "4"
# [8] ")" "x" "^" "3" "-" "10" ";"
# [15] "outer" "(" "1" ":" "7" "," "5"
# [22] ":" "9" ")" "a" "<-" "3" ";"
# [29] "a" "<" "-" "3"
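A slightly more explicit variant of the same idea, offered as a sketch rather than the definitive way to do it: it passes keep.source = TRUE to parse(), on the assumption that the code may run non-interactively (for example under Rscript), where keep.source defaults to FALSE and getParseData() would return NULL. It also keeps the token column, so each piece of text comes with the parser's own label for it. The variable names and the input string are made up for illustration.

```r
code <- "a <-3; a == b && c <= d"

# keep.source = TRUE is set explicitly so that parse data is recorded even in
# non-interactive sessions.
pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

# Terminal rows are the actual tokens; non-terminal rows are enclosing
# expressions.
tokens <- pd[pd$terminal, c("token", "text")]
print(tokens, row.names = FALSE)
```

Here ==, &&, <= and <- each occupy a single row, so no special-casing of multi-character operators is needed.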
Upvotes: 2