Reputation: 107
I'm trying to tokenise a dataframe containing strings. Some contain hyphens, and I'd like to tokenise on hyphens using unnest_tokens().
I've tried upgrading tidytext from 0.1.9 to 0.2.0. I've also tried a number of regex variations to capture the hyphen from:
library(dplyr)
library(tidytext)
df <- data.frame(words = c("Solutions for the public sector | IT for business", "Transform the IT experience - IT Transformation - ITSM"))
df %>%
  unnest_tokens(query, words,
                token = "regex",
                pattern = "(?:\\||\\:|[-]|,)")
I expect to see:
query
solutions for the public sector
it for business
transform the it experience
it transformation
itsm
Instead, I only get the tokens from the line without hyphens:
query
solutions for the public sector
it for business
Upvotes: 1
Views: 440
Reputation: 627082
You may use
library(stringr)
df %>%
unnest_tokens(query, words, token = stringr::str_split, pattern = "[-:,|]")
This command will use stringr::str_split to split on the [-:,|] pattern, which matches a hyphen (-), colon (:), comma (,), or pipe (|). Note that these characters do not need to be escaped inside a character class/bracket expression. The hyphen does not need to be escaped when it is the first or last char, and the others are simply not special inside a character class.
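To check the character class on its own, outside unnest_tokens(), here is a minimal sketch that calls stringr::str_split directly on the two sample strings from the question. The names sample_words, pieces and tokens are illustrative only, and the trimming and lowercasing step is an assumption added here so the result matches the expected output in the question; it is not part of the command above.

```r
library(stringr)

# Sample strings copied from the question (variable name is illustrative)
sample_words <- c("Solutions for the public sector | IT for business",
                  "Transform the IT experience - IT Transformation - ITSM")

# Split on any single hyphen, colon, comma, or pipe.
# The hyphen comes first in the class, so it is literal and needs no escape.
pieces <- str_split(sample_words, pattern = "[-:,|]")

# Assumption: trim the spaces left around each delimiter and lowercase,
# to mirror the output the question expects.
tokens <- lapply(pieces, function(p) tolower(str_trim(p)))

print(tokens[[1]])
# [1] "solutions for the public sector" "it for business"
print(tokens[[2]])
# [1] "transform the it experience" "it transformation" "itsm"
```

If the pieces come back split on every delimiter here, the pattern itself is fine and any remaining problem lies in how the tokeniser is being called.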
Upvotes: 1