How to tokenise on hyphens using unnest_tokens in R

Question

I'm trying to tokenise a data frame containing strings. Some of them contain hyphens, and I'd like to split on those hyphens using unnest_tokens().

I've tried upgrading tidytext from 0.1.9 to 0.2.0. I've also tried a number of regex variations to capture the hyphen, for example:



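library(dplyr)
library(tidytext)

df <- data.frame(words = c("Solutions for the public sector | IT for business", "Transform the IT experience - IT Transformation - ITSM"))

# Split on |, :, - or , (the backslash must be doubled inside an R string)
df %>% 
  unnest_tokens(query, words, 
                token = "regex",
                pattern = "(?:\\||:|[-]|,)")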

I expect to see:

query
solutions for the public sector
it for business
transform the it experience
it transformation
itsm

Instead, I only get the tokenised lines that contain no hyphens:

query
solutions for the public sector
it for business
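
For comparison, here is a minimal base R sketch that applies the same set of delimiters outside tidytext, which may help check whether the pattern itself matches the hyphens. It assumes the separators are plain ASCII characters; if the real data uses a different dash (an en dash, say), that character would need adding to the class. The variable names `pieces` and `query` are only illustrative.

```r
# Sample strings copied from the question.
words <- c("Solutions for the public sector | IT for business",
           "Transform the IT experience - IT Transformation - ITSM")

# Split on any of | : , - (a hyphen placed last in a character class is literal).
pieces <- unlist(strsplit(words, "[|:,-]"))

# Trim surrounding whitespace, lower-case, and drop any empty pieces.
query <- tolower(trimws(pieces))
query <- query[nzchar(query)]

print(query)
```

If this splits the hyphenated string as expected but unnest_tokens() does not, the difference is more likely in the data or in how the pattern is passed than in the character class itself.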

