I am currently using the unnest_tokens() function from the tidytext package. It works exactly as I need it to, except that it removes ampersands (&) from the text. I would like it to keep them and leave everything else unchanged.
For example:
library(tidyverse)
library(tidytext)
d <- tibble(txt = "Let's go to the Q&A about B&B, it's great!")
d %>% unnest_tokens(word, txt, token="words")
currently returns
# A tibble: 11 x 1
word
<chr>
1 let's
2 go
3 to
4 the
5 q
6 a
7 about
8 b
9 b
10 it's
11 great
but I'd like it to return
# A tibble: 9 x 1
word
<chr>
1 let's
2 go
3 to
4 the
5 q&a
6 about
7 b&b
8 it's
9 great
Is there a way to pass an option to unnest_tokens() to do this, or to supply the regex it currently uses, manually adjusted so that it does not split on the ampersand?
Upvotes: 2
Views: 138
Reputation: 887501
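We can set token = "regex" and pass a pattern that splits only on whitespace and the punctuation we want to drop, so the ampersands stay inside the tokens: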
library(tidytext)
library(dplyr)
d %>%
  unnest_tokens(word, txt, token = "regex", pattern = "[\\s!,.]")
# A tibble: 9 x 1
# word
# <chr>
#1 let's
#2 go
#3 to
#4 the
#5 q&a
#6 about
#7 b&b
#8 it's
#9 great
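The pattern above lists the separators explicitly, so any punctuation not in the list (a question mark or semicolon, say) would stay attached to the word. A possible variation, offered as a sketch and not as what token = "words" does internally, is to split on every run of characters that is not a letter, digit, apostrophe, or ampersand. The pattern and the name tokens are illustrative choices, and a curly apostrophe would still be treated as a separator:

```r
library(dplyr)
library(tidytext)

d <- tibble(txt = "Let's go to the Q&A about B&B, it's great!")

# Assumed pattern: split on runs of anything that is NOT
# alphanumeric, a straight apostrophe, or an ampersand.
tokens <- d %>%
  unnest_tokens(word, txt, token = "regex", pattern = "[^[:alnum:]'&]+")

tokens
```

For the example sentence this should keep q&a and b&b intact, like the explicit separator list does.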
Upvotes: 2