Pawliczek

Reputation: 63

Tokenizing word using tidytext - preserving punctuation

I've been trying to preserve punctuation such as "-", "(", "/", and "'" when tokenizing words.

library(tidytext)
library(dplyr)

data <- tibble(title = "Computer-aided detection (1 / 2)")
data %>% unnest_tokens(input = title,
                       output = słowo,
                       token = "ngrams",
                       n = 2)

I want output to be like this:

computer-aided
aided detection
detection (1
(1 / 2)

Any suggestions?

Upvotes: 1

Views: 235

Answers (1)

phiver

Reputation: 23608

If you want to preserve the values "(", "/", and ")", the output would be "(1 /" and "/ 2)", not "(1 / 2)"; that last one is a trigram, not a bigram. Also, if you keep the hyphen (-), the second line of your desired output ("aided detection") would not exist, because the text would not be split on that character.

tidytext uses the tokenizers package to unnest the data, and its ngram tokenizer cannot handle these exceptions.

Here is an example using quanteda with the option fasterword that covers most of your needs.

library(quanteda)
tokens(data$title, what = "fasterword", remove_punct = FALSE) %>% 
  tokens_ngrams(n = 2, concatenator = " ")

Tokens consisting of 1 document.
text1 :
[1] "Computer-aided detection" "detection (1"             "(1 /"                     "/ 2)"  

You could experiment with different values of n, such as n = 2:3, to see where that gets you, and then filter out the ngrams you don't need.
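For instance, a sketch of that idea (assuming the question's data; the filtering rule here is just one illustrative choice): generate bigrams and trigrams together with n = 2:3, then keep only the full "(1 / 2)" trigram plus the ngrams that contain no stray brackets or slashes.

```r
library(quanteda)

# The questioner's example title.
data <- data.frame(title = "Computer-aided detection (1 / 2)")

# "fasterword" splits on whitespace only, so punctuation stays
# attached to each word; n = 2:3 builds bigrams and trigrams.
toks <- tokens(data$title, what = "fasterword", remove_punct = FALSE) %>%
  tokens_ngrams(n = 2:3, concatenator = " ")

# Flatten to a character vector and filter: keep the complete
# "(1 / 2)" trigram and any ngram without "(", ")", or "/".
ngrams <- as.character(toks)
keep <- ngrams[ngrams == "(1 / 2)" | !grepl("[()/]", ngrams)]
keep
```

With this particular filter, keep reduces to "Computer-aided detection" and "(1 / 2)", which matches the first and last lines of your desired output; you would adjust the filtering rule to suit your real data.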

Upvotes: 1
