Reputation: 110
I'm doing a project on text mining, and I want to write a small function that counts the number of distinct tokens in a text. The tokenization is done by the function tidytext::unnest_tokens, which is basically a wrapper around the functions of the tokenizers package (such as tokenizers::tokenize_ngrams) for use with the pipe. My current approach is the following:
count_tokens <- function(data, output, token = "words", ...) {
  data %>%
    select(textcolumn) %>%
    tidytext::unnest_tokens(tbl=output, input=textcolumn, ...) %>%
    n_distinct()
}
This works fine (even with the ...) as long as I use arguments from tidytext::unnest_tokens, such as to_lower or drop.
count_tokens(data, word, to_lower = FALSE)
works fine!
Now, the documentation of tidytext::unnest_tokens states that ... can also be used as extra arguments passed on to tokenizers, such as strip_punct for "words" and "tweets", n and k for "ngrams" and "skip_ngrams", (...). However, if I pass the parameter n through the ellipsis argument of my function, the call fails.
count_tokens(data, ngram, token = "ngrams", to_lower = FALSE, n = 10)
brings up the following error message:
Error in tf(col, lowercase = to_lower, ...) : unused argument (n = 10)
Can someone point me in the right direction or even tell me how I need to adapt my code?
Upvotes: 0
Views: 74
Reputation: 4184
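First of all, your example does not seem to be reproducible as posted. For me, the updated function below works correctly. The important change is that token is now passed on to tidytext::unnest_tokens explicitly. In your version, token is a named argument of count_tokens, so it is not part of ... and never reaches unnest_tokens; the default "words" tokenizer is used, and it rejects n.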
library(dplyr)
library(tidytext)
library(janeaustenr)
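# Count the distinct tokens in the `txt` column of `data`.
count_tokens <- function(data, output, token = "words", ...) {
  data %>%
    select(txt) %>%
    # Forward `token` explicitly so that `...` reaches the matching tokenizer.
    tidytext::unnest_tokens(output, input = "txt", token = token, ...) %>%
    n_distinct()
}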
d <- tibble(txt = prideprejudice)
count_tokens(d, "word", to_lower = FALSE)
#> [1] 6915
count_tokens(d, "ngram", token = "ngrams", to_lower = FALSE, n = 8)
#> [1] 122189
count_tokens(d, "ngram", token = "ngrams", to_lower = FALSE, n = 5)
#> [1] 121599
count_tokens(d, "ngram", token = "ngrams", to_lower = FALSE, n = 3)
#> [1] 104664
Created on 2021-02-03 by the reprex package (v0.3.0)
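To see why forwarding token matters, here is a minimal base-R sketch of the mechanism. All functions in it (tokenize_words_demo, tokenize_ngrams_demo, unnest_demo, count_bad, count_good) are hypothetical stand-ins written for illustration; they are not the real tidytext or tokenizers code. They only mimic the call tf(col, lowercase = to_lower, ...) that appears in the error message.

```r
# Hypothetical stand-in for a word tokenizer: it has no `n` argument.
tokenize_words_demo <- function(x, lowercase = TRUE) {
  if (lowercase) x <- tolower(x)
  strsplit(x, "[^[:alnum:]]+")[[1]]
}

# Hypothetical stand-in for an n-gram tokenizer: it accepts `n`.
tokenize_ngrams_demo <- function(x, lowercase = TRUE, n = 2) {
  w <- tokenize_words_demo(x, lowercase)
  if (length(w) < n) return(character(0))
  vapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Picks a tokenizer from `token` and hands `...` on to it.
unnest_demo <- function(x, token = "words", to_lower = TRUE, ...) {
  tf <- switch(token,
               words = tokenize_words_demo,
               ngrams = tokenize_ngrams_demo,
               stop("unknown token type: ", token))
  tf(x, lowercase = to_lower, ...)
}

# `token` is captured by the wrapper and never forwarded,
# so `n` ends up at the word tokenizer.
count_bad <- function(x, token = "words", ...) {
  length(unique(unnest_demo(x, ...)))
}

# `token` is forwarded, so `n` ends up at the n-gram tokenizer.
count_good <- function(x, token = "words", ...) {
  length(unique(unnest_demo(x, token = token, ...)))
}

txt <- "the cat sat on the mat"
good_result <- count_good(txt, token = "ngrams", n = 2)
bad_message <- tryCatch(
  count_bad(txt, token = "ngrams", n = 2),
  error = function(e) conditionMessage(e)
)
print(good_result)
print(bad_message)
```

In this sketch count_good returns the number of distinct bigrams, while count_bad stops with an "unused argument" error, which is the same kind of failure as in the question.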
Upvotes: 2