whack_overflow

Reputation: 110

Include extra (ellipsis/dot dot dot) arguments within custom function

I'm doing a project on text mining, and I want to write a small function that counts the number of distinct tokens in a text. The tokenization is done by tidytext::unnest_tokens, which is essentially a pipe-friendly wrapper around the tokenizers package (e.g. tokenizers::tokenize_ngrams). My current approach is the following:

count_tokens <- function(data,output,token="words", ...){
  data %>% 
    select(textcolumn) %>% 
    tidytext::unnest_tokens(tbl=output, input=textcolumn, ...) %>% 
    n_distinct()
}

This works fine (even with the ...) as long as I use arguments of tidytext::unnest_tokens itself, such as to_lower or drop. count_tokens(data, word, to_lower = FALSE) works fine!

Now, the documentation of tidytext::unnest_tokens states that ... can also be used for extra arguments passed on to the tokenizers, such as strip_punct for "words" and "tweets", or n and k for "ngrams" and "skip_ngrams", (...). However, if I pass the parameter n through the ellipsis argument of my function, it crashes.

count_tokens(data, ngram, token = "ngrams", to_lower = FALSE, n = 10) brings up the following error message:

Error in tf(col, lowercase = to_lower, ...) : unused argument (n = 10)

Can someone point me in the right direction or even tell me how I need to adapt my code?

Upvotes: 0

Views: 74

Answers (1)

polkas

Reputation: 4184

First of all, your example does not seem to be valid: the function never forwards token to unnest_tokens(), so the default "words" tokenizer is always used and has no n argument. The updated function below works correctly for me.

library(dplyr)
library(tidytext)
library(janeaustenr)

count_tokens <- function(data, output, token = "words", ...){
  data %>% 
    select(txt) %>% 
    # forward token explicitly so that ... reaches the matching tokenizer
    tidytext::unnest_tokens(output, input = "txt", token = token, ...) %>% 
    n_distinct()
}

d <- tibble(txt = prideprejudice)

count_tokens(d, "word", to_lower = FALSE)
#> [1] 6915

count_tokens(d, "ngram", token = "ngrams", to_lower = FALSE, n = 8)
#> [1] 122189

count_tokens(d, "ngram", token = "ngrams", to_lower = FALSE, n = 5)
#> [1] 121599

count_tokens(d, "ngram", token = "ngrams", to_lower = FALSE, n = 3)
#> [1] 104664

Created on 2021-02-03 by the reprex package (v0.3.0)
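To see why the original version failed, here is a minimal base-R sketch of the dispatch that happens inside unnest_tokens() (the helpers tok_words, tok_ngrams, and count_distinct are made-up names for illustration, not tidytext internals): the ... arguments are handed to whichever tokenizer is selected, so when token is not forwarded, the default "words" tokenizer receives n and errors.

```r
tok_words  <- function(x) strsplit(x, " ")[[1]]
tok_ngrams <- function(x, n) {
  w <- strsplit(x, " ")[[1]]
  sapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "))
}

count_distinct <- function(x, token = "words", ...) {
  # pick a tokenizer, then pass ... straight on to it,
  # analogous to what unnest_tokens() does with its token argument
  f <- switch(token, words = tok_words, ngrams = tok_ngrams)
  length(unique(f(x, ...)))
}

count_distinct("a b a b c", token = "ngrams", n = 2)  # 3 distinct bigrams
# count_distinct("a b a b c", n = 2)
# -> Error: unused argument (n = 2): the default "words" tokenizer is
#    called and has no `n` parameter, which is the same failure mode as
#    the question's function, where token was never forwarded
```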

Upvotes: 2
