Unexpected behaviour with dfm_lookup - ordering of entries affects feature frequency counts

Question

I am using quanteda 4.1.0 and getting some unexpected behaviour when using a dictionary to adjust for synonyms and plurals. The ordering of the entries in the dictionary is affecting the frequency count of features.

In the example below, "banana" and its plural appear three times, while "apple" and its plural appear twice. But I only get the correct frequency counts when the dictionary lists "apple" before "banana". So it seems the order of the entries in the dictionary affects the behaviour of dfm_lookup()?

library(quanteda)
library(quanteda.textstats)
library(dplyr)  # provides %>% and filter() used below

dfmat <- dfm(tokens(c("I like apples, but I don't like apple pie. Bananas are OK",
                      "I like bananas, but I don't like banana fritter.")))

textstat_frequency(dfmat) %>% filter(grepl("apple|banana", feature))
#    feature frequency rank docfreq group
# 7  bananas         2    3       2   all
# 8   apples         1    8       1   all
# 9    apple         1    8       1   all
# 13  banana         1    8       1   all

#With wildcards
#This works - expected behaviour
dict <- dictionary(list(apple = c("apple*"),
                        banana = c("banana*")))
dfmat <-  dfm_lookup(dfmat,
                    dictionary = dict, exclusive = FALSE, capkeys = FALSE)

textstat_frequency(dfmat) %>% filter(grepl("apple|banana", feature))
#   feature frequency rank docfreq group
# 3  banana         3    3       2   all
# 4   apple         2    4       1   all


# This doesn't work - unexpected behaviour: the counts for "apple" and "banana" are swapped
dict <- dictionary(list(banana = c("banana*"),
                        apple = c("apple*")))

dfmat <-  dfm_lookup(dfmat,
                    dictionary = dict, exclusive = FALSE, capkeys = FALSE)

textstat_frequency(dfmat) %>% filter(grepl("apple|banana", feature))
#   feature frequency rank docfreq group
# 3   apple         3    3       2   all
# 4  banana         2    4       1   all


#Without wildcards - get the same (puzzling) behaviour
#This works
#dict <- dictionary(list(apple = c("apple","apples"),
#                        banana = c("banana","bananas")))
#This doesn't work
#dict <- dictionary(list(banana = c("banana","bananas"),
#                        apple = c("apple","apples")))

Kohei Watanabe · Accepted Answer

I think it is a bug. dfmat1 and dfmat2 in the example below should be identical, but they are not. Until this is fixed, please apply the dictionary with tokens_lookup() and build the dfm from the result.

library(quanteda)
#> Package version: 4.1.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.

toks <- tokens(c("I like apples, but I don't like apple pie. Bananas are OK",
                 "I like bananas, but I don't like banana fritter."))
dfmat <- dfm(toks)

dict <- dictionary(list(apple = c("apple*"),
                        banana = c("banana*")))
dfmat1 <-  dfm_lookup(dfmat,
                     dictionary = dict, exclusive = FALSE, capkeys = FALSE)

dfmat2 <-  dfm_lookup(dfmat,
                      dictionary = rev(dict), exclusive = FALSE, capkeys = FALSE)

identical(as.matrix(dfmat1), as.matrix(dfmat2))
#> [1] FALSE

dfmat3 <-  dfm(tokens_lookup(toks, dictionary = rev(dict), 
                             exclusive = FALSE, capkeys = FALSE))

identical(as.matrix(dfmat1), as.matrix(dfmat3))
#> [1] TRUE
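
As a rough sketch of how that workaround might look for the reversed dictionary from the question: the dictionary is applied at the tokens stage and the dfm is built afterwards. The object names dfmat_fixed and freq are illustrative and not from the original post, and featfreq() is used here only as a compact way to read off the totals.

```r
library(quanteda)

toks <- tokens(c("I like apples, but I don't like apple pie. Bananas are OK",
                 "I like bananas, but I don't like banana fritter."))

# Same key order that gave the swapped counts with dfm_lookup()
dict <- dictionary(list(banana = c("banana*"),
                        apple = c("apple*")))

# Replace matching tokens with their dictionary key, keep all other tokens,
# then construct the dfm from the modified tokens
dfmat_fixed <- dfm(tokens_lookup(toks, dictionary = dict,
                                 exclusive = FALSE, capkeys = FALSE))

# Total frequency of the two dictionary keys across both documents
freq <- featfreq(dfmat_fixed)[c("banana", "apple")]
print(freq)
```

If tokens_lookup() behaves as in the comparison above, this should report 3 for banana and 2 for apple regardless of the order of the keys in the dictionary.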
