Rob Ackland

Reputation: 31

Unexpected behaviour with dfm_lookup - ordering of entries affects feature frequency counts

I am using quanteda 4.1.0 and getting some unexpected behaviour when using a dictionary to adjust for synonyms and plurals. The ordering of the entries in the dictionary is affecting the frequency count of features.

In the example below, "banana" and its plural appear three times, while "apple" and its plural appear twice. But I only get the correct frequency counts when the dictionary lists "apple" before "banana". So it seems that the alphabetical ordering of the dictionary entries affects the behaviour of dfm_lookup()?

library(quanteda)
library(quanteda.textstats)
library(dplyr)  # provides %>% and filter() used below

dfmat <- dfm(tokens(c("I like apples, but I don't like apple pie. Bananas are OK",
                      "I like bananas, but I don't like banana fritter.")))

textstat_frequency(dfmat) %>% filter(grepl("apple|banana", feature))
#    feature frequency rank docfreq group
# 7  bananas         2    3       2   all
# 8   apples         1    8       1   all
# 9    apple         1    8       1   all
# 13  banana         1    8       1   all

# With wildcards
# This works - expected behaviour
dict <- dictionary(list(apple = c("apple*"),
                        banana = c("banana*")))
# Store the result in a new object so the original dfmat stays available
dfmat_ok <- dfm_lookup(dfmat,
                       dictionary = dict, exclusive = FALSE, capkeys = FALSE)

textstat_frequency(dfmat_ok) %>% filter(grepl("apple|banana", feature))
#   feature frequency rank docfreq group
# 3  banana         3    3       2   all
# 4   apple         2    4       1   all


# This doesn't work - unexpected behaviour (the counts end up under the wrong keys)
dict <- dictionary(list(banana = c("banana*"),
                        apple = c("apple*")))

dfmat_rev <- dfm_lookup(dfmat,
                        dictionary = dict, exclusive = FALSE, capkeys = FALSE)

textstat_frequency(dfmat_rev) %>% filter(grepl("apple|banana", feature))
#   feature frequency rank docfreq group
# 3   apple         3    3       2   all
# 4  banana         2    4       1   all


#Without wildcards - get the same (puzzling) behaviour
#This works
#dict <- dictionary(list(apple = c("apple","apples"),
#                        banana = c("banana","bananas")))
#This doesn't work
#dict <- dictionary(list(banana = c("banana","bananas"),
#                        apple = c("apple","apples")))

Upvotes: 1

Views: 59

Answers (2)

Kohei Watanabe

Reputation: 890

I think it is a bug. dfmat1 and dfmat2 should be identical, but they are not. Until this is fixed, please use tokens_lookup().

library(quanteda)
#> Package version: 4.1.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.

toks <- tokens(c("I like apples, but I don't like apple pie. Bananas are OK",
                 "I like bananas, but I don't like banana fritter."))
dfmat <- dfm(toks)

dict <- dictionary(list(apple = c("apple*"),
                        banana = c("banana*")))
dfmat1 <-  dfm_lookup(dfmat,
                     dictionary = dict, exclusive = FALSE, capkeys = FALSE)

dfmat2 <-  dfm_lookup(dfmat,
                      dictionary = rev(dict), exclusive = FALSE, capkeys = FALSE)

identical(as.matrix(dfmat1), as.matrix(dfmat2))
#> [1] FALSE

dfmat3 <-  dfm(tokens_lookup(toks, dictionary = rev(dict), 
                             exclusive = FALSE, capkeys = FALSE))

identical(as.matrix(dfmat1), as.matrix(dfmat3))
#> [1] TRUE

Upvotes: 2

Joseph Wilson

Reputation: 11

I don't have too much experience with quanteda but I will (perhaps unwisely) share what I've discovered in relation to this question anyway.

It seems the order of the entries in the dictionary is definitely having an impact on the outcome of dfm_lookup(), which, at least to me, is unexpected (perhaps unintended?).

Firstly, this unexpected behavior does not appear to be the result of alphabetical ordering. Instead, it seems to depend on how the order of the dictionary keys relates to the order of the features (columns) in the document-feature matrix. Check out the example below:

feature_matrix <- dfm(tokens(c("foo1 foo2 foo3",
                          "bar1 bar2 bar3 foo4")))

# Entries in dictionary in alphabetical order, however resultant dfm clearly wrong
dict1 <- dictionary(list(bar = c("bar*"),foo = c("foo*")))
feature_matrix1 <- dfm_lookup(feature_matrix, dictionary = dict1, exclusive = FALSE, capkeys = FALSE)
feature_matrix1
#docs    bar foo
#text1   3   0
#text2   0   4

# Entries in dictionary not in alphabetical order, however resultant dfm correct
dict2 <- dictionary(list(foo = c("foo*"), bar = c("bar*")))
feature_matrix2 <-  dfm_lookup(feature_matrix, dictionary = dict2, exclusive = FALSE, capkeys = FALSE)
feature_matrix2
#docs    foo bar
#text1   3   0
#text2   1   3

I also had a little look at the source code for dfm_lookup and think I have a quick fix (one that hasn't been fully tested and I'm not extremely confident in, but at least it works for a few examples). I only changed one line of code and it seems to address the unexpected behavior in both these examples.

my_dfm_lookup <- function(x, dictionary, levels = 1:5,
                           exclusive = TRUE,
                           valuetype = c("glob", "regex", "fixed"),
                           case_insensitive = TRUE,
                           capkeys = !exclusive,
                           nomatch = NULL,
                           verbose = quanteda_options("verbose")) {
  
  x <- as.dfm(x)
  exclusive <- check_logical(exclusive)
  capkeys <- check_logical(capkeys)
  verbose <- check_logical(verbose)
  
  if (!nfeat(x) || !ndoc(x)) return(x)
  
  if (!is.dictionary(dictionary))
    stop("dictionary must be a dictionary object")
  
  valuetype <- match.arg(valuetype)
  type <- colnames(x)
  attrs <- attributes(x)
  
  if (verbose)
    quanteda:::catm("applying a dictionary consisting of ", length(dictionary), " key",
                    if (length(dictionary) > 1L) "s" else "", "\n", sep = "")
  
  ids <- object2id(dictionary, type, valuetype, case_insensitive,
                   quanteda:::field_object(attrs, "concatenator"), levels)
  
  # flag nested patterns
  if (length(ids)) {
    m <- factor(names(ids), levels = unique(names(ids)))
    dup <- unlist(lapply(split(ids, m), duplicated), use.names = FALSE)
  } else {
    dup <- logical()
  }
  
  key <- attr(ids, "key")
  ids <- ids[lengths(ids) == 1 & !dup] # drop phrasal and nested patterns
  id_key <- match(names(ids), key)
  id <- unlist(ids, use.names = FALSE)
  if (capkeys)
    key <- char_toupper(key)
  if (length(id)) {
    if (exclusive) {
      if (!is.null(nomatch)) {
        id_nomatch <- setdiff(seq_len(nfeat(x)), id)
        id <- c(id, id_nomatch)
        id_key <- c(id_key, rep(length(key) + 1,
                                length(id_nomatch)))
        key <- c(key, nomatch[1])
      }
      col_new <- key[id_key]
      x <- x[, id]
      quanteda:::set_dfm_featnames(x) <- col_new
      # merge identical keys and add non-existent keys
      result <- dfm_match(dfm_compress(x, margin = "features"), key)
    } else {
      if (!is.null(nomatch))
        warning("nomatch only applies if exclusive = TRUE")
      col_new <- type
      
      # repeat columns for multiple keys
      if (any(duplicated(id))) {
        ids_rep <- as.list(seq_len(nfeat(x)))
        ids_rep[unique(id)] <- split(id, id)
        id_rep <- unlist(ids_rep, use.names = FALSE)
      } else {
        id_rep <- seq_len(nfeat(x))
      }
      col_new <- col_new[id_rep]
      # This is the only meaningful change I made to the quanteda function.
      # Originally it was:
      # col_new[id_rep %in% id] <- key[id_key]
      # I believe this is where the unexpected behavior comes from: the mask
      # on the left is in column order, while id_key follows the order of id.
      # begin change
      col_new[id_rep %in% id] <- key[id_key[order(id)]]
      # end change
      x <- x[,id_rep]
      
      quanteda:::set_dfm_featnames(x) <- col_new
      result <- dfm_compress(x, margin = "features")
    }
    
  } else {
    if (exclusive) {
      if (!is.null(nomatch)) {
        result <- as.dfm(matrix(ntoken(x), ncol = 1,
                                dimnames = list(docnames(x), nomatch)))
      } else {
        result <- quanteda:::make_null_dfm(document = docnames(x), 
                                           feature = key)
      }
    } else {
      result <- x
    }
  }
  if (exclusive)
    quanteda:::field_object(attrs, "what") <- "dictionary"
  quanteda:::rebuild_dfm(result, attrs)
}
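
To see why that one line might matter, here is a minimal base-R sketch of the relabelling step using the foo/bar example above. It does not call quanteda at all. The vectors type, key, id and id_key are written out by hand and are my assumption about what the function holds at that point (in particular, that matches are collected key by key, so id is not sorted by column position); buggy, fixed and matched are names I made up for illustration.

```r
# Toy reconstruction of the relabelling step for the foo/bar example.
# All values are hand-written assumptions; nothing here calls quanteda.
type <- c("foo1", "foo2", "foo3", "bar1", "bar2", "bar3", "foo4")  # feature (column) order
key  <- c("bar", "foo")                                            # key order of dict1

# Assumed: matches are gathered key by key ("bar*" first, then "foo*"),
# so id is in dictionary order rather than column order.
id     <- c(4, 5, 6, 1, 2, 3, 7)   # column index of each matched feature
id_key <- c(1, 1, 1, 2, 2, 2, 2)   # index into key for each match

# Logical mask of matched columns; this is always in column order.
matched <- seq_along(type) %in% id

# Original assignment: labels are filled in dictionary order.
buggy <- type
buggy[matched] <- key[id_key]

# Modified assignment: reorder the labels to follow column order first.
fixed <- type
fixed[matched] <- key[id_key[order(id)]]

print(rbind(type, buggy, fixed))
```

In this toy case, buggy labels foo1 to foo3 as "bar" and bar1 to bar3 as "foo", which matches the wrong counts I got from dict1, while fixed gives every feature the key it actually matched. With dict2 the matches for "foo*" happen to come first in column order, which would explain why that dictionary looked almost right.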

Upvotes: 1
