Shohei Doi

Reputation: 23

tokens_compound() in quanteda changes the order of features

I found that tokens_compound() in quanteda changes the order of the token types across different R sessions. That is, the result differs after every restart even when the seed is fixed, although it does not change within a single session.

Here is the replication procedure:

  1. Find collocations, compound tokens, and save them.
library(quanteda)

set.seed(12345)

data(data_corpus_inaugural)

toks <- data_corpus_inaugural %>% 
  tokens(remove_punct = TRUE,
         remove_symbol = TRUE, 
         padding = TRUE) %>% 
  tokens_tolower()

col <- toks %>% 
  textstat_collocations()

toks.col <- toks %>%
  tokens_compound(pattern = col[col$z > 3])

write(attr(toks.col, "types"), "col1.txt")
  2. End and restart the R session and run the above code again, with "col1.txt" replaced by "col2.txt".

  3. Compare the two sets of tokens and find that they are different.

col1 <- read.table("col1.txt")
col2 <- read.table("col2.txt")

identical(col1$V1, col2$V1) # This should return FALSE.

col1$V1[head(which(col1$V1 != col2$V1))]
col2$V1[head(which(col1$V1 != col2$V1))]

This does not matter in many cases, but the result of LDA (from {topicmodels}) changes between sessions. I suspect the order of the types is the cause, because the LDA result is stable if I reset the order of features in the tokens object with as.list() followed by as.tokens() (dfm_sort() does not help here).
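The as.list() / as.tokens() workaround described above can be sketched as follows. This is a minimal illustration on a toy input: the two sentences and the phrase() patterns are assumptions made for the example and are not taken from the inaugural corpus. Whether the rebuilt type order is stable in every quanteda version is also an assumption, so it is worth checking on your own data.

```r
library(quanteda)

# Toy input (hypothetical, for illustration only)
toks <- tokens(c(d1 = "my fellow citizens of the united states",
                 d2 = "the united states and my fellow citizens"))

# Compound two multi-word expressions
toks_col <- tokens_compound(
  toks,
  pattern = phrase(c("fellow citizens", "united states"))
)

# Round-trip through a plain list so that the types table is rebuilt
# from the token sequences themselves
toks_reset <- as.tokens(as.list(toks_col))

# The token sequences are unchanged by the round trip
identical(as.list(toks_col), as.list(toks_reset))
```

Only the internal index of the types is rebuilt; the documents and their tokens stay as they were, so downstream steps such as dfm() can be applied to toks_reset directly.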

I wonder whether this happens only for me (Ubuntu 18.04.5, R 4.0.4, and quanteda 2.1.2), and I would be happy to hear of another (easier) solution.

Updated on Feb 20

For example, the output of LDA is not reproducible:

library(topicmodels)  # provides LDA()

lis <- list()
for (i in seq_len(2)) {
  set.seed(123)
  lis[[i]] <- tokens_compound(toks, pattern = col[col$z > 3]) %>% 
    dfm() %>% 
    convert(to = "topicmodels") %>% 
    LDA(k = 5,
        method = "Gibbs",
        control = list(seed = 12345,
                       iter = 100))
}

head(lis[[1]]@gamma)
head(lis[[2]]@gamma)

Upvotes: 2

Views: 201

Answers (1)

Ken Benoit

Reputation: 14902

An interesting investigation, but this is neither an error nor anything to be concerned about. Within a quanteda tokens object, the order of the types is not deterministic after a processing step such as tokens_compound(). This is because the function is parallelised in C++, and how these threads operate is not controlled by set.seed() in R. But this does not affect the important part, which is the set of types, or anything about the tokens themselves. If you want the order of the types that you extract to be the same, then you should sort them upon extraction.

library("quanteda")
## Package version: 2.1.2

toks <- data_corpus_inaugural %>%
  tokens(
    remove_punct = TRUE,
    remove_symbol = TRUE,
    padding = TRUE
  ) %>%
  tokens_tolower()
col <- quanteda.textstats::textstat_collocations(toks)

It turns out that you do not need to save the output or restart R - this happens within a single session.

# types are differently indexed, but are the same set
lis <- list()
for (i in seq_len(2)) {
  set.seed(123)
  toks.col <- tokens_compound(toks, pattern = col[col$z > 3])
  lis <- c(lis, list(types = types(toks.col)))
}
dframe <- data.frame(lis)

sum(dframe$types != dframe$types.1)
## [1] 19898
head(dframe[dframe$types != dframe$types.1, ])
##                                            types              types.1
## 8897                              at_this_second   my_fellow_citizens
## 8898 to_take_the_oath_of_the_presidential_office            no_people
## 8899                                    there_is             on_earth
## 8900                                occasion_for cause_to_be_thankful
## 8901                                 an_extended         this_is_said
## 8902                                   there_was            spirit_of

However, the (unordered) set of types is identical:

# but
setequal(dframe$types, dframe$types.1)
## [1] TRUE

More importantly, when we compare the tokens themselves, which are ordered, they are identical:

# tokens are the same
lis <- list()
for (i in seq_len(2)) {
  set.seed(123)
  toks.col <- tokens_compound(toks, pattern = col[col$z > 3])
  lis <- c(lis, list(toks = as.character(toks.col)))
}
dframe <- data.frame(lis)
all.equal(dframe$toks, dframe$toks.1)
## [1] TRUE

Created on 2021-02-18 by the reprex package (v1.0.0)

An additional comment, whose importance is underscored by this analysis: We strongly discourage direct access to object attributes. Use types(x) as above, not attr(x, "types"). The former will always work. The latter relies on our implementation of the object, which may change as we improve the package.
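To make the advice about sorting on extraction concrete, here is a minimal sketch that extracts the types with types() and sorts them before writing them out. The toy sentences, the phrase() patterns, and the temporary output file are assumptions made for the example. Using method = "radix" is one possible choice: for character vectors it sorts in the C locale, which should avoid locale-dependent differences between machines.

```r
library(quanteda)

# Toy input (hypothetical, for illustration only)
toks <- tokens(c(d1 = "my fellow citizens of the united states",
                 d2 = "the united states and my fellow citizens"))

toks_col <- tokens_compound(
  toks,
  pattern = phrase(c("fellow citizens", "united states"))
)

# Use the accessor rather than attr(x, "types"), and sort on extraction
# so the saved vector does not depend on the internal order of the types
typ <- sort(types(toks_col), method = "radix")

# Write to a temporary file (hypothetical path) and read it back
out <- tempfile(fileext = ".txt")
writeLines(typ, out)
identical(readLines(out), typ)
```

Two files written this way in different sessions can then be compared with identical(), since both are in the same sorted order.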

Upvotes: 5
