bgreen

Reputation: 87

Quanteda overlap frequency reporting problem

I’m trying to track down an apparent miscalculation caused by overlapping terms. On the dummy data set below (using the same code I run on the actual data), the analysis works as expected: “Australia” is identified as occurring twice and “Australia Post” once.

library(quanteda)
library(quanteda.textstats)  # textstat_frequency() lives here in quanteda >= 3

txt <- c(doc1 = "There are a lot of dairy farms in Australia.",
         doc2 = "Dairy comprise a significant percent of all farms in Australia.",
         doc3 = "Not all farms are dairy farms, in fact, most farms are not.",
         doc4 = "Some are lucky to receive a service from Australia Post.")

x <- tokens(txt, remove_punct = TRUE)

x1 <- dfm(x)
t1 <- textstat_frequency(x1)
t1

dict  <- dictionary(list(dairy = c("dairy", "dairy farms","dairy farm","milk"),
                         auspost = "australia post",
                         aus = c("australia", "this country", "our country"),
                         farmers = c("farmers", "farmer", "farm", "farms")))

kwicdict <- kwic(x, pattern = dict, window = 3)
kwicdict 

## Count dictionary keys
dfm1 <- dfm(tokens_lookup(x, dictionary = dict))
d1  <- convert(dfm1, to = "data.frame")
d1

## Count phrases (i.e., the phrases that make up each key)
dfm2 <- dfm(tokens_compound(x, dict)) %>% 
  dfm_select(dict)

dat22 <- convert(dfm2, to = "data.frame")
head(dat22)

However, in the actual data, when I use the textstat_frequency command, “Australia” occurs 1716 times, the n-gram “this country” 526 times, and “our country” 91 times, totalling 2333. “Australia Post” occurs 145 times. When I run the same code to count phrases, “aus” is recorded as occurring 2333 times and “australia post” 145 times. It seems that the 145 occurrences of “Australia” inside “Australia Post” are still being counted under “aus”. I believe the total for “aus” should be 2188.

I can’t work out why the code works on the dummy data and not on the actual data. Any advice is appreciated.

Upvotes: 1

Views: 60

Answers (1)

Ken Benoit

Reputation: 14902

Lookup functions such as tokens_lookup() and kwic() will, by default, match each dictionary value against each occurrence found. That means that if a value (e.g., "australia") occurs both within a phrase and on its own across more than one dictionary key, it will be counted once for the compound match (e.g., "australia post") and once for the individual match. You can control this behaviour in tokens_lookup() using nested_scope = "dictionary", but it will still replace the values with keys. This option does not exist in kwic().
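As a minimal sketch of that behaviour, using one document of the dummy data and a trimmed-down dictionary (and assuming a quanteda version that has the nested_scope argument in tokens_lookup()):

```r
library(quanteda)

toks <- tokens("Some are lucky to receive a service from Australia Post.",
               remove_punct = TRUE)
dict <- dictionary(list(auspost = "australia post",
                        aus = c("australia", "this country")))

# default nested_scope = "key": "australia" is counted both as part of the
# "auspost" phrase match and on its own under "aus"
convert(dfm(tokens_lookup(toks, dict)), to = "data.frame")

# nested_scope = "dictionary": the value nested inside "australia post" is
# no longer counted separately under "aus"
convert(dfm(tokens_lookup(toks, dict, nested_scope = "dictionary")),
        to = "data.frame")
```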

So how about a different way of counting your frequencies, using compounding to control what is nested? Here's how:

> tokens_compound(x, dict, concatenator = " ") |>
    dfm() |>
    t() |>
    print(-1, -1)

Document-feature matrix of: 24 documents, 4 features (63.54% sparse) and 0 docvars.
                features
docs             doc1 doc2 doc3 doc4
  there             1    0    0    0
  are               1    0    2    1
  a                 1    1    0    1
  lot               1    0    0    0
  of                1    1    0    0
  dairy farms       1    0    1    0
  in                1    1    1    0
  australia         1    1    0    0
  dairy             0    1    0    0
  comprise          0    1    0    0
  significant       0    1    0    0
  percent           0    1    0    0
  all               0    1    1    0
  farms             0    1    2    0
  not               0    0    2    0
  fact              0    0    1    0
  most              0    0    1    0
  some              0    0    0    1
  lucky             0    0    0    1
  to                0    0    0    1
  receive           0    0    0    1
  service           0    0    0    1
  from              0    0    0    1
  australia post    0    0    0    1

Here, the "Australia" in "Australia Post" is not counted as "Australia".
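If you only want totals for the dictionary phrases rather than the full transposed matrix, one option (a sketch, reusing the x and dict objects from the question) is to select the compounded features with the dictionary and sum the columns:

```r
# keep only features matching dictionary values, then total them;
# the "australia" inside "australia post" is not double-counted
dfm_c <- tokens_compound(x, dict, concatenator = " ") |>
  dfm() |>
  dfm_select(dict)
colSums(dfm_c)
```

On the dummy data this gives "australia" a count of 2 and "australia post" a count of 1, which matches the totals you expect.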

Upvotes: 0
