bgreen

Reputation: 87

Quanteda overlap frequency reporting problem

I’m trying to track down an apparent miscalculation caused by overlapping terms. On the dummy data set below (using the same code I run on the actual data), the analysis works as expected: “Australia” is identified as occurring twice and “Australia Post” once.

library(quanteda)
library(quanteda.textstats)  # textstat_frequency() lives here in quanteda >= 3

txt <- c(doc1 = "There are a lot of dairy farms in Australia.",
         doc2 = "Dairy comprise a significant percent of all farms in Australia.",
         doc3 = "Not all farms are dairy farms, in fact, most farms are not.",
         doc4 = "Some are lucky to receive a service from Australia Post.")

x <- tokens(txt, remove_punct = TRUE)

x1 <- dfm(x)
t1 <- textstat_frequency(x1)
t1

dict  <- dictionary(list(dairy = c("dairy", "dairy farms","dairy farm","milk"),
                         auspost = "australia post",
                         aus = c("australia", "this country", "our country"),
                         farmers = c("farmers", "farmer", "farm", "farms")))

kwicdict <- kwic(x, pattern = dict, window = 3)
kwicdict 

## Count dictionary keys
dfm1 <- dfm(tokens_lookup(x, dictionary = dict))
d1  <- convert(dfm1, to = "data.frame")
d1

## Count phrases (i.e., the phrases that make up each key)
dfm2 <- dfm(tokens_compound(x, dict)) %>% 
  dfm_select(dict)

dat22 <- convert(dfm2, to = "data.frame")
head(dat22)

However, in the actual data, when I use the textstat_frequency command, “Australia” occurs 1716 times, the n-gram “this country” 526 times, and “our country” 91 times, totalling 2333. “Australia Post” occurs 145 times. When I run the same code to count phrases, “aus” is recorded as occurring 2333 times and “australia post” 145 times. It seems that the 145 occurrences of “Australia” inside “Australia Post” are still being counted under “aus”. I believe the total for “aus” should be 2188.

I can’t work out why the code works on the dummy data and not on the actual data. Any advice is appreciated.

Upvotes: 1

Views: 60

Answers (1)

Ken Benoit

Reputation: 14902

Lookup functions such as tokens_lookup() and kwic() will, by default, match each dictionary value against each occurrence found. That means that if a value (e.g., "australia") occurs both within a phrase and on its own across more than one dictionary key, it will be counted once for the compound match (e.g., "australia post") and once for the individual match. You can control this behaviour in tokens_lookup() using nested_scope = "dictionary", but it will still replace the values with keys. This option does not exist in kwic().
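As a minimal sketch of that behaviour, using one document of the dummy data and a trimmed-down dictionary (and assuming a quanteda version that has the nested_scope argument in tokens_lookup()):

```r
library(quanteda)

toks <- tokens("Some are lucky to receive a service from Australia Post.",
               remove_punct = TRUE)
dict <- dictionary(list(auspost = "australia post",
                        aus = c("australia", "this country")))

# default nested_scope = "key": "australia" is counted both as part of the
# "auspost" phrase match and on its own under "aus"
convert(dfm(tokens_lookup(toks, dict)), to = "data.frame")

# nested_scope = "dictionary": the value nested inside "australia post" is
# no longer counted separately under "aus"
convert(dfm(tokens_lookup(toks, dict, nested_scope = "dictionary")),
        to = "data.frame")
```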

So how about a different way of counting your frequencies, using compounding to control what is nested? Here's how:

> tokens_compound(x, dict, concatenator = " ") |>
    dfm() |>
    t() |>
    print(-1, -1)

Document-feature matrix of: 24 documents, 4 features (63.54% sparse) and 0 docvars.
                features
docs             doc1 doc2 doc3 doc4
  there             1    0    0    0
  are               1    0    2    1
  a                 1    1    0    1
  lot               1    0    0    0
  of                1    1    0    0
  dairy farms       1    0    1    0
  in                1    1    1    0
  australia         1    1    0    0
  dairy             0    1    0    0
  comprise          0    1    0    0
  significant       0    1    0    0
  percent           0    1    0    0
  all               0    1    1    0
  farms             0    1    2    0
  not               0    0    2    0
  fact              0    0    1    0
  most              0    0    1    0
  some              0    0    0    1
  lucky             0    0    0    1
  to                0    0    0    1
  receive           0    0    0    1
  service           0    0    0    1
  from              0    0    0    1
  australia post    0    0    0    1

Here, the "Australia" in "Australia Post" is not counted as "Australia".
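If you only want totals for the dictionary phrases rather than the full transposed matrix, one option (a sketch, reusing the x and dict objects from the question) is to select the compounded features with the dictionary and sum the columns:

```r
# keep only features matching dictionary values, then total them;
# the "australia" inside "australia post" is not double-counted
dfm_c <- tokens_compound(x, dict, concatenator = " ") |>
  dfm() |>
  dfm_select(dict)
colSums(dfm_c)
```

On the dummy data this gives "australia" a count of 2 and "australia post" a count of 1, which matches the totals you expect.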

Upvotes: 0
