Creating and computing percentage of co-occurrence based on keywords

Question

I have the following data set:

df <- data.frame (text  = c("House Sky Blue",
                            "House Sky Green",
                            "House Sky Red",
                            "House Sky Yellow",
                            "House Sky Green",
                            "House Sky Glue",
                            "House Sky Green"))

I'd like to find the percentage of co-occurrence of some terms of tokens. For example, out of all documents, where can I find the token "House" and at the same time how many of them also include the term "Green"?

In out data we have 7 documents that have the term House and 3 out of those 7 p=(100*3/7) also include the term Green, It would be so nice to see also what terms or tokens appear within some p threshold along side the token "House".

I have used these two functions:

textstat_collocations(tokens)

> textstat_collocations(tokens)
  collocation count count_nested length   lambda        z
1   house sky     7            0      2 5.416100 2.622058
2   sky green     3            0      2 2.456736 1.511653

Fun textstat_simil

textstat_simil(dfm(tokens),margin="features")

textstat_simil object; method = "correlation"
       house sky   blue  green    red yellow   glue
house    NaN NaN    NaN    NaN    NaN    NaN    NaN
sky      NaN NaN    NaN    NaN    NaN    NaN    NaN
blue     NaN NaN  1.000 -0.354 -0.167 -0.167 -0.167
green    NaN NaN -0.354  1.000 -0.354 -0.354 -0.354
red      NaN NaN -0.167 -0.354  1.000 -0.167 -0.167
yellow   NaN NaN -0.167 -0.354 -0.167  1.000 -0.167
glue     NaN NaN -0.167 -0.354 -0.167 -0.167  1.000

but they do not seem to give my desired output also I wonder why the correlation btw green and house is NaN for the textsats_simil fun

My desired output would show the following info:

feature="House"
 percentage of co-occurrence 

Green = 3/7
Blue= 1/7
Red = 1/7
Yellow = 1/7
Glue = 1/7

In the quetda docs I can't seem to find a function that can give me my desired output, although I know there must be a way around since I find this library to be so fast and complete.

Ken Benoit · Accepted Answer

One way to do this is using the fcm() to get document-level co-occurrences for a target feature. Below, I show how to do this using fcm(), fcm_remove() to remove the target feature, then a loop to get the desired printed output.

library("quanteda")
#> Package version: 3.2.4
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

df <- data.frame(text = c("House Sky Blue",
                          "House Sky Green",
                          "House Sky Red",
                          "House Sky Yellow",
                          "House Sky Green",
                          "House Sky Glue",
                          "House Sky Green"))
corp <- corpus(df)

coocc_fract <- function(corp, feature) {
   # create a document-level co-occurrence matrix
   fcmat <- fcm(dfm(tokens(corp), tolower = FALSE), context = "document")
   # select for the given feature
   fcmat <- fcm_remove(fcmat, feature)
   cat("feature=\"", feature, "\"
", sep = "")
   cat(" percentage of co-occurrence

")
   for (f in featnames(fcmat)) {
       # skip zeroes
       freq <- as.character(fcmat[1, f])
       if (freq != "0") {
           cat(f, " = ", as.character(fcmat[1, f]), "/", ndoc(corp), 
               "
", sep = "")
       }
   }
}

This produces this output:

coocc_fract(corp, feature = "House")
#> feature="House"
#>  percentage of co-occurrence
#> 
#> Blue = 1/7
#> Green = 3/7
#> Red = 1/7
#> Yellow = 1/7
#> Glue = 1/7

^{Created on 2023-01-02 with reprex v2.0.2}

Creating and computing percentage of co-occurrence based on keywords

Answers (2)

Related Questions