R_Student

Reputation: 789

Creating and computing percentage of co-occurrence based on keywords

I have the following data set:

df <- data.frame(text = c("House Sky Blue",
                          "House Sky Green",
                          "House Sky Red",
                          "House Sky Yellow",
                          "House Sky Green",
                          "House Sky Glue",
                          "House Sky Green"))

I'd like to find the percentage of co-occurrence of certain terms or tokens. For example, among all the documents that contain the token "House", how many also include the token "Green"?

In our data, 7 documents contain the term "House", and 3 of those 7 also include the term "Green", so p = 100*3/7. It would also be nice to see which terms or tokens appear alongside the token "House" above some threshold p.

I have used these two functions:

> textstat_collocations(tokens)
  collocation count count_nested length   lambda        z
1   house sky     7            0      2 5.416100 2.622058
2   sky green     3            0      2 2.456736 1.511653

and the function textstat_simil:

textstat_simil(dfm(tokens),margin="features")

textstat_simil object; method = "correlation"
       house sky   blue  green    red yellow   glue
house    NaN NaN    NaN    NaN    NaN    NaN    NaN
sky      NaN NaN    NaN    NaN    NaN    NaN    NaN
blue     NaN NaN  1.000 -0.354 -0.167 -0.167 -0.167
green    NaN NaN -0.354  1.000 -0.354 -0.354 -0.354
red      NaN NaN -0.167 -0.354  1.000 -0.167 -0.167
yellow   NaN NaN -0.167 -0.354 -0.167  1.000 -0.167
glue     NaN NaN -0.167 -0.354 -0.167 -0.167  1.000

but they do not seem to give my desired output. I also wonder why the correlation between "green" and "house" is NaN in the textstat_simil output.
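A likely explanation for the NaN values: "house" and "sky" occur in every document, so their columns in the dfm are constant, and Pearson correlation is undefined when a variable has zero standard deviation. A quick base-R check, writing the same column patterns out by hand:

```r
# "house" appears in all 7 documents, so its dfm column is constant;
# "green" appears in documents 2, 5 and 7
house <- c(1, 1, 1, 1, 1, 1, 1)
green <- c(0, 1, 0, 0, 1, 0, 1)

sd(house)          # 0: zero variance
cor(house, green)  # NA, with the warning "the standard deviation is zero"
```

quanteda's matrix arithmetic presumably surfaces the same undefined result as NaN rather than NA.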

My desired output would show the following info:

feature="House"
 percentage of co-occurrence 

Green = 3/7
Blue = 1/7
Red = 1/7
Yellow = 1/7
Glue = 1/7

In the quanteda docs I can't seem to find a function that gives my desired output directly, although I suspect there must be a way, since I find this library to be fast and complete.

Upvotes: 0

Views: 146

Answers (2)

Ken Benoit

Reputation: 14902

One way to do this is to use fcm() to get document-level co-occurrences for a target feature. Below, I show how to do this using fcm(), then fcm_remove() to remove the target feature, and finally a loop to produce the desired printed output.

library("quanteda")
#> Package version: 3.2.4
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

df <- data.frame(text = c("House Sky Blue",
                          "House Sky Green",
                          "House Sky Red",
                          "House Sky Yellow",
                          "House Sky Green",
                          "House Sky Glue",
                          "House Sky Green"))
corp <- corpus(df)

coocc_fract <- function(corp, feature) {
    # create a document-level co-occurrence matrix
    fcmat <- fcm(dfm(tokens(corp), tolower = FALSE), context = "document")
    # remove the target feature itself
    fcmat <- fcm_remove(fcmat, feature)
    cat("feature=\"", feature, "\"\n", sep = "")
    cat(" percentage of co-occurrence\n\n")
    for (f in featnames(fcmat)) {
        freq <- as.character(fcmat[1, f])
        # skip features with zero co-occurrences
        if (freq != "0") {
            cat(f, " = ", freq, "/", ndoc(corp), "\n", sep = "")
        }
    }
}

This produces the following output:

coocc_fract(corp, feature = "House")
#> feature="House"
#>  percentage of co-occurrence
#> 
#> Blue = 1/7
#> Green = 3/7
#> Red = 1/7
#> Yellow = 1/7
#> Glue = 1/7

Created on 2023-01-02 with reprex v2.0.2
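For comparison, the same fractions can also be computed without quanteda, from a logical document-term matrix in base R. This is just a sketch that assumes whitespace-separated tokens; all names here are illustrative:

```r
texts <- c("House Sky Blue", "House Sky Green", "House Sky Red",
           "House Sky Yellow", "House Sky Green", "House Sky Glue",
           "House Sky Green")
toks <- strsplit(tolower(texts), " ", fixed = TRUE)
vocab <- sort(unique(unlist(toks)))

# logical document-term matrix: TRUE if a token occurs in a document
dtm <- t(vapply(toks, function(x) vocab %in% x, logical(length(vocab))))
colnames(dtm) <- vocab

# fraction of the "house" documents that also contain each other token
has_house <- dtm[, "house"]
co <- colSums(dtm[has_house, , drop = FALSE]) / sum(has_house)
co[setdiff(vocab, "house")]
# green = 3/7 (~0.43); blue, red, yellow, glue = 1/7 each; sky = 1
```

The denominator here is the number of documents containing the target token, matching the 3/7 and 1/7 fractions in the question.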

Upvotes: 2

phiver

Reputation: 23598

I couldn't find anything inside quanteda, so I cobbled something together: one function that creates a list object holding the chosen word and a frequency table, and one print method that prints the output the way you want. You can adjust the functions to return just what you need, and add more checks on the inputs.

Code part:

dat <- data.frame(text = c("House Sky Blue",
                           "House Sky Green",
                           "House Sky Red",
                           "House Sky Yellow",
                           "House Sky Green",
                           "House Sky Glue",
                           "House Sky Green"))


library(quanteda)
library(quanteda.textstats)

my_dfm <- dfm(tokens(corpus(dat)))
freqs <- textstat_frequency(my_dfm)

# create function to return a list with the chosen word and a frequency table    
create_co_occurrence <- function(x, word) {
  
  if (!inherits(x, "frequency")) {
    stop("x must be a frequency table generated by textstat_frequency.",
         call. = FALSE)
  }
  
  # add check to see if word is a character
  
  input <- x
  
  word_frequency <- input$frequency[input$feature == word]
  
  out <- input[input$feature != word, ]
  out$percentage <- out$frequency / word_frequency
  out <- out[, c("feature", "percentage")]
  # reset row.names
  row.names(out) <- NULL

  out_list <- list(word = word,
                   co_occurrence = out)
    
  class(out_list) <- c("co_occurrence", "list")
  out_list
}

# create print function.
print.co_occurrence <- function(x, ...) {
  
  writeLines(sprintf("feature = %s", x$word))
  writeLines("percentage of co-occurrence\n")
  print.data.frame(x$co_occurrence)
}

output:

test <- create_co_occurrence(freqs, "house")

# calling test will activate the print.co_occurrence method and format the results
test

feature = house
percentage of co-occurrence

  feature percentage
2     sky  1.0000000
3   green  0.4285714
4    blue  0.1428571
5     red  0.1428571
6  yellow  0.1428571
7    glue  0.1428571

Upvotes: 2
