Reputation: 789
I have the following data set:
df <- data.frame(text = c("House Sky Blue",
                          "House Sky Green",
                          "House Sky Red",
                          "House Sky Yellow",
                          "House Sky Green",
                          "House Sky Glue",
                          "House Sky Green"))
I'd like to find the percentage of co-occurrence of certain terms or tokens. For example, of all the documents that contain the token "House", how many also include the term "Green"?
In our data, 7 documents contain the term "House", and 3 of those 7 (p = 100*3/7, about 43%) also include the term "Green". It would also be nice to see which terms or tokens appear alongside the token "House" within some threshold p.
I have tried these two functions:
Function textstat_collocations
> textstat_collocations(tokens)
  collocation count count_nested length   lambda        z
1   house sky     7            0      2 5.416100 2.622058
2   sky green     3            0      2 2.456736 1.511653
Function textstat_simil
textstat_simil(dfm(tokens),margin="features")
textstat_simil object; method = "correlation"
       house sky   blue  green    red yellow   glue
house    NaN NaN    NaN    NaN    NaN    NaN    NaN
sky      NaN NaN    NaN    NaN    NaN    NaN    NaN
blue     NaN NaN  1.000 -0.354 -0.167 -0.167 -0.167
green    NaN NaN -0.354  1.000 -0.354 -0.354 -0.354
red      NaN NaN -0.167 -0.354  1.000 -0.167 -0.167
yellow   NaN NaN -0.167 -0.354 -0.167  1.000 -0.167
glue     NaN NaN -0.167 -0.354 -0.167 -0.167  1.000
but neither seems to give my desired output. I also wonder why the correlation between green and house is NaN in the textstat_simil output.
My desired output would show the following info:
feature="House"
percentage of co-occurrence
Green = 3/7
Blue = 1/7
Red = 1/7
Yellow = 1/7
Glue = 1/7
In the quanteda docs I can't seem to find a function that gives me my desired output, although I suspect there must be a way, since I find this library so fast and complete.
Upvotes: 0
Views: 146
Reputation: 14902
One way to do this is to use fcm() to get document-level co-occurrences for a target feature. Below, I show how to do this using fcm(), then fcm_remove() to remove the target feature, then a loop to get the desired printed output.
library("quanteda")
#> Package version: 3.2.4
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
df <- data.frame(text = c("House Sky Blue",
                          "House Sky Green",
                          "House Sky Red",
                          "House Sky Yellow",
                          "House Sky Green",
                          "House Sky Glue",
                          "House Sky Green"))
corp <- corpus(df)
coocc_fract <- function(corp, feature) {
  # create a document-level co-occurrence matrix
  fcmat <- fcm(dfm(tokens(corp), tolower = FALSE), context = "document")
  # remove the target feature itself from the matrix
  fcmat <- fcm_remove(fcmat, feature)

  cat("feature=\"", feature, "\"\n", sep = "")
  cat(" percentage of co-occurrence\n\n")
  for (f in featnames(fcmat)) {
    # counts are read from the first row left after the removal
    freq <- as.character(fcmat[1, f])
    # skip zeroes
    if (freq != "0") {
      cat(f, " = ", freq, "/", ndoc(corp), "\n", sep = "")
    }
  }
}
This produces the following output:
coocc_fract(corp, feature = "House")
#> feature="House"
#> percentage of co-occurrence
#>
#> Blue = 1/7
#> Green = 3/7
#> Red = 1/7
#> Yellow = 1/7
#> Glue = 1/7
Created on 2023-01-02 with reprex v2.0.2
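If the target feature does not occur in every document, the denominator in the question's definition is the number of documents containing the target rather than the total number of documents. A minimal sketch of that variant, which also looks the target up by name and supports a threshold, is below. `cooc_share()` is a hypothetical helper, not a quanteda function; it only needs a document-by-feature count matrix, which is built here in base R so the example runs without extra packages. With quanteda, `as.matrix(dfm(tokens(corp), tolower = FALSE))` should give an equivalent matrix.

```r
txt <- c("House Sky Blue", "House Sky Green", "House Sky Red",
         "House Sky Yellow", "House Sky Green", "House Sky Glue",
         "House Sky Green")

# build a document-by-feature count matrix in base R
toks  <- strsplit(txt, " ", fixed = TRUE)
feats <- unique(unlist(toks))
counts <- t(vapply(toks,
                   function(tk) tabulate(match(tk, feats), nbins = length(feats)),
                   integer(length(feats))))
colnames(counts) <- feats

# hypothetical helper: share of the documents containing `feature`
# that also contain each of the other features
cooc_share <- function(x, feature) {
  present <- as.matrix(x) > 0            # TRUE where a feature occurs in a document
  stopifnot(feature %in% colnames(present))
  has_target <- present[, feature]       # documents that contain the target
  others <- present[has_target, colnames(present) != feature, drop = FALSE]
  sort(colSums(others) / sum(has_target), decreasing = TRUE)
}

shares <- cooc_share(counts, "House")
shares

# keep only the features at or above a chosen threshold, e.g. 0.4
shares[shares >= 0.4]
```

For this data the share for "Green" is 3/7, matching the question; "Sky" is also reported because it co-occurs with "House" in every document.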
Upvotes: 2
Reputation: 23598
I couldn't find anything built into quanteda, so I cobbled something together: one function that creates a list object holding the chosen word and a frequency table, and one print method that prints the output in the format you want. You can adjust the functions to return just what you need and add more checks on the inputs.
Code part:
dat <- data.frame(text = c("House Sky Blue",
                           "House Sky Green",
                           "House Sky Red",
                           "House Sky Yellow",
                           "House Sky Green",
                           "House Sky Glue",
                           "House Sky Green"))
library(quanteda)
library(quanteda.textstats)
my_dfm <- dfm(tokens(corpus(dat)))
freqs <- textstat_frequency(my_dfm)
# create function to return a list with the chosen word and a frequency table
create_co_occurrence <- function(x, word) {
  if (!inherits(x, "frequency")) {
    stop("x must be a frequency table generated by textstat_frequency.",
         call. = FALSE)
  }
  # check that word is a single character string
  if (!is.character(word) || length(word) != 1L) {
    stop("word must be a single character string.", call. = FALSE)
  }

  input <- x
  word_frequency <- input$frequency[input$feature == word]
  out <- input[input$feature != word, ]
  # frequency of each other feature relative to the frequency of word
  out$percentage <- out$frequency / word_frequency
  out <- out[, c("feature", "percentage")]

  # reset row.names
  row.names(out) <- NULL
  out_list <- list(word = word,
                   co_occurrence = out)
  class(out_list) <- c("co_occurrence", "list")
  out_list
}
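# create print method for objects of class co_occurrence
print.co_occurrence <- function(x, ...) {
  writeLines(sprintf("feature = %s", x$word))
  writeLines("percentage of co-occurrence\n")
  print.data.frame(x$co_occurrence)
  # return the object invisibly, as print methods conventionally do
  invisible(x)
}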
output:
test <- create_co_occurrence(freqs, "house")
# printing test dispatches to print.co_occurrence(), which formats the results
test
feature = house
percentage of co-occurrence
  feature percentage
2     sky  1.0000000
3   green  0.4285714
4    blue  0.1428571
5     red  0.1428571
6  yellow  0.1428571
7    glue  0.1428571
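On the side question about the NaN values in the textstat_simil output: "house" and "sky" occur exactly once in every document, so their columns have zero variance and the Pearson correlation is undefined for them. A small base-R illustration of the same effect, using `cor()` on hand-written 0/1 vectors rather than `textstat_simil()` itself (the vector names are only for illustration):

```r
# 1 if the document contains the feature, 0 otherwise (7 documents)
house <- rep(1, 7)
green <- c(0, 1, 0, 0, 1, 0, 1)
blue  <- c(1, 0, 0, 0, 0, 0, 0)

# "house" never varies across documents, so its standard deviation is zero
sd(house)

# correlation divides by the standard deviations, so a constant column
# gives a missing value (base R also issues a warning, suppressed here)
r_house_green <- suppressWarnings(cor(house, green))
is.na(r_house_green)

# two non-constant columns give an ordinary correlation
round(cor(blue, green), 3)  # -0.354, as in the table in the question
```

So the correlation method cannot say anything about features that appear in every document, which is why a count-based approach is needed for "House".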
Upvotes: 2