Reputation: 1
I want to replicate, in R, a measure of common words described in a paper.
They describe their procedure as follows: "To construct Common words,..., we first determine the relative frequency of all words occurring in all documents. We then calculate Common words as the average of this proportion for every word occurring in a given document. The higher the value of common words, the more ordinary is the document’s language and thus the more readable it should be." (Loughran & McDonald 2014)
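As far as I can tell, the quoted procedure can be written down as follows. The notation is mine, not the paper's: c_w is the total count of word w across all documents, V is the full vocabulary, and V_d is the set of distinct words occurring in document d.

```latex
p_w = \frac{c_w}{\sum_{v \in V} c_v},
\qquad
\mathrm{CommonWords}_d = \frac{1}{|V_d|} \sum_{w \in V_d} p_w
```

This assumes the average is taken over distinct words. If it is instead taken over every token in the document, each p_w would be weighted by how often w occurs in d and the sum divided by the document's length; the quote does not say which is meant.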
Can anybody help me with this? I work with corpus objects to analyze the text documents in R.
I have already computed the relative frequency of all words occurring in all documents as follows:
dfm_Notes_Summary <- dfm(tokens_Notes_Summary)
Summary_FreqStats_Notes <- textstat_frequency(dfm_Notes_Summary)
Summary_FreqStats_Notes$RelativeFreq <- Summary_FreqStats_Notes$frequency/sum(Summary_FreqStats_Notes$frequency)
In other words, I converted the tokens object (tokens_Notes_Summary) into a dfm object (dfm_Notes_Summary) and computed the relative frequency of every word across all documents.
Now I am struggling to calculate, for each document, the average of this proportion over the words occurring in that document.
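A minimal sketch of what that per-document step might look like, assuming the measure is built on corpus-wide relative term frequencies. The two-document tokens object below is a hypothetical stand-in for the real data, and the helper names (rel_freq, counts, present, common_types, common_tokens) are made up for illustration. It computes the average both over distinct words and over all tokens, because the paper's wording could mean either.

```r
library("quanteda")

# Hypothetical toy stand-in for the real tokens object
tokens_Notes_Summary <- tokens(c(doc1 = "the cat sat on the mat",
                                 doc2 = "the dog barked"))
dfm_Notes_Summary <- dfm(tokens_Notes_Summary)

# Corpus-wide relative frequency of each word (sums to 1 over all features)
rel_freq <- colSums(dfm_Notes_Summary) / sum(dfm_Notes_Summary)

# Dense copy for readability; fine for a toy example, costly for a large corpus
counts <- as.matrix(dfm_Notes_Summary)
present <- (counts > 0) * 1  # 1 if the word occurs in the document, else 0

# Reading 1: average over the distinct words of each document
common_types <- as.vector(present %*% rel_freq) / rowSums(present)

# Reading 2: average over all tokens, so repeated words count repeatedly
common_tokens <- as.vector(counts %*% rel_freq) / rowSums(counts)

names(common_types) <- names(common_tokens) <- docnames(dfm_Notes_Summary)
```

Both vectors hold one value per document, between 0 and 1. An empty document would yield NaN because of the division by zero, so such documents may need to be dropped first.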
Reputation: 14902
I reread Loughran and McDonald (2014) to work out what they meant, since I could not find any code for this measure. I think it is based on the average of the document frequencies of a document's terms. The code will probably make this clearer:
library("quanteda")
#> Package version: 3.2.3
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
dfmat <- data_corpus_inaugural |>
    head(5) |>
    tokens(remove_punct = TRUE, remove_numbers = TRUE) |>
    dfm()

readability_commonwords <- function(x) {
    # compute document frequencies of all features
    relative_docfreq <- docfreq(x) / nfeat(x)
    # average of all words by the relative document frequency
    result <- x %*% relative_docfreq
    # return as a named vector
    structure(result[, 1], names = rownames(result))
}
readability_commonwords(dfmat)
#> 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson  1805-Jefferson
#>       2.6090768       0.2738525       4.2026818       3.0928314       3.8256833
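Note that these values can exceed 1. The function sums each document's counts weighted by document frequency divided by the number of features, so it does not return an average of proportions. If a value bounded between 0 and 1 is preferred, one possible variant divides the document frequencies by the number of documents and then divides the weighted sum by document length. This is a sketch under my own assumptions, not necessarily what the authors did, and the function name readability_commonwords_avg is hypothetical.

```r
library("quanteda")

# Hypothetical variant of readability_commonwords(); the name is illustrative
readability_commonwords_avg <- function(x) {
    # share of documents in which each feature occurs, in (0, 1]
    relative_docfreq <- docfreq(x) / ndoc(x)
    # count-weighted sum of those shares for each document
    weighted_sum <- (x %*% relative_docfreq)[, 1]
    # divide by document length (in tokens) to turn the sum into an average
    structure(weighted_sum / unname(rowSums(x)), names = docnames(x))
}

dfmat <- data_corpus_inaugural |>
    head(5) |>
    tokens(remove_punct = TRUE, remove_numbers = TRUE) |>
    dfm()

common <- readability_commonwords_avg(dfmat)
common
```

Because each value is a weighted average of shares in (0, 1], the results stay in that range and are comparable across documents of different lengths.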
For the full details, though, you should ask the authors.
Created on 2022-11-30 with reprex v2.0.2