John

Reputation: 109

Count certain letters in each document in a Quanteda corpus

Specifically, I need to count the frequencies of each vowel in each document: e and i as "high" vowels; a, o, and u as "low" vowels.

Is there a way to count the frequencies of certain letters in each document in a quanteda corpus in R? So far, I have only encountered functions that operate at the word or sentence level, like tokens_select() or ntoken().

Any help is welcome. I considered a regex pattern, but I'm not sure how to apply it to each individual document in a Quanteda corpus and get a count from it.

Here is a minimal working example to play around with:

require(quanteda)

text1 <- "This is some gibberish for you."
text2 <- "Some more gibberish. Enjoy!"
text3 <- "Gibber, gibber, gibber away."

corp <- rbind(text1, text2, text3) %>% 
  quanteda::corpus() 

Upvotes: 2

Views: 180

Answers (1)

Ken Benoit

Reputation: 14902

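You want to tokenize the texts as characters, then use a dictionary that maps the vowels to two categories, high and low. Dictionary lookup in tokens_lookup() is case-insensitive by default, so the capital "E" in "Enjoy" is counted as well. Here's how: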

library("quanteda")
## Package version: 2.1.2

text1 <- "This is some gibberish for you."
text2 <- "Some more gibberish. Enjoy!"
text3 <- "Gibber, gibber, gibber away."

corp <- corpus(c(text1, text2, text3))

toks <- tokens(corp, what = "character")
dict <- dictionary(list(
  high_vowels = c("e", "i"),
  low_vowels = c("a", "o", "u")
))

tokens_lookup(toks, dict) %>%
  dfm()
## Document-feature matrix of: 3 documents, 2 features (0.0% sparse).
##        features
## docs    high_vowels low_vowels
##   text1           6          4
##   text2           6          3
##   text3           6          2
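
Since the question mentions a regex, here is a minimal base R sketch of that route, for comparison only. It assumes the texts are available as a named character vector (with a corpus you could probably get one via as.character(corp), or texts(corp) in older quanteda versions; I have not checked that against every release). count_matches() is a hypothetical helper written for this example, not a quanteda function.

```r
# Stand-in for the document texts from the question.
# Assumption: with a quanteda corpus, as.character(corp) would supply this vector.
texts <- c(
  text1 = "This is some gibberish for you.",
  text2 = "Some more gibberish. Enjoy!",
  text3 = "Gibber, gibber, gibber away."
)

# Hypothetical helper: number of regex matches in each element of x.
# ignore.case = TRUE mirrors the case-insensitive dictionary lookup.
count_matches <- function(x, pattern) {
  matches <- regmatches(x, gregexpr(pattern, x, ignore.case = TRUE))
  lengths(matches)
}

vowel_counts <- data.frame(
  doc = names(texts),
  high_vowels = count_matches(texts, "[ei]"),
  low_vowels = count_matches(texts, "[aou]"),
  row.names = NULL
)

vowel_counts
##     doc high_vowels low_vowels
## 1 text1           6          4
## 2 text2           6          3
## 3 text3           6          2
```

Swapping the character class for a single letter, e.g. count_matches(texts, "e"), gives the count for one vowel. The dictionary approach is still the better fit if you want the result to stay a dfm.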

Upvotes: 2
