user3709049
user3709049

Reputation:

r Text Mining: Finding the Frequency of Character Patterns

I was trying to find the frequency of character patterns (word parts) in a large data set.

For example, I have a list of the following in a csv file:

Is there a way to find the frequency of all character combinations? Like:

Update: This is what I have for finding the frequency of all character patterns of length three in my data:

threecombo  <- do.call(paste0,expand.grid(rep(list(c('a', 'b', 'c', 'd','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z')), 3)))

threecompare<-sapply(threecombo, function(x) length(grep(x, myData)))

The code works the way I want it to, and I would like to repeat the above steps for longer character lengths (4, 5, 6, etc) but it takes a while to run. Is there a better way of doing this?

Upvotes: 1

Views: 976

Answers (2)

Ken Benoit
Ken Benoit

Reputation: 14902

Since you are probably looking for combinations of fruit flavors from a set of text that includes non-fruit words, I've made up some documents similar to those in your example. I've used the quanteda package to construct a document-term matrix and then filter based on ngrams containing the fruit words.

docs <- c("One flavor is apple strawberry lime.", 
          "Another flavor is apple grape lime.", 
          "Pineapple mango guava is our newest flavor.",
          "There is also kiwi guava and grape apple.", 
          "Mixed berry was introduced last year.", 
          "Did you like kiwi guava pineapple?",
          "Try the lime mixed berry.")
flavorwords <- c("apple", "guava", "berry", "kiwi", "guava", "grape")

require(quanteda)
# form a document-feature matrix ignoring common stopwords + "like"
# for ngrams, bigrams, trigrams
fruitDfm <- dfm(docs, ngrams = 1:3, ignoredFeatures = c("like", "also", stopwords("english")))
## Creating a dfm from a character vector ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 7 documents
##    ... indexing features: 90 feature types
##    ... removed 47 features, from 176 supplied (glob) feature types
##    ... complete. 
##    ... created a 7 x 40 sparse dfm
## Elapsed time: 0.01 seconds.
# select only those features containing flavorwords as regular expression
fruitDfm <- selectFeatures(fruitDfm, flavorwords, valuetype = "regex")
## kept 22 features, from 5 supplied (regex) feature types
# show the features
topfeatures(fruitDfm, nfeature(fruitDfm))
##                apple                 guava                 grape             pineapple                  kiwi 
##                    3                     3                     2                     2                     2 
##           kiwi_guava                 berry           mixed_berry            strawberry      apple_strawberry 
##                    2                     2                     2                     1                     1 
##      strawberry_lime apple_strawberry_lime           apple_grape            grape_lime      apple_grape_lime 
##                    1                     1                     1                     1                     1 
##      pineapple_mango           mango_guava pineapple_mango_guava           grape_apple       guava_pineapple 
##                    1                     1                     1                     1                     1 
## kiwi_guava_pineapple      lime_mixed_berry 
##                    1                     1 

Added:

If you are looking to match the terms not separated by spaces to the document, you can form ngrams with a null string concatenator, and match as below.

flavorwordsConcat <- c("applestrawberrylime", "applegrapelime", "pineapplemangoguava",
                       "kiwiguava", "grapeapple", "mixedberry", "kiwiguavapineapple",
                       "limemixedberry")

fruitDfm <- dfm(docs, ngrams = 1:3, concatenator = "")
fruitDfm <- fruitDfm[, features(fruitDfm) %in% flavorwordsConcat]
fruitDfm
# Document-feature matrix of: 7 documents, 8 features.
# 7 x 8 sparse Matrix of class "dfmSparse"
#        features
# docs  applestrawberrylime applegrapelime pineapplemangoguava kiwiguava grapeapple mixedberry kiwiguavapineapple limemixedberry
# text1                   1              0                   0         0          0          0                  0              0
# text2                   0              1                   0         0          0          0                  0              0
# text3                   0              0                   1         0          0          0                  0              0
# text4                   0              0                   0         1          1          0                  0              0
# text5                   0              0                   0         0          0          1                  0              0
# text6                   0              0                   0         1          0          0                  1              0
# text7                   0              0                   0         0          0          1                  0              1

If your text contains the concatenated flavour words, then you can match the unigram dfm to all trigram permutations of individual fruit words using

unigramFlavorWords <- c("apple", "guava", "grape", "pineapple", "kiwi")
head(unlist(combinat::permn(unigramFlavorWords, paste, collapse = "")))
[1] "appleguavagrapepineapplekiwi" "appleguavagrapekiwipineapple" "appleguavakiwigrapepineapple" 
[4] "applekiwiguavagrapepineapple" "kiwiappleguavagrapepineapple" "kiwiappleguavapineapplegrape"

Upvotes: 2

arvi1000
arvi1000

Reputation: 9582

Your initial question was a simple task for grep / grepl, and I see you've incorporated this part of my answer into your revised question.

docs <- c('applestrawberrylime', 'applegrapelime', 'pineapplemangoguava',
          'kiwiguava', 'grapeapple', 'mixedberry', 'kiwiguavapineapple',
          'limemixedberry')

patterns <-  c('appleberry', 'guava', 'applestrawberry', 'kiwiguava', 
               'grapeapple', 'grape', 'app', 'ap', 'wig', 'mem', 'go')

# how often does each pattern occur in the set of docs?
sapply(patterns, function(x) sum(grepl(x, docs)))

If you want to check for every possible pattern, you can search for every combination of letters (as you begin doing above), but that's obviously the long way around.

One strategy is to count the frequency only of each pattern that actually occurs. Each document of character length n has 1 possible pattern of length n, 2 patterns of length n - 1 and so on. You can extract each of these, then count em up.

all_patterns <- lapply(docs, function(x) {

    # individual chars in this doc
    chars <- unlist(strsplit(x, ''))

    # unique possible sequence lengths
    seqs <- sapply(1:nchar(x), seq)

    # each sequence in each position
    sapply(seqs, function(y) {
      start_pos <- 0:(nchar(x) - max(y))
      sapply(start_pos, function(z) paste(chars[z + y], collapse=''))
    })
})

unq_patterns <- unique(unlist(all_patterns))

# how often does each unique pattern occur in the set of docs?
occur <- sapply(unq_patterns, function(x) sum(grepl(x, docs)))

# top 25 most frequent patterns
sort(occur, decreasing = T)[1:25]     

# e     i     a     l     p     r     m    ap    pp    pl    le   app   ppl   
# 7     7     6     6     5     5     5     5     5     5     5     5     5
# ple  appl  pple apple     g     w     b     y  ra    be    er    rr 
#   5     5     5     5     5     3     3     3   3     3     3     3 

This works and runs quickly, but as the corpus of docs grows longer, you may bog down (even on this simple example, there are 625 unique patterns). One could use parallel processing for all the s/lapply calls, but still...

Upvotes: 1

Related Questions