I was trying to find the frequency of character patterns (word parts) in a large data set.
For example, I have a list of the following in a CSV file:
Is there a way to find the frequency of all character combinations? Like:
Update: This is what I have for finding the frequency of all character patterns of length three in my data:
threecombo <- do.call(paste0, expand.grid(rep(list(letters), 3)))
threecompare <- sapply(threecombo, function(x) length(grep(x, myData)))
The code works the way I want it to, and I would like to repeat the above steps for longer character lengths (4, 5, 6, etc.), but it takes a while to run. Is there a better way of doing this?
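One way to avoid retyping the alphabet for each length is to wrap the two steps above in a function of the pattern length. This is only a sketch: `count_patterns` is a hypothetical helper, and `myData` stands in for your own character vector. `fixed = TRUE` is used because the patterns are literal substrings, not regular expressions.

```r
# Sketch: generalize the steps above to any pattern length k.
# myData is a placeholder for your own character vector of documents.
myData <- c("applestrawberrylime", "applegrapelime", "kiwiguava")

count_patterns <- function(k, data) {
  # all length-k strings of lowercase letters
  combos <- do.call(paste0, expand.grid(rep(list(letters), k)))
  # number of documents each string appears in
  sapply(combos, function(x) sum(grepl(x, data, fixed = TRUE)))
}

counts <- count_patterns(2, myData)
```

This still enumerates all 26^k strings, so it will slow down quickly as k grows; the answers below avoid that by only counting patterns that actually occur.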
Upvotes: 1
Views: 976
Reputation: 14902
Since you are probably looking for combinations of fruit flavors from a set of text that includes non-fruit words, I've made up some documents similar to those in your example. I've used the quanteda package to construct a document-term matrix and then filter based on ngrams containing the fruit words.
docs <- c("One flavor is apple strawberry lime.",
"Another flavor is apple grape lime.",
"Pineapple mango guava is our newest flavor.",
"There is also kiwi guava and grape apple.",
"Mixed berry was introduced last year.",
"Did you like kiwi guava pineapple?",
"Try the lime mixed berry.")
flavorwords <- c("apple", "guava", "berry", "kiwi", "grape")
require(quanteda)
# form a document-feature matrix ignoring common stopwords + "like"
# for ngrams, bigrams, trigrams
fruitDfm <- dfm(docs, ngrams = 1:3, ignoredFeatures = c("like", "also", stopwords("english")))
## Creating a dfm from a character vector ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 7 documents
## ... indexing features: 90 feature types
## ... removed 47 features, from 176 supplied (glob) feature types
## ... complete.
## ... created a 7 x 40 sparse dfm
## Elapsed time: 0.01 seconds.
# select only those features containing flavorwords as regular expression
fruitDfm <- selectFeatures(fruitDfm, flavorwords, valuetype = "regex")
## kept 22 features, from 5 supplied (regex) feature types
# show the features
topfeatures(fruitDfm, nfeature(fruitDfm))
## apple guava grape pineapple kiwi
## 3 3 2 2 2
## kiwi_guava berry mixed_berry strawberry apple_strawberry
## 2 2 2 1 1
## strawberry_lime apple_strawberry_lime apple_grape grape_lime apple_grape_lime
## 1 1 1 1 1
## pineapple_mango mango_guava pineapple_mango_guava grape_apple guava_pineapple
## 1 1 1 1 1
## kiwi_guava_pineapple lime_mixed_berry
## 1 1
Added:
If you are looking to match the terms not separated by spaces in the document, you can form the ngrams with an empty-string concatenator and match as below.
flavorwordsConcat <- c("applestrawberrylime", "applegrapelime", "pineapplemangoguava",
"kiwiguava", "grapeapple", "mixedberry", "kiwiguavapineapple",
"limemixedberry")
fruitDfm <- dfm(docs, ngrams = 1:3, concatenator = "")
fruitDfm <- fruitDfm[, features(fruitDfm) %in% flavorwordsConcat]
fruitDfm
# Document-feature matrix of: 7 documents, 8 features.
# 7 x 8 sparse Matrix of class "dfmSparse"
# features
# docs applestrawberrylime applegrapelime pineapplemangoguava kiwiguava grapeapple mixedberry kiwiguavapineapple limemixedberry
# text1 1 0 0 0 0 0 0 0
# text2 0 1 0 0 0 0 0 0
# text3 0 0 1 0 0 0 0 0
# text4 0 0 0 1 1 0 0 0
# text5 0 0 0 0 0 1 0 0
# text6 0 0 0 1 0 0 1 0
# text7 0 0 0 0 0 1 0 1
If your text contains the concatenated flavor words, then you can match the unigram dfm to all trigram permutations of the individual fruit words using
unigramFlavorWords <- c("apple", "guava", "grape", "pineapple", "kiwi")
head(unlist(combinat::permn(unigramFlavorWords, paste, collapse = "")))
[1] "appleguavagrapepineapplekiwi" "appleguavagrapekiwipineapple" "appleguavakiwigrapepineapple"
[4] "applekiwiguavagrapepineapple" "kiwiappleguavagrapepineapple" "kiwiappleguavapineapplegrape"
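Note that permn() permutes the full five-word vector. If you instead want every ordered triple of fruit words, one base-R sketch (perms() is a small hypothetical helper, not part of the answer above) is to choose each 3-word subset with combn() and permute each subset:

```r
# Sketch: every ordered triple of fruit words, pasted with no separator.
# perms() is a hypothetical helper returning all orderings of a vector.
perms <- function(x) {
  if (length(x) == 1) return(list(x))
  out <- list()
  for (i in seq_along(x))
    out <- c(out, lapply(perms(x[-i]), function(p) c(x[i], p)))
  out
}

unigramFlavorWords <- c("apple", "guava", "grape", "pineapple", "kiwi")

# choose each set of 3 words, then generate every ordering of each set
triples <- combn(unigramFlavorWords, 3, simplify = FALSE)
trigramPerms <- unlist(lapply(triples, function(x)
  sapply(perms(x), paste, collapse = "")))
```

With 5 words there are 10 subsets of size 3 and 6 orderings each, so trigramPerms has 60 candidate strings to match against the dfm features.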
Upvotes: 2
Reputation: 9582
Your initial question was a simple task for grep/grepl, and I see you've incorporated this part of my answer into your revised question.
docs <- c('applestrawberrylime', 'applegrapelime', 'pineapplemangoguava',
'kiwiguava', 'grapeapple', 'mixedberry', 'kiwiguavapineapple',
'limemixedberry')
patterns <- c('appleberry', 'guava', 'applestrawberry', 'kiwiguava',
'grapeapple', 'grape', 'app', 'ap', 'wig', 'mem', 'go')
# how often does each pattern occur in the set of docs?
sapply(patterns, function(x) sum(grepl(x, docs)))
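Since these patterns are literal substrings rather than regular expressions, passing fixed = TRUE to grepl skips the regex engine and is usually faster. A minimal sketch of that tweak (not part of the original answer), using a subset of the patterns above:

```r
docs <- c('applestrawberrylime', 'applegrapelime', 'pineapplemangoguava',
          'kiwiguava', 'grapeapple', 'mixedberry', 'kiwiguavapineapple',
          'limemixedberry')
patterns <- c('guava', 'grape', 'app')

# fixed = TRUE treats each pattern as a plain substring, not a regex
hits <- sapply(patterns, function(x) sum(grepl(x, docs, fixed = TRUE)))
```

The counts are identical to the regex version for patterns with no metacharacters; the difference is only speed.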
If you want to check for every possible pattern, you can search for every combination of letters (as you begin doing above), but that's obviously the long way around.
One strategy is to count the frequency of only those patterns that actually occur. Each document of character length n has 1 possible pattern of length n, 2 patterns of length n - 1, and so on. You can extract each of these, then count them up.
all_patterns <- lapply(docs, function(x) {
  # individual chars in this doc
  chars <- unlist(strsplit(x, ''))
  # unique possible sequence lengths
  seqs <- sapply(1:nchar(x), seq)
  # each sequence in each position
  sapply(seqs, function(y) {
    start_pos <- 0:(nchar(x) - max(y))
    sapply(start_pos, function(z) paste(chars[z + y], collapse = ''))
  })
})
unq_patterns <- unique(unlist(all_patterns))
# how often does each unique pattern occur in the set of docs?
occur <- sapply(unq_patterns, function(x) sum(grepl(x, docs)))
# top 25 most frequent patterns
sort(occur, decreasing = TRUE)[1:25]
# e i a l p r m ap pp pl le app ppl
# 7 7 6 6 5 5 5 5 5 5 5 5 5
# ple appl pple apple g w b y ra be er rr
# 5 5 5 5 5 3 3 3 3 3 3 3
This works and runs quickly, but as the corpus of docs grows, you may bog down (even in this simple example, there are 625 unique patterns). One could use parallel processing for the sapply/lapply calls, but still...
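As a sketch of that idea, the counting step can be handed to parallel::mclapply. This assumes a Unix-alike, since mclapply forks; on Windows you would need parLapply with a cluster instead.

```r
library(parallel)

docs <- c('applestrawberrylime', 'applegrapelime', 'kiwiguava')

# every substring that actually occurs (condensed version of the
# extraction step above)
unq_patterns <- unique(unlist(lapply(docs, function(x) {
  chars <- strsplit(x, '')[[1]]
  n <- nchar(x)
  unlist(lapply(1:n, function(len)
    sapply(0:(n - len), function(s) paste(chars[s + 1:len], collapse = ''))))
})))

# count each pattern on a separate worker
occur <- unlist(mclapply(unq_patterns,
                         function(p) sum(grepl(p, docs, fixed = TRUE)),
                         mc.cores = 2))
names(occur) <- unq_patterns
```

Forking only pays off once the pattern set is large; for small corpora the fork overhead can outweigh the per-pattern grepl cost.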
Upvotes: 1