Reputation: 45
I need to count the frequency of occurrence of words or word phrases in a list, based on a separate source list.
I have a data frame of authors and research areas. Each author has a list of 1 or more research areas (words/word phrases) associated with their name.
Sometimes the same research area occurs more than once, and I want them counted every time (i.e., not a unique list).
I need to count the number of times an author's research areas match those in a set list of research areas.
I can do it on a per-author basis, but not for the entire list of authors.
(In actuality, there are 4 set lists, divided into research categories (life science, social science, etc.), and I need to count the occurrences of research areas per author from each research category, i.e., how many life science areas are in their list, how many social science areas are in their list, etc.
A simple example for one research category is below, but in the real data there are 4 separate and unique 'lexicons'.)
test.small <- data.frame(AuthorID=c("Mavis", "Cleotha", "Yvonne"),
RA=c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries",
"Marine Biology, Marine Biology, Fisheries, Zoology"))
RA.text <- as.character(test.small$RA)
RA.list <- strsplit(RA.text, ", ", perl=TRUE)
lexicon <- c("Fisheries", "Marine Biology")
sum(RA.list[[3]] %in% lexicon)
How do I do this for the entire list, summing the total occurrence for each author individually
and storing that numeric sum in a vector that I can use for other calculations?
Upvotes: 0
Views: 347
Reputation: 39154
We can use str_count from the stringr package. In the following example, test.small2 is a data frame with a column Count showing the word counts.
Notice that I added stringsAsFactors = FALSE when creating test.small to make sure all columns are in character, not factor.
or1 is a function from the rebus package, which creates regular expression syntax |.
By using str_count, we probably don't need to strsplit the string.
# Create example data frame
test.small <- data.frame(AuthorID=c("Mavis", "Cleotha", "Yvonne"),
RA=c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries",
"Marine Biology, Marine Biology, Fisheries, Zoology"),
stringsAsFactors = FALSE)
# Load packages
library(dplyr)
library(stringr)
library(rebus)
# Define the lexicon
lexicon <- c("Fisheries", "Marine Biology")
# Create a new column showing the total number of words matching the lexicon
test.small2 <- test.small %>% mutate(Count = str_count(RA, or1(lexicon)))
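For anyone who would rather not load stringr and rebus, here is a minimal base R sketch of the same idea: join the lexicon terms with | and count the regex matches per row. It assumes the lexicon terms contain no regex metacharacters, and the variable names pattern and matches are mine, not from the answer above. One caveat that applies to any regex count of this kind: matching is by substring, so "Fisheries" would also be counted inside a longer area such as "Fisheries Management".

```r
# Example data from the question
test.small <- data.frame(
  AuthorID = c("Mavis", "Cleotha", "Yvonne"),
  RA = c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries",
         "Marine Biology, Marine Biology, Fisheries, Zoology"),
  stringsAsFactors = FALSE
)
lexicon <- c("Fisheries", "Marine Biology")

# Join the lexicon terms with | to form a single alternation pattern
# (assumes the terms contain no regex metacharacters)
pattern <- paste(lexicon, collapse = "|")

# gregexpr() locates every match in each string, regmatches() extracts them,
# and lengths() counts them (0 when a string has no match)
matches <- regmatches(test.small$RA, gregexpr(pattern, test.small$RA))
test.small$Count <- lengths(matches)
```

If exact matching of whole research areas matters, splitting on ", " and using %in%, as in the question, avoids the substring issue.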
Upvotes: 1
Reputation: 25395
You could create a function and use sapply to apply that function to every row. (lapply also works, but it returns a list, so the new column would be a list rather than a numeric vector.) The following works for me, if I understood your question correctly:
test.small <- data.frame(AuthorID=c("Mavis", "Cleotha", "Yvonne"),
RA=c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries",
"Marine Biology, Marine Biology, Fisheries, Zoology"))
# count how many research areas in one author's string appear in the lexicon
frequency_counter <- function(x, lexicon)
{
x <- as.character(x)
RA.list <- strsplit(x, ", ", perl=TRUE)
count <- sum(RA.list[[1]] %in% lexicon)
return(count)
}
# apply the function to every author; the result is a plain vector of counts
lexicon <- c("Fisheries", "Marine Biology")
test.small$count <- sapply(test.small$RA, frequency_counter, lexicon = lexicon, USE.NAMES = FALSE)
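Since the question mentions four separate lexicons, the same split-and-match approach can be run once per lexicon. This is only a sketch: the lexicon names (life_science, social_science) and the way the terms are divided between them are made up for illustration, and the real lists would go in their place.

```r
# Example data from the question
test.small <- data.frame(
  AuthorID = c("Mavis", "Cleotha", "Yvonne"),
  RA = c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries",
         "Marine Biology, Marine Biology, Fisheries, Zoology"),
  stringsAsFactors = FALSE
)

# Hypothetical lexicons: names and contents are assumptions for illustration
lexicons <- list(
  life_science   = c("Fisheries", "Marine Biology", "Zoology"),
  social_science = c("Geography")
)

# Split each author's string into individual research areas
RA.list <- strsplit(test.small$RA, ", ", fixed = TRUE)

# For each lexicon, count the matching areas per author (duplicates included).
# The result is a matrix with one row per author and one column per lexicon.
counts <- sapply(lexicons, function(lex) {
  vapply(RA.list, function(areas) sum(areas %in% lex), integer(1))
})
rownames(counts) <- test.small$AuthorID
```

Each column of counts can then be pulled out as a vector, e.g. counts[, "life_science"], or the whole matrix can be attached to the data frame with cbind(test.small, counts).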
Upvotes: 1