Sum word frequency in one list based on a second list in R

Question

I need to count the frequency of occurrence of words or word phrases in a list, based on a separate source list.
I have a data frame of authors and research areas. Each author has a list of 1 or more research areas (words/word phrases) associated with their name.
Sometimes the same research area occurs more than once, and I want them counted every time (i.e., not a unique list).
I need to count the number of times an author's research areas match those in a set list of research areas.
I can do it on a per-author basis, but not for the entire list of authors.
(In actuality, there are 4 set lists, divided into research categories: life science, social science, etc., and I need to count the occurrences of research areas per author from each research category, i.e., how many life science areas are in their list, how many social science areas are in their list, etc. A simple example is below for one research category, but in the real data there are 4 separate and unique 'lexicons'.)

test.small <- data.frame(AuthorID=c("Mavis", "Cleotha", "Yvonne"), 
                     RA=c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries", 
                          "Marine Biology, Marine Biology, Fisheries, Zoology"))
RA.text <- as.character(test.small$RA)
RA.list <- strsplit(RA.text, ", ", perl=TRUE)
lexicon <- c("Fisheries", "Marine Biology")

sum(RA.list[[3]] %in% lexicon)

How do I do this for the entire list, summing the total occurrence for each author individually
and storing that numeric sum in a vector that I can use for other calculations?

Florian · Accepted Answer

You could write a function and apply it to every element of the RA column. The following works for me, if I understood your question correctly:

test.small <- data.frame(AuthorID=c("Mavis", "Cleotha", "Yvonne"), 
                         RA=c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries", 
                              "Marine Biology, Marine Biology, Fisheries, Zoology"))

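frequency_counter <- function(x, lexicon)
{
  # x is one author's comma-separated string of research areas
  x <- as.character(x)
  RA.list <- strsplit(x, ", ", perl=TRUE)
  # count every area found in the lexicon, repeats included
  count <- sum(RA.list[[1]] %in% lexicon)
  return(count)
}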

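# apply the function to every author; sapply returns a plain vector
# (lapply would return a list and create a list column)
lexicon <- c("Fisheries", "Marine Biology")
test.small$count <- sapply(test.small$RA, function(x) frequency_counter(x, lexicon), USE.NAMES=FALSE)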
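
For the four lexicons mentioned in the question, the same idea can probably be looped over a named list of lexicons. This is only a rough sketch: the lexicon names (`life_science`, `social_science`), the contents of the second lexicon, and the helper `count_matches` are all made up for illustration; only the data frame and the two-term lexicon come from the question.

```r
# Example data from the question; stringsAsFactors = FALSE keeps RA as character
test.small <- data.frame(
  AuthorID = c("Mavis", "Cleotha", "Yvonne"),
  RA = c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries",
         "Marine Biology, Marine Biology, Fisheries, Zoology"),
  stringsAsFactors = FALSE
)

# Hypothetical lexicons: the names and the second lexicon are assumptions
lexicons <- list(
  life_science   = c("Fisheries", "Marine Biology"),
  social_science = c("Geography")
)

# Split each author's string into its research areas once
RA.list <- strsplit(test.small$RA, ", ", fixed = TRUE)

# Hypothetical helper: matches per author for one lexicon, repeats included
count_matches <- function(lexicon) {
  vapply(RA.list, function(areas) sum(areas %in% lexicon), integer(1))
}

# One integer column per lexicon, bound next to the author IDs
counts <- as.data.frame(lapply(lexicons, count_matches))
result <- cbind(test.small["AuthorID"], counts)
```

Each column of `result` (for example `result$life_science`) is an ordinary integer vector, so it can be used directly in further calculations.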
