Tessa Francis

Reputation: 45

Sum word frequency in one list based on a second list in R

I need to count the frequency of occurrence of words or word phrases in a list, based on a separate source list.
I have a data frame of authors and research areas. Each author has a list of 1 or more research areas (words/word phrases) associated with their name.
Sometimes the same research area occurs more than once, and I want them counted every time (i.e., not a unique list).
I need to count the number of times an author's research areas match those in a set list of research areas.
I can do it on a per-author basis, but not for the entire list of authors.
(In actuality, there are 4 set lists, divided into research categories: life science, social science, etc. I need to count the occurrences of research areas per author from each category, i.e., how many life science areas are in their list, how many social science areas, and so on. A simple example for one research category is below, but the real data has 4 separate and unique 'lexicons'.)

test.small <- data.frame(AuthorID=c("Mavis", "Cleotha", "Yvonne"), 
                     RA=c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries", 
                          "Marine Biology, Marine Biology, Fisheries, Zoology"))
RA.text <- as.character(test.small$RA)
RA.list <- strsplit(RA.text, ", ", perl=TRUE)
lexicon <- c("Fisheries", "Marine Biology")

sum(RA.list[[3]] %in% lexicon)

How do I do this for the entire list, summing the total occurrence for each author individually
and storing that numeric sum in a vector that I can use for other calculations?

Upvotes: 0

Views: 347

Answers (2)

www

Reputation: 39154

We can use str_count from the stringr package. In the following example, test.small2 is a data frame with a column Count showing the word counts.

Notice that I added stringsAsFactors = FALSE when creating test.small so that all columns are character, not factor.

or1 is a function from the rebus package that joins the lexicon entries into a single regular expression using alternation (|).

Because str_count counts every match of that pattern in the whole string, we probably don't need to strsplit the string first.

# Create example data frame
test.small <- data.frame(AuthorID=c("Mavis", "Cleotha", "Yvonne"), 
                         RA=c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries", 
                              "Marine Biology, Marine Biology, Fisheries, Zoology"),
                         stringsAsFactors = FALSE)

# Load packages
library(dplyr)
library(stringr)
library(rebus)

# Define the lexicon
lexicon <- c("Fisheries", "Marine Biology")

# Create a new column showing the total number of words matching the lexicon
test.small2 <- test.small %>% mutate(Count = str_count(RA, or1(lexicon)))
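If installing stringr and rebus is not an option, the same counting idea can be sketched in base R. This is only an illustration under the assumption that the lexicon terms contain no regex metacharacters; the names pattern and base_count are made up for the example.

```r
# Example data from the question
test.small <- data.frame(
  AuthorID = c("Mavis", "Cleotha", "Yvonne"),
  RA = c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries",
         "Marine Biology, Marine Biology, Fisheries, Zoology"),
  stringsAsFactors = FALSE
)
lexicon <- c("Fisheries", "Marine Biology")

# Assumption: lexicon terms contain no regex metacharacters,
# so they can be joined with "|" without escaping
pattern <- paste(lexicon, collapse = "|")

# gregexpr() locates every match in each string, regmatches() extracts
# the matched text, and lengths() counts the matches per author
base_count <- lengths(regmatches(test.small$RA, gregexpr(pattern, test.small$RA)))
```

Note that any pattern-based count matches substrings, so a lexicon term such as "Biology" would also be counted inside "Marine Biology". Splitting the string and comparing with %in% matches whole entries only.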

Upvotes: 1

Florian

Reputation: 25395

You could create a function and use lapply to apply that function to every row. The following works for me, if I understood your question correctly:

test.small <- data.frame(AuthorID=c("Mavis", "Cleotha", "Yvonne"), 
                         RA=c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries", 
                              "Marine Biology, Marine Biology, Fisheries, Zoology"))

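frequency_counter <- function(x, lexicon) {
  # Coerce to character in case RA was created as a factor
  x <- as.character(x)
  # Split one author's comma-separated research areas into individual entries
  RA.list <- strsplit(x, ", ", perl=TRUE)
  # Count entries that exactly match a lexicon term (duplicates are counted)
  sum(RA.list[[1]] %in% lexicon)
}

# Apply the function to every row; unlist() turns the list returned by
# lapply into a plain vector that can be used in other calculations
lexicon <- c("Fisheries", "Marine Biology")
test.small$count <- unlist(lapply(test.small$RA, function(x) frequency_counter(x, lexicon)))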
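The question also mentions four separate lexicons. One way this split-and-match approach might extend is to keep the lexicons in a named list and loop over it, giving one column of counts per lexicon. The sketch below uses only two lexicons; the names lexicon_a and lexicon_b and the contents of lexicon_b are placeholders, not the asker's real categories.

```r
# Example data from the question
test.small <- data.frame(
  AuthorID = c("Mavis", "Cleotha", "Yvonne"),
  RA = c("Fisheries, Fisheries, Geography, Marine Biology", "Fisheries",
         "Marine Biology, Marine Biology, Fisheries, Zoology"),
  stringsAsFactors = FALSE
)

# Hypothetical lexicons: lexicon_a is the one from the question,
# lexicon_b is invented purely for illustration
lexicons <- list(
  lexicon_a = c("Fisheries", "Marine Biology"),
  lexicon_b = c("Geography", "Zoology")
)

# Split every author's research areas once
RA.list <- strsplit(test.small$RA, ", ", fixed = TRUE)

# For each lexicon, count the matching entries per author;
# sapply() binds the per-lexicon vectors into a matrix (authors x lexicons)
counts <- sapply(lexicons, function(lex) {
  vapply(RA.list, function(areas) sum(areas %in% lex), numeric(1))
})
rownames(counts) <- test.small$AuthorID
```

Each column of counts is a numeric vector, so something like counts[, "lexicon_a"] can be used directly in later calculations.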

Upvotes: 1
