Sap

Reputation: 1

How to count the occurrences of each word in a column in R?

I have a data set from a text analysis. One column shows whether any of a set of predefined terms was recognized in the text and, if so, which ones. It looks something like this (the relevant column is "funnel_term"):

sample of my data set

I want to count how many times each term in the "funnel_term" column appears. I thought of using a for loop, but it's not working as I hoped. The output I'm looking for would be something like this:

sexual - 6

racist - 4

ill - 2

Thanks in advance.

Upvotes: 0

Views: 1408

Answers (3)

Joris

Reputation: 417

You can use grep() for this. Example with minimal data set:

df <- data.frame(x = c("['Sexual']", "['Sexual']"))

length(grep("Sexual", df$x))

Or with a prettier output:

paste("Sexual - ", length(grep("Sexual", df$x)), sep="")

[1] "Sexual - 2"
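Note that grep() counts the rows that contain a match, not the individual occurrences. To get that row count for several terms at once, a minimal sketch could look like this (the extra rows and the `terms` vector are made up for illustration):

```r
# Hypothetical data: three rows, one of them holding two terms
df <- data.frame(x = c("['Sexual']", "['Sexual', 'Racist']", "['Ill']"))

# Terms to look for (an assumption; replace with your own list)
terms <- c("Sexual", "Racist", "Ill")

# For each term, count the rows in which it appears at least once.
# fixed = TRUE treats the term as a literal string, not a regex.
row_counts <- sapply(terms, function(term) sum(grepl(term, df$x, fixed = TRUE)))

cat(paste(names(row_counts), row_counts, sep = " - "), sep = "\n")
```

Because the match is a plain substring search, a term such as "Ill" would also be found inside a longer word; that is fine for the bracketed lists here but worth keeping in mind for free text.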

Or with the package dplyr:

library(dplyr)

df <- data.frame(x = c("['Sexual']", "['Sexual']"))
df %>% dplyr::count(x)

This counts whole cell values, so it doesn't work for cells that contain more than one term, for example "['Sexual', 'Religion']". In that case, split each cell into separate terms first:

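library(dplyr)
library(tidyr)    # for unnest()
library(stringr)  # for str_replace_all()

df <- data.frame(x = c("['Sexual', 'Religion']", "['Sexual']"))

# Split each cell on commas, give every term its own row,
# strip everything that is not a letter or digit, then count
df %>% mutate(x2 = strsplit(as.character(x), ",")) %>%
  unnest(x2) %>%
  mutate(x2 = str_replace_all(x2, "[^[:alnum:]]", "")) %>%
  count(x2)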
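If you would rather avoid extra packages, the same idea can be sketched in base R with strsplit() and table(). This is only an illustration on the two example cells above, and it assumes the terms themselves contain nothing but letters and digits:

```r
# Example cells in the same format as above
x <- c("['Sexual', 'Religion']", "['Sexual']")

# Drop brackets, quotes and spaces, keeping only letters, digits and commas
cleaned <- gsub("[^[:alnum:],]", "", x)

# Split every cell on commas and flatten the result into one vector of terms
terms <- unlist(strsplit(cleaned, ",", fixed = TRUE))

# Tabulate how often each term occurs, most frequent first
counts <- sort(table(terms), decreasing = TRUE)
print(counts)
```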

Upvotes: 2

elamps

Reputation: 1

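I created a sample data set similar to yours with the following code (tribble() and the functions used below come from the tidyverse, so load it first):

library(tidyverse)
sample <- tribble(~funnel_term, "['Sexual']", "['Islam', 'Religion']", "['Sexual', 'Islam']")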

Which gives you a data frame that looks like this:

  funnel_term          
  <chr>                
1 ['Sexual']           
2 ['Islam', 'Religion']
3 ['Sexual', 'Islam']  

You can get rid of the brackets and single quotes and then separate the rows so that each item in the list becomes its own row:

sample.1 <- sample %>%
  mutate(funnel_term_new = gsub("\\[|\\]|\'", "", funnel_term)) %>%
  separate_rows(funnel_term_new, sep = ", ")

Which gives you a data frame that looks like this:

  funnel_term           funnel_term_new
  <chr>                 <chr>          
1 ['Sexual']            Sexual         
2 ['Islam', 'Religion'] Islam          
3 ['Islam', 'Religion'] Religion       
4 ['Sexual', 'Islam']   Sexual         
5 ['Sexual', 'Islam']   Islam   

Now that each funnel term is in its own row, you can use simple dplyr functions to get the count of each unique term:

sample.final <- sample.1 %>% group_by(funnel_term_new) %>% summarise(n = n())

Which gives you the count for each term:

  funnel_term_new     n
  <chr>           <int>
1 Islam               2
2 Religion            1
3 Sexual              2
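To print the counts in the "term - n" format from the question, one option is dplyr's count(), which is shorthand for group_by() followed by summarise(n = n()). The sketch below rebuilds the sample data as a plain data frame so that it runs on its own; the names `terms` and `term` are placeholders chosen for this example:

```r
library(dplyr)
library(tidyr)

# Stand-in for the sample data above
terms <- data.frame(
  funnel_term = c("['Sexual']", "['Islam', 'Religion']", "['Sexual', 'Islam']"),
  stringsAsFactors = FALSE
)

counts <- terms %>%
  mutate(term = gsub("\\[|\\]|'", "", funnel_term)) %>%  # strip brackets and quotes
  separate_rows(term, sep = ", ") %>%                     # one term per row
  count(term, sort = TRUE)                                # count, largest first

# One "term - n" line per term
cat(paste(counts$term, counts$n, sep = " - "), sep = "\n")
```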

Upvotes: 0

David Moore

Reputation: 968

Do you want to count multiple occurrences of words like "racist" within a row? If so, you may want to check out the function gregexpr:

gregexpr("sexual", df$text)

This will tell you the starting position of each occurrence of "sexual" in every row of your column (-1 means there is no match in that row). To get a count of all of them, you can do:

object_1 <- gregexpr("sexual", df$text)

# gregexpr() returns -1 for rows without a match, so count only positive positions
sum(sapply(object_1, function (x) {
  sum(x > 0)
}))

If you want to find words like "sexual" but not words like "asexual" or "sexually", add word boundaries (\\b) to the regular expression. Use

gregexpr("\\bsexual\\b", df$text)

instead of

gregexpr("sexual", df$text)

To get your desired output, you would do:

original_funnel_terms <- c("sexual", "racist", "ill")
# Wrap each term in word boundaries so that only whole words match
funnel_terms <- paste0("\\b", original_funnel_terms, "\\b")
output_1 <- sapply(seq_along(funnel_terms), function (z) {
  # Count the matches in each row (gregexpr() returns -1 when there are none)
  sum(sapply(gregexpr(funnel_terms[z], df$text), function (x) {
    if (x[1] == -1) {
      0L
    } else {
      length(x)
    }
  }))
})
names(output_1) <- original_funnel_terms
output_2 <- paste(names(output_1), " - ", as.character(output_1), sep = "")
cat(output_2, sep = "\n")
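The same counting can be written more compactly with regmatches() and lengths(). The sketch below is only an illustration: the `text` vector is invented to stand in for df$text, and ignore.case = TRUE is an assumption added so that "Sexual" and "sexual" are counted together.

```r
# Hypothetical free-text column standing in for df$text
text <- c("Sexual and asexual content", "racist, sexual, sexually", "ill will")
terms <- c("sexual", "racist", "ill")

# Count whole-word, case-insensitive occurrences of one term across all rows
count_term <- function(term, x) {
  pattern <- paste0("\\b", term, "\\b")
  matches <- gregexpr(pattern, x, ignore.case = TRUE)
  # regmatches() returns the matched strings per row (none for rows without a match)
  sum(lengths(regmatches(x, matches)))
}

counts <- vapply(terms, count_term, integer(1), x = text)

cat(paste(names(counts), counts, sep = " - "), sep = "\n")
```

Here "asexual", "sexually" and "will" are skipped because of the word boundaries.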

Upvotes: 0
