Reputation: 1
I have a data set from a text analysis. One column shows whether one of a set of predefined terms was recognized and, if so, the term itself. It looks somewhat like this (the relevant column is "funnel_term"):
I want to count how many times each of the terms in "funnel_term" appears. I thought of a for loop, but it's not working as I wished. The output I'm looking for would be something like this:
sexual - 6
racist - 4
ill - 2
Thanks in advance.
Upvotes: 0
Views: 1408
Reputation: 417
You can use grep() for this. Example with a minimal data set:
df <- data.frame(x = c("['Sexual']", "['Sexual']"))
length(grep("Sexual", df$x))
Or with a prettier output:
paste("Sexual - ", length(grep("Sexual", df$x)), sep="")
[1] "Sexual - 2"
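To cover several terms at once, the same idea could be wrapped in a loop over a vector of terms. This is only a sketch: the `terms` vector and the sample data are made up for illustration. Note that it counts the rows containing each term, not repeated occurrences within a row.

```r
# Hypothetical sample data and term list, for illustration only
df <- data.frame(x = c("['Sexual']", "['Sexual', 'Religion']", "['Religion']"))
terms <- c("Sexual", "Religion", "Islam")

# For each term, count the rows of df$x that contain it (literal match)
counts <- vapply(terms, function(term) sum(grepl(term, df$x, fixed = TRUE)),
                 integer(1))

# Print one "term - count" line per term
cat(paste0(names(counts), " - ", counts), sep = "\n")
```

Terms that never appear (here "Islam") are reported with a count of 0.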
Or with the package dplyr:
library(dplyr)
df <- data.frame(x = c("['Sexual']", "['Sexual']"))
df %>% dplyr::count(x)
This doesn't work for cells with two terms, for example "['Sexual', 'Religion']", because the whole cell is counted as one value. So we need this:
library(dplyr)
library(tidyr)    # for unnest()
library(stringr)  # for str_replace_all()
df <- data.frame(x = c("['Sexual', 'Religion']", "['Sexual']"))
df %>% mutate(x2 = strsplit(as.character(x), ",")) %>%
unnest(x2) %>%
mutate(x2 = str_replace_all(x2, "[^[:alnum:]]", "")) %>%
count(x2)
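If you would rather not load extra packages, roughly the same split-and-count idea can be sketched in base R. This is an alternative to the pipeline above, using the same made-up two-row example.

```r
df <- data.frame(x = c("['Sexual', 'Religion']", "['Sexual']"))

# Split each cell on commas, then flatten into one character vector
pieces <- unlist(strsplit(as.character(df$x), ","))

# Strip brackets, quotes and spaces so only the term itself remains
pieces <- gsub("[^[:alnum:]]", "", pieces)

# Tabulate how often each term occurs
counts <- table(pieces)
print(counts)
```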
Upvotes: 2
Reputation: 1
I created a sample data set similar to yours with the following code:
library(tidyverse)  # provides tribble(), the dplyr verbs and tidyr::separate_rows()
sample <- tribble(~funnel_term, "['Sexual']", "['Islam', 'Religion']", "['Sexual', 'Islam']")
Which gives you a data frame that looks like this:
funnel_term
<chr>
1 ['Sexual']
2 ['Islam', 'Religion']
3 ['Sexual', 'Islam']
You can get rid of the brackets and single quotes and then separate the rows so that each item in the list becomes its own row:
sample.1 <- sample %>% mutate(funnel_term_new = gsub("\\[|\\]|\'", "", funnel_term)) %>% separate_rows(funnel_term_new, sep = ", ")
Which gives you a data frame that looks like this:
funnel_term funnel_term_new
<chr> <chr>
1 ['Sexual'] Sexual
2 ['Islam', 'Religion'] Islam
3 ['Islam', 'Religion'] Religion
4 ['Sexual', 'Islam'] Sexual
5 ['Sexual', 'Islam'] Islam
Now that all of the funnel terms are in their own rows, you can use simple dplyr functions to get the count of each unique funnel term:
sample.final <- sample.1 %>% group_by(funnel_term_new) %>% summarise(n = n())
funnel_term_new n
<chr> <int>
1 Islam 2
2 Religion 1
3 Sexual 2
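To print those counts in the "term - count" form the question asks for, something like the following base R sketch might do. To keep it self-contained, the data frame is rebuilt by hand from the summarised output above instead of being computed.

```r
# Counts copied from the summarised output above
sample.final <- data.frame(funnel_term_new = c("Islam", "Religion", "Sexual"),
                           n = c(2L, 1L, 2L))

# Order by descending count, then build one "term - n" line per row
ordered <- sample.final[order(-sample.final$n), ]
out_lines <- paste(ordered$funnel_term_new, ordered$n, sep = " - ")
cat(out_lines, sep = "\n")
```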
Upvotes: 0
Reputation: 968
Do you want to count multiple occurrences of words like "racist" within a row? If so, you may want to check out the function gregexpr:
gregexpr("sexual", df$text)
This will tell you the starting position of each occurrence of the word "sexual" in your column (or -1 for rows without a match). To get a count of all of them, you can do:
object_1 <- gregexpr("sexual", df$text)
# gregexpr() returns -1 for an element with no match,
# so count only the real (positive) match positions
sum(sapply(object_1, function (x) {
  sum(x > 0)
}))
If you want to find words like "sexual" but not words like "asexual" or "sexually", you should add word boundaries (\\b) to the regular expression. Use
gregexpr("\\bsexual\\b", df$text)
instead of
gregexpr("sexual", df$text)
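A quick way to see the difference is to try both patterns on a few strings. The strings below are made up for illustration.

```r
# Made-up strings, for illustration only
text <- c("sexual", "asexual", "sexually", "a sexual remark")

# Without word boundaries the pattern also matches inside longer words
plain <- grepl("sexual", text)

# \\b anchors the match to a word boundary on both sides
bounded <- grepl("\\bsexual\\b", text)

print(data.frame(text, plain, bounded))
```

Only the first and last strings should match the bounded pattern, while the plain pattern matches all four.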
To get your desired output, you would do:
original_funnel_terms <- c("sexual", "racist", "ill")
funnel_terms <- paste0("\\b", original_funnel_terms, "\\b")
output_1 <- sapply(seq_len(length(funnel_terms)), function (z) {
sum(sapply(sapply(gregexpr(funnel_terms[z], df$text), function (x) {
if (x[1] == -1) {
y <- NULL
} else {
y <- x
}
y
}), length))
})
names(output_1) <- original_funnel_terms
output_2 <- paste(names(output_1), " - ", as.character(output_1), sep = "")
cat(output_2, sep = "\n")
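A more compact variant of the same idea could use regmatches() together with lengths(). This is a sketch only: the `df$text` contents and the helper name `count_term` are made up for illustration.

```r
# Hypothetical text column, for illustration only
df <- data.frame(text = c("sexual and racist", "ill, ill and sexual", "nothing here"),
                 stringsAsFactors = FALSE)
terms <- c("sexual", "racist", "ill")

# Count every whole-word occurrence of one term across all rows
count_term <- function(term, text) {
  hits <- gregexpr(paste0("\\b", term, "\\b"), text)
  # regmatches() returns character(0) for rows without a match
  sum(lengths(regmatches(text, hits)))
}

counts <- vapply(terms, count_term, integer(1), text = df$text)
cat(paste(names(counts), counts, sep = " - "), sep = "\n")
```

Because regmatches() already drops the -1 "no match" markers, no special handling for unmatched rows is needed.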
Upvotes: 0