Reputation: 539
I'm trying to turn a character vector of terms into count variables, using a function that runs str_count() for each term against a dataframe of text, and I'm not sure how to do this.
Given a vector like:
variablenames <- c("strong","weak","happy","sad")
and a dataframe of text such as:
library(tidyverse)
textdf <- as.data.frame("Happy was a dwarf who was perpetually sad.") %>% rename(text = 1)
I think I want something like this:
countstring_fn <- function(variablenames,textdf){
for(term in variablenames){
paste0(term,"count") <- str_count(term,textdf)
}
}
But I'm pretty sure that doesn't work. The intended output is:
text,strongcount,weakcount,happycount,sadcount
"Happy was a dwarf who was perpetually sad.",0,0,1,1
Has anyone done something like this and made it work?
Upvotes: 0
Views: 1531
Reputation: 518
Here's another way.
library(tidyverse)
variablenames <- c("strong", "weak", "happy", "sad")
textdf <- tibble(
  text = c(
    '"Happy was a dwarf who was perpetually sad."',
    '"If you\'re strong, you\'re not weak."'
  )
)
textdf[, str_c(variablenames, 'count')] <- do.call(
  rbind,
  lapply(
    textdf$text,
    function(df) {
      str_count(toupper(df), toupper(variablenames))
    }
  )
)
invisible(
  apply(
    textdf,
    1,
    function(vec) {
      cat(str_c(str_c(vec, collapse = ','), '\n'))
    }
  )
)
The main difference here is that the strings in the textdf dataframe come wrapped in double quotes (if you're importing data from a .csv, you can call str_c('"', textdf$text, '"') for the same effect). Then, we convert all of the text and patterns to uppercase so that matches are found regardless of case. Lastly, we call str_count() to get an integer vector of counts for each row, which we assign to new columns by supplying the desired column names.
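The case-insensitive counting step can also be wrapped in a function, which is closer to what the question asked for. This is a minimal sketch, assuming stringr is installed; count_terms is a hypothetical helper name that does not appear in the answer above, and the terms are assumed to contain no regex special characters.

```r
library(stringr)

variablenames <- c("strong", "weak", "happy", "sad")

# Hypothetical helper: count each term in each text, ignoring case.
# Returns an integer matrix with one row per text and one column per term.
count_terms <- function(texts, terms) {
  counts <- vapply(
    texts,
    function(txt) str_count(toupper(txt), toupper(terms)),
    integer(length(terms)),
    USE.NAMES = FALSE
  )
  # vapply() yields one text per column; force a matrix, then transpose
  # so that each text becomes a row
  counts <- t(matrix(counts, nrow = length(terms)))
  colnames(counts) <- paste0(terms, "count")
  counts
}

counts <- count_terms("Happy was a dwarf who was perpetually sad.", variablenames)
print(counts)
```

The resulting matrix could then be attached to the text column with something like cbind(textdf, counts).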
The anonymous function passed to apply() then prints each row of the data frame to the console (apply() is more compact than an explicit for loop, although it still iterates over the rows one at a time):
"Happy was a dwarf who was perpetually sad.",0,0,1,1
"If you're strong, you're not weak.",1,1,0,0
We first use str_c() for its collapsing ability. In other words, we concatenate the strings in all five columns of a row into one string with , as the delimiter. Then, for cat(), we append a line break (\n) to the end of each "row string" using str_c() again. Finally, we call cat() to display the strings in the console so that special characters, such as ", are not accompanied by an escape character (\). The apply() call is wrapped with invisible() to suppress the NULL that apply() returns, which would otherwise be printed when it is called interactively.
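The collapse-then-cat() step can be tried on a single row in isolation. A small sketch, assuming stringr is installed; row_values and row_string are illustrative names, not part of the answer's code.

```r
library(stringr)

# One "row" as a character vector: the quoted text followed by its counts
row_values <- c('"Happy was a dwarf who was perpetually sad."', "0", "0", "1", "1")

# Collapse the five values into a single comma-delimited string
row_string <- str_c(row_values, collapse = ",")

# print() shows the embedded double quotes escaped with a backslash
print(row_string)

# cat() writes the raw characters; the appended "\n" ends the line
cat(str_c(row_string, "\n"))
```

Comparing the two outputs shows why cat() is used for the final display rather than print().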
Upvotes: 2
Reputation: 4505
Yet another way:
library(tidyverse)
t(sapply(dat$strgs, str_count, pattern = coll(patts, ignore_case = TRUE, locale = 'en'))) %>%
  data.frame %>%
  set_names(., patts) %>%
  bind_cols(dat, .)
# strgs strength ignorance present future collapse
# 1 War Is Peace, Freedom Is Slavery... 1 1 0 0 0
# 2 Who controls the past controls t... 0 0 1 1 0
# 3 The collapse of the USSR was the... 0 0 0 0 1
Data:
patts <- c("strength", "ignorance", "present", "future", "collapse")
dat <- data.frame(
strgs = c(
"War Is Peace, Freedom Is Slavery, and Ignorance Is Strength.",
"Who controls the past controls the future: who controls the present controls the past.",
"The collapse of the USSR was the greatest geopolitical catastrophe of the century."
)
)
Upvotes: 1
Reputation: 389235
We can convert the text to lower case, count the occurrences of each of the variablenames in each text, and return the counts as a comma-separated string. We add word boundaries (\\b) around each of the variablenames to avoid matching "sad" with "saddened". We can then separate() the data into different columns.
library(tidyverse)
textdf %>%
  mutate(count = map_chr(tolower(text), function(x)
    toString(map_int(paste0("\\b", variablenames, "\\b"), ~str_count(x, .x))))) %>%
  separate(count, into = paste0(variablenames, "_count"), sep = ",", convert = TRUE)
# text strong_count weak_count happy_count sad_count
#1 Happy was a dwarf who was perpetually sad. 0 0 1 1
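The effect of the word boundaries can be seen on a sentence that contains both "sad" and "saddened". A short sketch, assuming stringr is installed; the example sentence and the variable names are made up for illustration.

```r
library(stringr)

text <- tolower("Sad news: she was saddened, then sad again.")

# Plain pattern: also matches the "sad" inside "saddened"
plain_count <- str_count(text, "sad")

# Word-boundary pattern: matches only the standalone word
bounded_count <- str_count(text, "\\bsad\\b")

print(c(plain = plain_count, bounded = bounded_count))
```

The plain pattern reports one more match than the bounded one, because of the "sad" inside "saddened".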
Upvotes: 1
Reputation: 1305
# added second row to show output of function
textdf <- structure(list(text = c("Happy was a dwarf who was perpetually sad.",
"Sad was a dwarf who was perpetually sad.")), row.names = c(NA,
-2L), class = "data.frame")
# counting the occurrences of words in 'variablenames'
pmap_df(
  textdf, function(text) {
    map(variablenames, ~ str_count(tolower(text), pattern = .)) %>%
      t %>% as.data.frame
  }
) %>%
  setNames(variablenames) %>%
  bind_cols(textdf, .)
# Leaves you with a data frame with counts for each word as columns.
text strong weak happy sad
1 Happy was a dwarf who was perpetually sad. 0 0 1 1
2 Sad was a dwarf who was perpetually sad. 0 0 0 2
Upvotes: 1