Christopher Penn

Reputation: 539

How to convert character vector into variable names and str_count?

I'm trying to write a function that turns each term in a character vector into a new column holding the str_count of that term against a text dataframe, and I'm not sure how to do this.

Given a vector like:

variablenames <- c("strong","weak","happy","sad")

and a dataframe of text such as:

library(tidyverse)
textdf <- as.data.frame("Happy was a dwarf who was perpetually sad.") %>% rename(text = 1)

I think I want something like this:

countstring_fn <- function(variablenames,textdf){
for(term in variablenames){
paste0(term,"count") <- str_count(term,textdf)
}
}

But I'm pretty sure that doesn't work. The intended output is:

text,strongcount,weakcount,happycount,sadcount
"Happy was a dwarf who was perpetually sad.",0,0,1,1

Has anyone done something like this and made it work?

Upvotes: 0

Views: 1531

Answers (4)

Benjamin Ye

Reputation: 518

Here's another way.

library(tidyverse)
variablenames <- c("strong", "weak", "happy", "sad")
textdf <- tibble(
  text = c(
    '"Happy was a dwarf who was perpetually sad."',
    '"If you\'re strong, you\'re not weak."'
  )
)
textdf[, str_c(variablenames, 'count')] <- do.call(
  rbind, 
  lapply(
    textdf$text, 
    function(df) { 
      str_count(toupper(df), toupper(variablenames)) 
    }
  )
)
invisible(
  apply(
    textdf, 
    1, 
    function(vec) {
      cat(str_c(str_c(vec, collapse = ','), '\n'))
    }
  )
)

The main difference here is that the strings in the textdf data frame come wrapped in double quotes (if you're importing data from a .csv, you can call str_c('"', textdf$text, '"') for the same effect). We then convert both the text and the patterns to uppercase so that matching is case-insensitive. Lastly, str_count() returns an integer vector of counts for each row; do.call(rbind, ...) stacks these vectors into a matrix, which we assign to the new columns by supplying the desired column names.
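As an alternative to upper-casing both sides, stringr's regex() modifier accepts ignore_case = TRUE. The following is a minimal sketch for a single string, not part of the code above; the variable names counts and text are chosen here for illustration.

```r
library(stringr)

variablenames <- c("strong", "weak", "happy", "sad")
text <- "Happy was a dwarf who was perpetually sad."

# str_count() is vectorised over `pattern`, so one string matched against
# four patterns yields four counts. ignore_case = TRUE replaces toupper().
counts <- str_count(text, regex(variablenames, ignore_case = TRUE))

# Label each count with the column name it would get in the data frame
names(counts) <- str_c(variablenames, "count")
counts
```

Note that regex() treats each term as a regular expression, so terms containing characters such as . or ( would need escaping or a different modifier.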

The final apply() call then prints each row of the data frame to the console (apply() is a compact way to loop over rows, although it is not necessarily faster than a for loop):

"Happy was a dwarf who was perpetually sad.",0,0,1,1
"If you're strong, you're not weak.",1,1,0,0

We first use str_c() for its collapsing ability. In other words, we can concatenate the strings in all five columns of a row into one string with , as the delimiter. Then, for cat(), we need to append a line break (\n) at the end of each "row string" using str_c() again. Finally, we can call cat() to display the strings in the console with special characters, such as ", not being accompanied by an escape character (\). The whole apply() call is wrapped in invisible() to suppress the NULL that apply() returns, which would otherwise be printed at the end when the code is run interactively.
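The same row strings can also be built without apply(). The sketch below uses base R only and assumes a small hand-written data frame named out that mirrors the result above; it is an illustration, not the answer's own method. One reason to consider it: apply() first coerces the data frame to a character matrix, which can pad numbers to a common width once counts have different numbers of digits.

```r
# Hypothetical data frame with the same shape as the result above
out <- data.frame(
  text = c('"Happy was a dwarf who was perpetually sad."',
           '"If you\'re strong, you\'re not weak."'),
  strongcount = c(0L, 1L),
  weakcount   = c(0L, 1L),
  happycount  = c(1L, 0L),
  sadcount    = c(1L, 0L),
  stringsAsFactors = FALSE
)

# paste() works element-wise across the columns, so each row becomes
# one comma-separated string
row_strings <- do.call(paste, c(out, sep = ","))

# writeLines() prints one string per line, with quotes shown unescaped
writeLines(row_strings)
```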

Upvotes: 2

utubun

Reputation: 4505

Yet another way:

library(tidyverse)

t(sapply(dat$strgs, str_count, pattern = coll(patts, ignore_case = TRUE, locale = 'en'))) %>%
  data.frame %>%
  set_names(., patts) %>%
  bind_cols(dat, .)

#   strgs                                strength ignorance present future collapse
# 1 War Is Peace, Freedom Is Slavery...  1        1         0       0      0
# 2 Who controls the past controls t...  0        0         1       1      0
# 3 The collapse of the USSR was the...  0        0         0       0      1

Data:

patts <- c("strength", "ignorance", "present", "future", "collapse")

dat <- data.frame(
  strgs = c(
    "War Is Peace, Freedom Is Slavery, and Ignorance Is Strength.",
    "Who controls the past controls the future: who controls the present controls the past.",
    "The collapse of the USSR was the greatest geopolitical catastrophe of the century."
  )
)

Upvotes: 1

Ronak Shah

Reputation: 389235

We can convert the text to lower case, count the occurrences of each term in variablenames in each text, and return the counts as a comma-separated string. We add word boundaries (\\b) around each term to avoid matching "sad" inside "saddened". We can then separate the string into different columns:

library(tidyverse)

textdf %>%
   mutate(count = map_chr(tolower(text), function(x) 
    toString(map_int(paste0("\\b",variablenames,"\\b"), ~str_count(x, .x))))) %>%
  separate(count, into = paste0(variablenames, "_count"), sep = ",", convert = TRUE)

#                                        text strong_count weak_count happy_count sad_count
#1 Happy was a dwarf who was perpetually sad.            0          0           1         1
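To see what the word boundaries change, here is a small sketch on a made-up sentence (the string x is an assumption for illustration, not data from the question):

```r
library(stringr)

x <- "she was saddened, not sad."

# Without boundaries, the "sad" inside "saddened" is counted as well
plain_count <- str_count(x, "sad")

# With \\b on both sides, only the standalone word matches
bounded_count <- str_count(x, "\\bsad\\b")

plain_count    # 2
bounded_count  # 1
```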

Upvotes: 1

rpolicastro

Reputation: 1305

# added second row to show output of function

textdf <- structure(list(text = c("Happy was a dwarf who was perpetually sad.",
"Sad was a dwarf who was perpetually sad.")), row.names = c(NA,
-2L), class = "data.frame")

# counting the occurrences of words in 'variablenames'

pmap_df(
  textdf, function(text) {
    map(variablenames, ~ str_count(tolower(text), pattern = .)) %>%
    t %>% as.data.frame
  }
) %>%
  setNames(variablenames) %>%
  bind_cols(textdf, .)

# Leaves you with a data frame with counts for each word as columns.

                                        text strong weak happy sad
1 Happy was a dwarf who was perpetually sad.      0    0     1   1
2   Sad was a dwarf who was perpetually sad.      0    0     0   2
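
If a function shaped like the countstring_fn from the question is preferred, a minimal sketch could look like the following. It assumes the data frame has a character column named text and that matching should ignore case; the column-naming scheme follows the question's intended output.

```r
library(stringr)

# Sketch of a helper: adds one "<term>count" column per term
countstring_fn <- function(variablenames, textdf) {
  for (term in variablenames) {
    # [[<- creates a column whose name is built at run time, which is
    # what paste0(term, "count") <- ... was trying to do
    textdf[[paste0(term, "count")]] <- str_count(
      textdf$text,
      regex(term, ignore_case = TRUE)
    )
  }
  textdf  # return the modified copy; the caller's data frame is untouched
}

variablenames <- c("strong", "weak", "happy", "sad")
textdf <- data.frame(
  text = "Happy was a dwarf who was perpetually sad.",
  stringsAsFactors = FALSE
)
result <- countstring_fn(variablenames, textdf)
result
```

Because R copies on modification, the function has to return textdf and the caller has to capture the result.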


Upvotes: 1
