Reputation: 131
EDIT: I have a data frame wherein column 1 has id on some texts and column 2 has the text itself as strings. I have a set of multiple words and the task is to have stringr
count the number of times each word is found within the texts. The words are to be supplied as fixed, not as regex.
Two problems stand out:
(1) How to supply a vector containing multiple words as a fixed (not regex) pattern?
(2) How to append the findings to the data frame?
(3) How to do it for very large data?
A previous answer from user @akrun answered points (1) and (2), but (3) is still an issue. Here is a reproducible example.
## create a very large data.frame with the text column to be analyzed
doc_number <- c()
doc_text <- c()
for(i in 1:60000){
# generate many random strings mentioning 'proposals'
doc_number[i] <- paste0("doc_",i)
set.seed(i+3)
doc_text[i] <- paste0("This is about proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")",
" and about proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")")
}
docs_example_df <- data.frame(doc_number, doc_text)
head(docs_example_df) # resulting df has 'doc_text' column which mentions proposals
> head(docs_example_df)
doc_number doc_text
1 doc_1 This is about proposal (6623/k) and about proposal (3866/c)
2 doc_2 This is about proposal (3254/k) and about proposal (2832/u)
3 doc_3 This is about proposal (7964/j) and about proposal (1940/n)
4 doc_4 This is about proposal (8582/g) and about proposal (3753/o)
5 doc_5 This is about proposal (4254/b) and about proposal (5686/l)
6 doc_6 This is about proposal (2588/f) and about proposal (9786/c)
# create a very large vector of 'proposals' I want to extract from doc_text
my_proposals <- c()
for(i in 1:20000){
set.seed(i+8)
my_proposals[i] <- paste0("proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")")
}
head(my_proposals) # long list of 'proposals' I wish to locate
> head(my_proposals)
[1] "proposal (2588/f)" "proposal (1490/i)" "proposal (2785/b)" "proposal (5545/z)" "proposal (6988/j)" "proposal (1264/i)"
The previous answer by @akrun (see below) recommended several solutions which worked for a small data.frame. But in such >20k objects the functions either jam or give error such as:
Problem with mutate() input matches. x Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
So, in short, how to apply a very long list of vectors to an also very long data.frame and store extracted matches in something like a column-list in the data.frame?
Thanks everyone
Upvotes: 4
Views: 1239
Reputation: 887068
We could paste
them together and wrap it in regex
instead of fixed
. In dplyr
1.0.0, there are multiple functions introduced and one of them is across
library(dplyr) #1.0.0
library(stringr)
test_df %>%
mutate(matches = str_extract_all(text,
pattern = regex(str_c(keywords, collapse = "|"))))
If we need the final expected output, after creating the list
column in matches
, unnest
to expand the rows, get the count
and reshape it to 'wide' format with pivot_wider
library(tidyr)
test_df %>%
mutate(matches = str_extract_all(test_df$text, pattern = regex(str_c(keywords, collapse = "|")))) %>%
unnest(c(matches)) %>%
count(across(doc_id:matches)) %>%
pivot_wider(names_from = matches, values_from = n, values_fill = list(n = 0))
# A tibble: 4 x 6
# doc_id text water alcohol gasoline h2o
# <chr> <chr> <int> <int> <int> <int>
#1 doc1 This text refers to water 1 0 0 0
#2 doc2 This text refers to water and alcohol 1 1 0 0
#3 doc4 This text refers to gasoline and more gasoline 0 0 2 0
#4 doc5 This text refers to (h2o) 0 0 0 1
If we have a dplyr
< 1.0.0, instead of across
just specify the names of the columns in count
... %>%
count(doc_id, text, matches)
... %>%
Or convert the column names to symbols and evaluate
... %>%
count(!!! rlang::syms(names(.)))
... %>%
In the above method 'doc3' is removed as there were no matches. If we need to keep it, specify the keep_empty = TRUE
in unnest
test_df %>%
mutate(matches = str_extract_all(test_df$text,
pattern = regex(str_c(keywords, collapse = "|")))) %>%
unnest(c(matches), keep_empty = TRUE) %>%
count(across(doc_id:matches)) %>%
mutate(n = replace(n, is.na(matches), 0)) %>%
pivot_wider(names_from = matches, values_from = n, values_fill = list(n = 0)) %>%
select(-`NA`)
# A tibble: 5 x 6
# doc_id text water alcohol gasoline h2o
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 doc1 This text refers to water 1 0 0 0
#2 doc2 This text refers to water and alcohol 1 1 0 0
#3 doc3 This text refers to alcoolh 0 0 0 0
#4 doc4 This text refers to gasoline and more gasoline 0 0 2 0
#5 doc5 This text refers to (h2o) 0 0 0 1
In addition to the above method, an easier option is to use str_count
library(purrr)
map_dfc(set_names(keywords, keywords), ~
str_count(test_df$text, .x)) %>%
bind_cols(test_df, .)
# doc_id text water alcohol gasoline (h2o)
#1 doc1 This text refers to water 1 0 0 0
#2 doc2 This text refers to water and alcohol 1 1 0 0
#3 doc3 This text refers to alcoolh 0 0 0 0
#4 doc4 This text refers to gasoline and more gasoline 0 0 2 0
#5 doc5 This text refers to (h2o) 0 0 0 1
Or using base R
test_df[keywords] <- lapply(keywords, function(x)
lengths(regmatches(test_df$text, gregexpr(x, test_df$text))))
Although str_extract
is vectorized for pattern
, it would be in such a way that the pattern
length would be the same as the column length and it would be doing a correspoinding extraction
Upvotes: 3