Rafael
Rafael

Reputation: 131

Stringr: extracting all matches from strings in a data.frame column. Data.frame and vector of searched strings very large (>10k)

EDIT: I have a data frame wherein column 1 has id on some texts and column 2 has the text itself as strings. I have a set of multiple words and the task is to have stringr count the number of times each word is found within the texts. The words are to be supplied as fixed, not as regex.
Two problems stand out:
(1) How to supply a vector containing multiple words as a fixed (not regex) pattern?
(2) How to append the findings to the data frame?
(3) How to do it for very large data?

A previous answer from user @akrun answered points (1) and (2), but (3) is still an issue. Here is a reproducible example.

## create a very large data.frame with the text column to be analyzed
doc_number <- c()
doc_text <- c()

for(i in 1:60000){

# generate many random strings mentioning 'proposals'
doc_number[i] <- paste0("doc_",i)
set.seed(i+3)
doc_text[i] <- paste0("This is about proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")",
                      " and about proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")")

}
docs_example_df <- data.frame(doc_number, doc_text)

head(docs_example_df) # resulting df has 'doc_text' column which mentions proposals
> head(docs_example_df)
  doc_number                                                    doc_text
1      doc_1 This is about proposal (6623/k) and about proposal (3866/c)
2      doc_2 This is about proposal (3254/k) and about proposal (2832/u)
3      doc_3 This is about proposal (7964/j) and about proposal (1940/n)
4      doc_4 This is about proposal (8582/g) and about proposal (3753/o)
5      doc_5 This is about proposal (4254/b) and about proposal (5686/l)
6      doc_6 This is about proposal (2588/f) and about proposal (9786/c)


# create a very large vector of 'proposals' I want to extract from doc_text
my_proposals <- c()

for(i in 1:20000){

  set.seed(i+8)
  my_proposals[i] <- paste0("proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")")

}

head(my_proposals) # long list of 'proposals' I wish to locate
> head(my_proposals)
[1] "proposal (2588/f)" "proposal (1490/i)" "proposal (2785/b)" "proposal (5545/z)" "proposal (6988/j)" "proposal (1264/i)"

The previous answer by @akrun (see below) recommended several solutions which worked for a small data.frame. But in such >20k objects the functions either jam or give error such as:

Problem with mutate() input matches. x Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

So, in short, how to apply a very long list of vectors to an also very long data.frame and store extracted matches in something like a column-list in the data.frame?
Thanks everyone

Upvotes: 4

Views: 1239

Answers (1)

akrun
akrun

Reputation: 887068

We could paste them together and wrap it in regex instead of fixed. In dplyr 1.0.0, there are multiple functions introduced and one of them is across

library(dplyr) #1.0.0
library(stringr)
test_df %>%
  mutate(matches = str_extract_all(text,
                pattern = regex(str_c(keywords, collapse = "|"))))

If we need the final expected output, after creating the list column in matches, unnest to expand the rows, get the count and reshape it to 'wide' format with pivot_wider

library(tidyr)
test_df %>%
   mutate(matches = str_extract_all(test_df$text, pattern = regex(str_c(keywords, collapse = "|")))) %>% 
   unnest(c(matches)) %>% 
   count(across(doc_id:matches)) %>% 
   pivot_wider(names_from = matches, values_from = n, values_fill = list(n = 0))
# A tibble: 4 x 6
#  doc_id text                                           water alcohol gasoline   h2o
#  <chr>  <chr>                                          <int>   <int>    <int> <int>
#1 doc1   This text refers to water                          1       0        0     0
#2 doc2   This text refers to water and alcohol              1       1        0     0
#3 doc4   This text refers to gasoline and more gasoline     0       0        2     0
#4 doc5   This text refers to (h2o)                          0       0        0     1

If we have a dplyr < 1.0.0, instead of across just specify the names of the columns in count

... %>%
count(doc_id, text, matches)
... %>%

Or convert the column names to symbols and evaluate

 ... %>%
   count(!!! rlang::syms(names(.)))
... %>%

 

In the above method 'doc3' is removed as there were no matches. If we need to keep it, specify the keep_empty = TRUE in unnest

test_df %>%
    mutate(matches = str_extract_all(test_df$text, 
          pattern = regex(str_c(keywords, collapse = "|")))) %>% 
    unnest(c(matches), keep_empty = TRUE) %>% 
    count(across(doc_id:matches)) %>% 
    mutate(n = replace(n, is.na(matches), 0)) %>% 
    pivot_wider(names_from = matches, values_from = n, values_fill = list(n = 0)) %>%
    select(-`NA`)
# A tibble: 5 x 6
#  doc_id text                                           water alcohol gasoline   h2o
#  <chr>  <chr>                                          <dbl>   <dbl>    <dbl> <dbl>
#1 doc1   This text refers to water                          1       0        0     0
#2 doc2   This text refers to water and alcohol              1       1        0     0
#3 doc3   This text refers to alcoolh                        0       0        0     0
#4 doc4   This text refers to gasoline and more gasoline     0       0        2     0
#5 doc5   This text refers to (h2o)                          0       0        0     1

In addition to the above method, an easier option is to use str_count

library(purrr)
map_dfc(set_names(keywords, keywords), ~ 
      str_count(test_df$text, .x)) %>% 
   bind_cols(test_df, .)
#  doc_id                                           text water alcohol gasoline (h2o)
#1   doc1                      This text refers to water     1       0        0     0
#2   doc2          This text refers to water and alcohol     1       1        0     0
#3   doc3                    This text refers to alcoolh     0       0        0     0
#4   doc4 This text refers to gasoline and more gasoline     0       0        2     0
#5   doc5                      This text refers to (h2o)     0       0        0     1

Or using base R

test_df[keywords] <-  lapply(keywords, function(x) 
        lengths(regmatches(test_df$text, gregexpr(x, test_df$text))))

Although str_extract is vectorized for pattern, it would be in such a way that the pattern length would be the same as the column length and it would be doing a correspoinding extraction

Upvotes: 3

Related Questions