aterhorst

Reputation: 684

extract substrings of text between two repeating strings

I have a dataframe created using readtext(). It has two columns: doc_id and text. For each row (doc_id) I want to extract a substring (in my case, the name of a government department) that sits between two strings, both of which are repeated n times in the text column. For example:

documents <- data.frame(doc_id = c("doc_1", "doc_2"),
                        text = c("PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 2 Department of Forestry \n Matters \n Blah blah blah", "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 3 Department of Health \n Matters \n Blah blah blah \n PART 5 Department of Sport \n Matters \n Blah blah"))

What I would like to get is:

"doc_1"  "Department of Communications, Department of Forestry"
"doc_2"  "Department of Communications, Department of Health, Department of Sport"

Essentially, I want to extract the string between each PART and the Matters that follows it. I would like to use dplyr::rowwise operations on the dataframe, but I have no idea how to extract multiple matches between two repeated strings.

Upvotes: 2

Views: 1191

Answers (3)

Lunalo John

Reputation: 335

# Load the tidyverse (provides stringr and dplyr)
library(tidyverse)

# Use a helper variable to store the departments extracted with the pattern
Helper <- str_extract_all(string = documents$text, pattern = "Department.*\\n")

# Clean up: turn each trailing " \n" into ", ", join the pieces per document, and drop the final comma
Helper1 <- lapply(Helper, FUN = str_replace_all, pattern = " \\n", replacement = ", ")
documents$Departments <- str_replace(str_trim(unlist(lapply(Helper1, FUN = paste, collapse = ""))), pattern = ",$", replacement = "")

# Remove the original text column
documents <- select(documents, -c("text"))

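This yields a data frame with two columns, doc_id and Departments, where Departments holds the comma-separated department names for each document.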

Upvotes: 0

Ronak Shah

Reputation: 389275

We can use str_match_all from stringr to extract the words between "PART" and "Matters". It returns a list of two-column matrices, one per document: the first column holds the full match and the second holds the capture group. We select the second column and collapse it into one comma-separated string using toString.

out <- stringr::str_match_all(documents$text, "PART \\d+ (.*) \n Matters")
sapply(out, function(x) toString(x[, 2]))

#[1] "Department of Communications, Department of Forestry"                   
#[2] "Department of Communications, Department of Health, Department of Sport"
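Since the question asks for a data-frame result, the same call could be wrapped in mutate(). This is only a minimal sketch, assuming the documents data frame from the question and that dplyr and stringr are installed; the column name departments and the variable result are made up for illustration.

```r
library(dplyr)
library(stringr)

# Example data copied from the question
documents <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text = c(
    "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 2 Department of Forestry \n Matters \n Blah blah blah",
    "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 3 Department of Health \n Matters \n Blah blah blah \n PART 5 Department of Sport \n Matters \n Blah blah"
  ),
  stringsAsFactors = FALSE
)

result <- documents %>%
  mutate(
    # str_match_all() returns one matrix per row; column 2 is the capture group.
    # "departments" is a hypothetical column name chosen for this sketch.
    departments = vapply(
      str_match_all(text, "PART \\d+ (.*) \n Matters"),
      function(m) toString(m[, 2]),
      character(1)
    )
  ) %>%
  select(doc_id, departments)

result
```

vapply() is used instead of sapply() so that the result is guaranteed to be a character vector with one element per row, which is what mutate() expects.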

Upvotes: 3

drmariod

Reputation: 11782

I can't think of a rowwise solution right now, but maybe this helps as well:

library(dplyr)
documents %>%
  mutate(text=strsplit(as.character(text), 'PART ')) %>%
  tidyr::unnest(text) %>%
  mutate(text=trimws(sub('\\d+ (.*) Matters.*', '\\1', text))) %>%
  filter(text != '') %>%
  group_by(doc_id) %>%
  summarise(text=paste(text, collapse=', '))

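It basically splits each text at "PART ", so we can work on each piece separately and cut the department name out of the longer string. The split leaves an empty first piece because every text starts with "PART ", which is why the empty strings are filtered out. Finally, we concatenate everything back together per doc_id.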
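To see what the split-and-trim steps produce on their own, here is a minimal base R sketch on a single shortened string. The variable names txt, pieces, depts and joined are illustrative only, and the sketch relies on R's default regex engine, where "." also matches a newline.

```r
# A shortened version of one text value from the question
txt <- "PART 1 Department of Communications \n Matters \n Blah \n PART 2 Department of Forestry \n Matters \n Blah"

# Splitting at "PART " leaves an empty first element, because the text starts with "PART "
pieces <- strsplit(txt, "PART ")[[1]]

# Keep only what sits between the part number and "Matters", then strip whitespace
depts <- trimws(sub("\\d+ (.*) Matters.*", "\\1", pieces))

# Drop the empty leading element
depts <- depts[depts != ""]

# Join the department names into one comma-separated string
joined <- paste(depts, collapse = ", ")
joined
```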

Upvotes: 1
