Reputation: 684
I have a dataframe created using readtext(). It has two columns: doc_id and text. For each row (doc_id) I want to extract every substring (in my case, the name of a government department) that sits between two marker strings, which are repeated n times in the text column. For example:
documents <- data.frame(doc_id = c("doc_1", "doc_2"),
text = c("PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 2 Department of Forestry \n Matters \n Blah blah blah", "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 3 Department of Health \n Matters \n Blah blah blah \n PART 5 Department of Sport \n Matters \n Blah blah"))
What I would like to get is:
"doc_1" "Department of Communications, Department of Forestry"
"doc_2" "Department of Communications, Department of Health, Department of Sport"
Essentially, I want to extract the string between PART and Matters. I would like to use dplyr::rowwise operations on the dataframe, but I have no idea how to extract multiple matches between two repeated strings.
Upvotes: 2
Views: 1191
Reputation: 335
#Import Tidyverse
library(tidyverse)
#Use a helper variable to store the departments extracted by the pattern
Helper <- str_extract_all(string = documents$text, pattern = "Department.*\\n")
#Clean Up the columns.
Helper1 <- lapply(Helper, FUN = str_replace_all, pattern=" \\n", replacement = ", ")
documents$Departments<-str_replace(str_trim(unlist(lapply(Helper1, FUN =paste, collapse= ""))), pattern = ",$", replacement = "")
#Remove Previous column of texts
documents <- select(documents, -c("text"))
Upvotes: 0
Reputation: 389275
We can use str_match_all from stringr and extract the words in between "PART" and "Matters". It returns a list of two-column matrices, one per document. We select the second column of each, which holds the capture group, and collapse it into one comma-separated string using toString.
out <- stringr::str_match_all(documents$text, "PART \\d+ (.*) \n Matters")
sapply(out, function(x) toString(x[, 2]))
#[1] "Department of Communications, Department of Forestry"
#[2] "Department of Communications, Department of Health, Department of Sport"
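Since the question asks for a dplyr workflow, here is a minimal sketch of how the same pattern could be attached as a new column. The helper name extract_departments and the column name departments are made up for illustration, and stringsAsFactors = FALSE is set so that text is a character column on older R versions. Because str_match_all is already vectorised over rows, rowwise() should not be needed here.

```r
library(dplyr)
library(stringr)

# Example data from the question (stringsAsFactors = FALSE keeps text as character)
documents <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text = c(
    "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 2 Department of Forestry \n Matters \n Blah blah blah",
    "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 3 Department of Health \n Matters \n Blah blah blah \n PART 5 Department of Sport \n Matters \n Blah blah"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one comma-separated string per input element
extract_departments <- function(x) {
  # One matrix per element; column 2 holds the capture group
  matches <- str_match_all(x, "PART \\d+ (.*) \n Matters")
  vapply(matches, function(m) toString(m[, 2]), character(1))
}

result <- documents %>%
  mutate(departments = extract_departments(text)) %>%
  select(doc_id, departments)

print(result)
```

A document with no match gets an empty string in departments, because toString() of a zero-length vector is "".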
Upvotes: 3
Reputation: 11782
I cannot think of a rowwise solution right now, but maybe this helps as well:
library(dplyr)
documents %>%
  mutate(text = strsplit(as.character(text), 'PART ')) %>%
  tidyr::unnest(text) %>%
  mutate(text = trimws(sub('\\d+ (.*) Matters.*', '\\1', text))) %>%
  filter(text != '') %>%
  group_by(doc_id) %>%
  summarise(text = paste(text, collapse = ', '))
It basically splits all your text at PART, so that each element can be handled separately to cut the department name out of the longer string. Afterwards, everything is concatenated back together per doc_id.
Upvotes: 1