Reputation: 68
I have a data frame of drugs (df) and their associated information in a text column with a number of headings (two of which are provided as examples). I need to split the text and put the text belonging to each heading in its own column (as shown in the required data frame).
heads <- c("Indications", "Administration")
df <- data.frame(drugs = c("acetaminophen", "prednisolone"), text = c("Indications1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\nAdministration\nUsually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.", "Indications \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\nAdministration\nGeneralDosage depends on the condition of indications and the patient response."))
required <- data.frame(drugs = c("acetaminophen", "prednisolone"), Indications = c(c("Pain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.", "Treatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.")), Administration = c("Usually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.", "GeneralDosage depends on the condition of indications and the patient response."))
What I have tried so far: strsplit (code below). This gives me a list, but without the headings, and because not all drugs have all of the headings this doesn't work. I also don't know how to incorporate the result into the existing df.
library(rebus)
head.rx <- sapply(heads, function(x) as.regex(x) %R% free_spacing(any_char(0,3)) %R% newline(1,2)) %R% optional(space(0,3))
split <- strsplit(df$text[1], or1(head.rx), perl = TRUE)
To extract the text in between (sorry if it's very preliminary ... I'm not so good at custom functions)
extract_heading <- function(text){
  #-1 is because I thought It would throw an error for the last heading
  extract.list <- vector(mode = "list", length = length(heads)-1)
  names(extract.list) <- heads[1:length(heads)-1]
  for (i in 1:length(heads)-1) {
    #the start and end regexes (based on the text to capture only the headings)
    start <- as.regex(heads[i]) %R% free_spacing(any_char(0,3)) %R% newline(1,2)
    end <- as.regex(heads[i+1]) %R% free_spacing(any_char(0,3)) %R% newline(1,2)
    #the strings that need to be extracted (from one heading to the next)
    rx <- start %R% free_spacing(any_char(3,5000)) %R% lookahead(end)
    #extract
    extract.list[i] <- stri_extract_first_regex(text, rx)
  }
  extract.list
}
##tried to see if it works (it gives me all NAs)
extract_heading(df$text[1])
I also thought of using the map function, but can't figure out how to do it.
head.extract <- sapply(heads, function(x) x %R% free_spacing(any_char(3,9000)) %R% heads[which(heads ==x) +1])
purrr:: map2(df$text[1], head.extract, stri_extract_first_regex(df$text[1], head.extract))
I appreciate your help in advance.
Upvotes: 0
Views: 394
Reputation: 91
Here is a for loop option (admittedly not the friendliest).
It assumes that the heading names and their order are known, and that the headings do not recur in the content.
If so:
n<-c("Indications","Administration")
df1<-df["drugs"]
df1[,n]<-NA
for (i in length(n):1){
  #For the first heading
  if (i == 1){
    df1[,n[1]]<-df$text[grepl(n[1],df$text)]
    df1[,n[1]]<- gsub("\n"," ",df1[,n[1]])
    df1[,n[1]]<-sub(paste0(".*",n[1]," (.+)",n[2]," .*"),"\\1",df1[,n[1]])
    df1[,n[1]]<- gsub(n[1]," ",df1[,n[1]])
  }else{
    #For the last heading
    if (i == length(n)){
      df1[,n[length(n)]]<-df$text[grepl(n[length(n)],df$text)]
      df1[,n[length(n)]]<- gsub("\n"," ",df1[,n[length(n)]])
      df1[,n[length(n)]]<-sub(paste0(".*",n[length(n)]," (.+)"),"\\1",df1[,n[length(n)]])
      df1[,n[length(n)]]<- gsub(n[length(n)]," ",df1[,n[length(n)]])
    }else{
      #Remaining headings
      df1[,n[i]]<-df$text[grepl(n[i],df$text)]
      df1[,n[i]]<- gsub("\n"," ",df1[,n[i]])
      df1[,n[i]]<-sub(paste0(".*",n[i]," (.+)",n[i+1]," .*"),"\\1",df1[,n[i]])
      df1[,n[i]]<- gsub(n[i]," ",df1[,n[i]])
    }
  }
}
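The core of each branch is the sub() call, and it can be tried on a single string. Note that the patterns above expect a space directly after each heading, so a heading written as "Indications1" would not be matched by them. The sketch below uses a shortened, made-up string (x is hypothetical) and a slightly more tolerant pattern; the "[^ ]*" part is an assumption added for illustration, not part of the loop above.

```r
n <- c("Indications", "Administration")

# Shortened, hypothetical text in the same shape as df$text
x <- "Indications1\nPain relief.\nAdministration\nUsually oral."

# Step 1: turn newlines into spaces, as in the loop above
x <- gsub("\n", " ", x)

# Step 2: keep only what lies between the two headings.
# "[^ ]*" (an assumption added here) skips characters glued to the
# heading, such as the "1" in "Indications1".
between <- sub(paste0(".*", n[1], "[^ ]* (.+)", n[2], " .*"), "\\1", x)
between
```

With this input, between holds the text from after the first heading up to the second one, including the trailing space left over from the replaced newline.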
Upvotes: 1
Reputation: 102299
A base R option using strsplit
with(
  df,
  cbind(
    df,
    setNames(
      as.data.frame(
        do.call(rbind, strsplit(
          text,
          split = sprintf("(%s).*?\\n", paste0(heads, collapse = "|")),
          perl = TRUE
        ))[, -1]
      ),
      heads
    )
  )
)
gives
drugs
1 acetaminophen
2 prednisolone
text
1 Indications1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\nAdministration\nUsually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.
2 Indications \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on
blood and lymphatic systems in the palliative treatment of various diseases.\nAdministration\nGeneralDosage depends on the condition of indications and the patient response.
Indications
1 Pain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\n
2 Treatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\n
Administration
1 Usually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.
2 GeneralDosage depends on the condition of indications and the patient response.
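To see why the first column is dropped with [,-1], it may help to look at what strsplit returns for a single string. A minimal sketch, using a shortened, made-up text (txt is hypothetical, not taken from the question):

```r
heads <- c("Indications", "Administration")

# Shortened, hypothetical text in the same shape as df$text
txt <- "Indications1\nPain relief.\nAdministration\nUsually oral."

# Same pattern construction as in the answer above:
# a heading, then anything up to and including the next newline
pat <- sprintf("(%s).*?\\n", paste0(heads, collapse = "|"))

parts <- strsplit(txt, split = pat, perl = TRUE)[[1]]
parts
# parts is c("", "Pain relief.\n", "Usually oral.")
```

Because the text starts with a heading, the first piece is an empty string, which is what [,-1] removes. If a row lacks one of the headings, the number of pieces differs between rows, so the rbind step would no longer line up with heads.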
Upvotes: 2
Reputation: 12430
So let’s start with the main function and the regular expression. I would use stringi’s stri_extract_all_regex for this, but stringr::str_extract_all() would also work if you find that easier. Or you can use regmatches with regexpr or gregexpr if you are determined to stay in base R (see here, for example).
My suggestion for the regular expression is shown below:
library(stringi)
stri_extract_all_regex(
  df$text,
  "(?<=Indications)[\\s\\S]+?(?=Administration)"
)
## [[1]]
## [1] "1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\n"
##
## [[2]]
## [1] " \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\n"
The individual parts:
(?<=Indications) is a lookbehind, meaning it matches the position following ‘Indications’.
[\\s\\S] matches any character, including \n (. matches any character except \n, and matching across line breaks is essential here).
+? indicates we want at least 1 character matching [\\s\\S]; the ? makes the match lazy, so we get the shortest possible string.
(?=Administration) is a lookahead, meaning it matches the position followed by ‘Administration’.
This means we extract the string between, and not including, ‘Indications’ and ‘Administration’.
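The effect of the individual parts can be checked on a small string. The sketch below stays in base R, using regmatches with regexpr and perl = TRUE; txt and the helper first_match are hypothetical names introduced only for this illustration.

```r
# Shortened, hypothetical text; note that "Administration" occurs twice
txt <- "Indications\nPain.\nAdministration\nOral.\nAdministration notes"

# Helper (hypothetical name): first match of a Perl-style pattern, base R only
first_match <- function(x, pattern) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

# Lazy quantifier: stops before the first "Administration"
lazy <- first_match(txt, "(?<=Indications)[\\s\\S]+?(?=Administration)")

# Greedy quantifier: runs up to the last "Administration"
greedy <- first_match(txt, "(?<=Indications)[\\s\\S]+(?=Administration)")

# "." does not cross "\n", so nothing is found at all
dot <- first_match(txt, "(?<=Indications).+?(?=Administration)")

lazy    # "\nPain.\n"
greedy  # "\nPain.\nAdministration\nOral.\n"
dot     # character(0)
```

One thing to keep in mind with this base R route: regmatches drops elements without a match, so the result can be shorter than the input vector.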
Next, we want to wrap this in a function to make it more flexible:
extract_between <- function(str, string_1, string_2) {
  # use the str argument rather than the global df$text
  unlist(stri_extract_all_regex(
    str,
    paste0("(?<=", string_1, ")[\\s\\S]+?(?=", string_2, ")")
  ))
}
The function extracts all characters between but not including string_1 and string_2. Try it out if you like.
Finally, we want to create a new column for each headline. I use a simple for loop for this. You could use lapply to maybe make it more efficient, but I did not test if that would improve anything and it makes the code less readable.
# for the final match, we need the $ which represents the end of the string
heads_new <- c(heads, "$")
for (i in seq_len(length(heads_new) - 1)) {
  df[[heads_new[i]]] <- extract_between(
    df$text,
    string_1 = heads_new[i],
    string_2 = heads_new[i + 1]
  )
}
# for nicer printing
tibble::as_tibble(df)
## # A tibble: 2 × 4
## drugs text Indications Administration
## <chr> <chr> <chr> <chr>
## 1 acetaminophen "Indications1\nP… "1\nPain\nSymptomati… "\nUsually administered…
## 2 prednisolone "Indications \nT… " \nTreatment of a w… "\nGeneralDosage depend…
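The lapply variant mentioned above might look like the following sketch. It is self-contained, so it repeats the helper and uses a shortened, made-up data frame (the drug names and texts are hypothetical); whether it is any faster than the loop is untested.

```r
library(stringi)

heads <- c("Indications", "Administration")

# Shortened, hypothetical data in the same shape as the question's df
df <- data.frame(
  drugs = c("drug_a", "drug_b"),
  text = c(
    "Indications1\nPain.\nAdministration\nOral.",
    "Indications \nInflammation.\nAdministration\nDepends."
  )
)

extract_between <- function(str, string_1, string_2) {
  unlist(stri_extract_all_regex(
    str,
    paste0("(?<=", string_1, ")[\\s\\S]+?(?=", string_2, ")")
  ))
}

# "$" stands for the end of the string, as in the loop version
heads_new <- c(heads, "$")
idx <- seq_len(length(heads_new) - 1)

# One vector per heading, assigned as new columns in one step
df[heads] <- lapply(idx, function(i) {
  extract_between(df$text, heads_new[i], heads_new[i + 1])
})
df
```

The result has the same columns as the loop version; the only difference is that all columns are created in a single assignment.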
This assumes the headlines are in the correct order and you know that order. You can change the behaviour by using all headlines as string_2 at the same time, so matching stops as soon as R encounters another headline (this is the reason I use lazy mode, i.e. ?, above). I would say that will generally produce more issues, as your headlines might occur elsewhere in the text, so I would prefer the first approach if possible:
extract_between(
  df$text,
  heads_new[1],
  paste0("(", paste0(heads_new, collapse = "|"), ")")
)
## [1] "1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\n"
## [2] " \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\n"
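Since the question mentions that not every drug has every heading, here is a rough sketch of how the "stop at any headline" variant behaves in that case. The texts are made up, and using stri_extract_first_regex instead of the helper above is a choice made for this illustration: it returns exactly one value per text, and NA where a heading is absent.

```r
library(stringi)

heads <- c("Indications", "Administration")

# Hypothetical texts; the second one has no "Administration" heading
texts <- c(
  "Indications\nPain.\nAdministration\nOral.",
  "Indications\nInflammation."
)

# Stop at whichever heading comes next, or at the end of the string
stop_at <- paste0("(?:", paste0(c(heads, "$"), collapse = "|"), ")")

res <- lapply(heads, function(h) {
  # First match only, so each text yields exactly one value (NA if absent)
  stri_extract_first_regex(
    texts,
    paste0("(?<=", h, ")[\\s\\S]+?(?=", stop_at, ")")
  )
})
names(res) <- heads
as.data.frame(res)
```

Here the Administration column is NA for the second text, while its Indications text runs to the end of the string. The caveat from above still applies: a heading word that also occurs inside the content would cut the match short.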
Upvotes: 2