Reputation: 68

Split text by headings as delimiters and save as dataframe columns in R

I have a data frame of drugs (df) with their associated information in a text column that contains a number of headings (two of which are provided as examples). I need to split the text so that the text under each heading ends up in its own column (as shown in the required data frame).

heads <- c("Indications", "Administration")
df <- data.frame(drugs = c("acetaminophen", "prednisolone"), text = c("Indications1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\nAdministration\nUsually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.", "Indications \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\nAdministration\nGeneralDosage depends on the condition of indications and the patient response."))

required <- data.frame(drugs = c("acetaminophen", "prednisolone"), Indications = c(c("Pain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.", "Treatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.")), Administration = c("Usually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.", "GeneralDosage depends on the condition of indications and the patient response."))

What I've tried

Using `strsplit`

This gives me a list, but I lose the headings, and because not all drugs have all of the headings this doesn't work. I also don't know how to incorporate the result into the existing df.

library(rebus)

head.rx <- sapply(heads, function(x) as.regex(x) %R% free_spacing(any_char(0,3)) %R% newline(1,2)) %R% optional(space(0,3))
split <- strsplit(df$text[1], or1(head.rx), perl = TRUE)

Getting start and end for each heading

To extract the text in between (sorry if it's very preliminary ... I'm not so good at custom functions)

extract_heading <- function(text){
  
  #-1 is because I thought It would throw an error for the last heading
  extract.list <- vector(mode = "list", length = length(heads)-1)
  names(extract.list) <- heads[1:length(heads)-1]
 
  for (i in 1:length(heads)-1) {
    
    #the start and end regexes (based on the text to capture only the headings)
    start <- as.regex(heads[i]) %R% free_spacing(any_char(0,3)) %R% newline(1,2)
    end <- as.regex(heads[i+1]) %R% free_spacing(any_char(0,3)) %R% newline(1,2)
    
    #the strings that need to be extracted (from one heading to the next)
    rx <- start %R% free_spacing(any_char(3,5000)) %R% lookahead(end)
    
    #extract
    extract.list[i] <- stri_extract_first_regex(text, rx)
  }
  extract.list
}
  
##tried to see if it works (it gives me all NAs)
extract_heading(df$text[1])

Use the `map` function

But I can't figure out how to do it.

head.extract <- sapply(heads, function(x) x %R% free_spacing(any_char(3,9000)) %R% heads[which(heads ==x) +1])
purrr:: map2(df$text[1], head.extract, stri_extract_first_regex(df$text[1], head.extract))

I appreciate your help in advance.

Upvotes: 0

Answers (3)

BrunoPLC

Reputation: 91

A for-loop option (admittedly not the friendliest).

But first:

This assumes that you know the heading names and their order, and that the headings do not recur in the content.

If so:

n<-c("Indications","Administration")


df1<-df["drugs"]
df1[,n]<-NA



for (i in length(n):1){

#For the first heading
  if (i == 1){ 
    df1[,n[1]]<-df$text[grepl(n[1],df$text)]
    df1[,n[1]]<- gsub("\n"," ",df1[,n[1]])
    df1[,n[1]]<-sub(paste0(".*",n[1]," (.+)",n[2]," .*"),"\\1",df1[,n[1]])
    df1[,n[1]]<- gsub(n[1]," ",df1[,n[1]])
    
    }else{
      
      #For the last heading
      if (i == length(n)){ 
        df1[,n[length(n)]]<-df$text[grepl(n[length(n)],df$text)]
        df1[,n[length(n)]]<- gsub("\n"," ",df1[,n[length(n)]])
        df1[,n[length(n)]]<-sub(paste0(".*",n[length(n)]," (.+)"),"\\1",df1[,n[length(n)]])
        df1[,n[length(n)]]<- gsub(n[length(n)]," ",df1[,n[length(n)]])
        }else{
          
          #Remaining headings
          df1[,n[i]]<-df$text[grepl(n[i],df$text)]
          df1[,n[i]]<- gsub("\n"," ",df1[,n[i]])
          df1[,n[i]]<-sub(paste0(".*",n[i]," (.+)",n[i+1]," .*"),"\\1",df1[,n[i]])
          df1[,n[i]]<- gsub(n[i]," ",df1[,n[i]])
          }
    }
  }

Upvotes: 1

ThomasIsCoding

Reputation: 102299

A base R option using strsplit

with(
  df,
  cbind(df,
  setNames(
  as.data.frame(
    do.call(rbind,strsplit(
    text,
    split = sprintf("(%s).*?\\n",paste0(heads,collapse = "|")),
    perl = TRUE
  ))[,-1]),
  heads))
)

gives

          drugs
1 acetaminophen
2  prednisolone

                                                                                                                                                                         text
1                                         Indications1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\nAdministration\nUsually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.
2 Indications \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\nAdministration\nGeneralDosage depends on the condition of indications and the patient response.

                                                     Indications
1                                                                                                               Pain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\n
2 Treatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\n
                                                                                                                                         Administration
1 Usually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.
2                                                                       GeneralDosage depends on the condition of indications and the patient response.

Upvotes: 2

JBGruber

Reputation: 12430

So let’s start with the main function and the regular expression. I would use stringi’s stri_extract_all_regex for this, but stringr::str_extract_all() would also work if you find that easier. Or you can use regmatches with regexpr or gregexpr if you are determined to stay in base R.

My suggestion for the regular expression is shown below:

library(stringi)

stri_extract_all_regex(
  df$text,
  "(?<=Indications)[\\s\\S]+?(?=Administration)"
)

## [[1]]
## [1] "1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\n"
## 
## [[2]]
## [1] " \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\n"

The individual parts:

(?<=Indications) is a lookbehind, meaning it matches the position immediately following ‘Indications’
[\\s\\S] matches any character, including \n (. by default matches any character except \n, and matching across line breaks is essential here)
+? means we want at least 1 character matching [\\s\\S]; the ? makes the match lazy, so we get the shortest possible string.
(?=Administration) is a lookahead, meaning it matches the position immediately followed by ‘Administration’

This means we extract the string between and not including ‘Indications’ and ‘Administration’.
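
To see why the lazy quantifier and [\\s\\S] matter, here is a small sketch on a made-up string (x is a hypothetical input, not part of the data above) in which the second heading appears twice:

```r
library(stringi)

# Hypothetical toy string: "Administration" occurs twice
x <- "Indications\nA\nAdministration\nB\nAdministration\nC"

# Lazy quantifier: stops at the first "Administration"
lazy <- stri_extract_first_regex(
  x, "(?<=Indications)[\\s\\S]+?(?=Administration)"
)

# Greedy quantifier: runs on to the last "Administration"
greedy <- stri_extract_first_regex(
  x, "(?<=Indications)[\\s\\S]+(?=Administration)"
)

# "." does not cross "\n" by default, so nothing matches here
dot <- stri_extract_first_regex(
  x, "(?<=Indications).+?(?=Administration)"
)

lazy    # "\nA\n"
greedy  # "\nA\nAdministration\nB\n"
dot     # NA
```

On the example data each heading occurs only once, so greedy and lazy matching return the same strings there; the difference only shows up when a heading repeats.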

Next, we want to wrap this in a function to make it more flexible:

extract_between <- function(str, string_1, string_2) {
  unlist(stri_extract_all_regex(
    str,
    paste0("(?<=", string_1, ")[\\s\\S]+?(?=", string_2, ")")
  ))
}

The function extracts all characters between but not including string_1 and string_2. Try it out if you like.
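
The extracted strings still start with whatever followed the heading on its own line (the "1" or the trailing space) plus a line break. If you want output closer to the required data frame, one possible variant is sketched below. The name extract_section and the toy vector are hypothetical, and the sketch assumes every heading sits on a line of its own, so the rest of that line can be discarded:

```r
library(stringi)

# Hypothetical variant: extract between two headings, then tidy the result
extract_section <- function(str, start, end) {
  pattern <- paste0("(?<=", start, ")[\\s\\S]+?(?=", end, ")")
  # first match only, so the result has one element per input string
  out <- stri_extract_first_regex(str, pattern)
  # drop the remainder of the heading line (e.g. "1" or " ") and its newline
  out <- stri_replace_first_regex(out, "^[^\\n]*\\n", "")
  # remove leading and trailing whitespace
  stri_trim_both(out)
}

# Made-up miniature of the text column
toy <- c(
  "Indications1\nPain\nFever\nAdministration\nOral.",
  "Indications \nInflammation.\nAdministration\nGeneral"
)

extract_section(toy, "Indications", "Administration")
# "Pain\nFever"   "Inflammation."
extract_section(toy, "Administration", "$")
# "Oral."   "General"
```

A string that lacks the starting heading yields NA rather than an error, which may help when not every drug has every heading.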

Finally, we want to create a new column for each headline. I use a simple for loop for this. You could use lapply to maybe make it more efficient, but I did not test if that would improve anything and it makes the code less readable.

# for the final match, we need the $ which represents the end of the string
heads_new <- c(heads, "$")

for (i in seq_len(length(heads_new) - 1)) {
  df[[heads_new[i]]] <- extract_between(
    df$text,
    string_1 = heads_new[i],
    string_2 = heads_new[i + 1]
  )
}

# for nicer printing
tibble::as_tibble(df)

## # A tibble: 2 × 4
##   drugs         text              Indications           Administration          
##   <chr>         <chr>             <chr>                 <chr>                   
## 1 acetaminophen "Indications1\nP… "1\nPain\nSymptomati… "\nUsually administered…
## 2 prednisolone  "Indications \nT… " \nTreatment of a w… "\nGeneralDosage depend…
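
As a rough sketch of the lapply-style alternative mentioned above (not benchmarked), Map() can pair each heading with the one that follows it and return the new columns as a named list. The toy text vector and the drug labels below are made up for illustration:

```r
library(stringi)

heads <- c("Indications", "Administration")

# Made-up miniature of the text column
text <- c(
  "Indications1\nPain\nAdministration\nOral.",
  "Indications \nInflammation.\nAdministration\nGeneral"
)

# each heading ends where the next one starts; the last one ends at "$"
ends <- c(heads[-1], "$")

# one character vector per heading; Map() names the list after heads
cols <- Map(function(start, end) {
  stri_extract_first_regex(
    text, paste0("(?<=", start, ")[\\s\\S]+?(?=", end, ")")
  )
}, heads, ends)

# bind the new columns to a (hypothetical) drugs column
out <- cbind(
  data.frame(drugs = c("drug_a", "drug_b"), stringsAsFactors = FALSE),
  as.data.frame(cols, stringsAsFactors = FALSE)
)
names(out)
# "drugs" "Indications" "Administration"
```

This builds a new data frame instead of modifying df in place; whether that is preferable to the loop is mostly a matter of taste.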

This assumes the headlines are in the correct order and that you know that order. You can change the behaviour by passing all headlines as string_2 at the same time, so matching stops as soon as another headline is encountered (this is the reason I use lazy matching, i.e. the ?, above). I would say this will generally cause more issues, because your headlines might also occur elsewhere in the text, so I would prefer the first approach if possible:

extract_between(
  df$text,
  heads_new[1],
  paste0("(", paste0(heads_new, collapse = "|"), ")")
)

## [1] "1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\n"                                                                                                              
## [2] " \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\n"
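
For anyone who prefers to stay in base R, the same pattern should also work with regexpr and regmatches when perl = TRUE is set. The following is only a sketch on a made-up vector x (its second element deliberately lacks the first heading); it pads non-matches with NA, because regmatches returns matched strings only:

```r
# Made-up input: the second string has no "Indications" heading
x <- c(
  "Indications1\nPain\nAdministration\nOral.",
  "Administration\nGeneral"
)

pattern <- "(?<=Indications)[\\s\\S]+?(?=Administration)"

# position of the first match per string, -1 where there is none
m <- regexpr(pattern, x, perl = TRUE)

# regmatches() skips non-matches, so fill a pre-sized NA vector
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)

out
# "1\nPain\n" NA
```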