Extract string and its location using dplyr/tidyr approach

Question

The input data frame has three id columns and one raw_text. u_id corresponds to user, doc_id corresponds to the document of a particular user and sentence id corresponds to a sentence within a document of a user.

df <- data.frame(u_id=c(1,1,1,1,1,2,2,2),
                 doc_id=c(1,1,1,2,2,1,1,2),
                 sent_id=c(1,2,3,1,2,1,2,1),
                 text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                               "pertinent results: 2105-4-16 05:02pm gap-14 
                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                               "method exists and the former because calls to the corresponding",
                        "admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                        "pertinent results: 2105-4-16 05:02pm gap-14 
                        2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                        "method exists and the former because calls to the corresponding",
                        "method exists and the former because calls to the corresponding",
                        "method exists and the former because calls to the corresponding"))

Let's assume we need to extract all the dates and its location from raw_text. My approach so far -

#define a regex for date
date<-"([0-9]{2,4})[- . /]([0-9]{1,4})[- . /]([0-9]{2,4})"

#library
library(dplyr)
library(stringr)

#extract dates
df_i<-df %>% 
  mutate(i=str_extract_all(text,date)) %>% 
  mutate(date=lapply(i, function(x) if(identical(x, character(0))) NA_character_ else x)) %>% 
  unnest(date)

#extract date locations
df_ii<-str_locate_all(df$text,date)
n<-max(sapply(df_ii, nrow))
date_loc<-as.data.frame(do.call(rbind, lapply(df_ii, function (x) 
  rbind(x, matrix(, n-nrow(x), ncol(x))))))

The date extractions are in data.frame format. Is there an approach to put the string_locations in a data.frame format corresponding to its id and string? Ideally, the output should be -

output<-data.frame(id=c(1,1,2,2,3),
               text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                      "admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                      "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                      "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                      "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 ."),
               date=c("2001-4-19","2002-5-23","2105-4-16","2105-4-16","13.1 2105"),
               date_start=c(17,43,20,74,96),
               date_end=c(25,51,28,82,104))

acylam · Accepted Answer

You can do this:

regex = "\b[0-9]+[-][0-9]+[-][0-9]+\b"
df_i = str_extract_all(df$text, regex) 
df_ii = str_locate_all(df$text, regex) 

output1 = Map(function(x, y, z){
  if(length(y) == 0){
    y = NA
  }
  if(nrow(z) == 0){
    z = rbind(z, list(start = NA, end = NA))
  }
  data.frame(id = x, date = y, z)
}, df$id, df_i, df_ii) %>%
  do.call(rbind,.) %>%
  merge(df, .)

or stick with piping-only syntax:

regex = "[0-9]+[-][0-9]+[-][0-9]+"

output1 = df %>%
  {list(.$id, str_extract_all(.$text, regex), 
       str_locate_all(.$text, regex))} %>%
  {Map(function(x, y, z){
    if(length(y) == 0){
      y = NA
    }
    if(nrow(z) == 0){
      z = rbind(z, list(start = NA, end = NA))
    }
    data.frame(id = x, date = y, z)
  }, .[[1]], .[[2]], .[[3]])} %>%
  do.call(rbind, .) %>%
  merge(df, .)

Result:

  id
1  1
2  1
3  2
4  2
5  2
6  3
                                                                                                                 text
1                                                        admission date: 2001-4-19 discharge date: 2002-5-23 service:
2                                                        admission date: 2001-4-19 discharge date: 2002-5-23 service:
3 pertinent results: 2105-4-16 05:02pm gap-14 
                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .
4 pertinent results: 2105-4-16 05:02pm gap-14 
                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .
5 pertinent results: 2105-4-16 05:02pm gap-14 
                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .
6                                                     method exists and the former because calls to the corresponding
       date start end
1 2001-4-19    17  25
2 2002-5-23    43  51
3 2105-4-16    20  28
4 2105-4-16    77  85
5 2105-4-16   104 112
6          NA  NA

Notes:

Your regular expression incorrectly extracts "13.1" from "rdw-13.1 2105-4-16" because you added spaces in your [- . /]. date<-"([0-9]{2,4})[-./]([0-9]{1,4})[-./]([0-9]{2,4})" should do it.
mutate allows you to use a variable you have just created inside the same function call, so there is no need to use two separate mutate's for df_i.
For my pipping-only solution, {} are needed around list() and Map() to override the dplyr default of feeding in the output from the previously step to the first argument of the next function.

For instance:

df %>%
      list(.$id, str_extract_all(.$text, regex), 
                 str_locate_all(.$text, regex))

becomes:

list(df, df$id, str_extract_all(df$text, regex), 
                str_locate_all(df$text, regex))

which is not what we want.

Edits:

OP updated his df to include rows where text does not include any dates. This would cause my original solution to fail since some elements of the list from str_extract_all and str_locate_all would have length(0) and nrow(0). I solved this issue by adding two if statements:

if(length(y) == 0){
  y = NA
}
if(nrow(z) == 0){
  z = rbind(z, list(start = NA, end = NA))
}

This makes dates = "NA and adds a row of NA's to start and end for those rows with no dates. This allows id to have one row to bind to in the data.frame step.

Extract string and its location using dplyr/tidyr approach

Answers (1)

Related Questions