sm002
sm002

Reputation: 101

Extracting dates following a specific word from a column of strings using dplyr

I am trying to extract the most recent date that a report was added in an R dataframe of reports. The text always looks like Date Ordered: M/DD/YYYY and may contain 0 many times in a given report. If it's repeating, I want the most recent (usually the last) instance, and I'm trying to convert it to a date in a mutated dplyr column.

Using the code below on my actual dataframe, I get the error:

Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { :
missing value where TRUE/FALSE needed

However, it runs fine on a single item making me think that it's trying to concatenate the entire column.

The test code doesn't give an error, but actually pulls the last date from the last report for all instances:

     lastdate
1 1999-04-15
2 1999-04-15
dataset=data.frame(cbind(ID=c(001,002),
                         Report=c("Blah Blah Date Ordered: 5/19/2000 test is positive. Date Ordered: 4/2/2005 Additional testing negative.",
                                  "Meh Date Ordered: 4/15/1999")),
                   stringsAsFactors = F)`

dataset %>% 
  mutate(lastdate = as.Date(last(gsub("Date Ordered:\\s+", "",
                                      strapplyc(Report, 
                                                "Date Ordered:\\s*\\d+/\\d+/\\d+", simplify = TRUE))),
                            "%m/%d/%Y"))

Desired output should be:

2005-4-2
1999-4-15

Actual for dataset:

Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { : 
  missing value where TRUE/FALSE needed

Actual for test data:

    lastdate
1 1999-04-15
2 1999-04-15

Upvotes: 0

Views: 403

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627100

I suggest a gsub like

dataset$lastsdate <- as.Date(gsub(".*Date Ordered:\\s*(\\d{1,2}/\\d{1,2}/\\d{4}).*|.*","\\1", dataset$Report),"%m/%d/%Y")

See the regex in action.

The regex matches:

  • .* - any 0+ chars as many as possible
  • Date Ordered: - a literal substring
  • \s* - 0+ whitespaces
  • (\d{1,2}/\d{1,2}/\d{4}) - Capturing group 1 (\1): 1 or 2 digits, /, 1 or 2 digits, /, 4 digits
  • .* - the rest of the string
  • | - or
  • .* - the entire string.

Upvotes: 0

Related Questions