skumar
skumar

Reputation: 353

Extract dates from strings in R

I have a vector of strings as shown below. I want to extract dates.

check_values <- c("deficit based on wage statement 7/14/ to 7/17/2015",
                "Deficit Due: $1205.73 -$879.63= $326.10 x 70%=$228.2",
                "Deficit Due for 12 wks pd - 7/14/15 thru 10/5/15;",
                "Deficit due to wage statement: 4/22/15 thru 5/12/15",
                "depos transcript 7/10/15 for 7/8/15 depos",
                "difference owed for 4/25/15-5/22/15",
                "tpd 4:30:2015 - 5:22:2015",
                "Medical TREATMENT DATES:  6/30/2015 -  6/30/2015",
                "4/25/15-5/22/15",
                "Medical")

                check_values <- as.data.table(check_values)
                names(check_values) <- "check_memo"

Before extracting dates, I want to create a Date_Flag. I have used following code but it gives different output:

check_values$Date_Flag <- ifelse(grepl("([0-9]+)(/|-|:)([0-9]+)(/|-|:)([0-9]+)(/|-|:)", check_values$check_memo), 1, 0)
 #[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
 #Expected Output: FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE

After this Date_Flag is created, I want to extract dates (both parts). Can somebody please tell me what is wrong with above regular regression?

Thanks

Upvotes: 2

Views: 1125

Answers (1)

akrun
akrun

Reputation: 887108

We can use str_count to create the 'Date_Flag' where the assumption is that if there are 2 full dates in each element of 'check_memo', we get TRUE or else FALSE.

library(data.table)
library(stringr)
pat <- "[0-9]{1,2}[/:][0-9]{1,2}[/:][0-9]{2,4}"
check_values[,Date_Flag := str_count(check_memo, pat)==2]
check_values
#                                             check_memo Date_Flag
#1:   deficit based on wage statement 7/14/ to 7/17/2015     FALSE
#2: Deficit Due: $1205.73 -$879.63= $326.10 x 70%=$228.2     FALSE
#3:    Deficit Due for 12 wks pd - 7/14/15 thru 10/5/15;      TRUE
#4:  Deficit due to wage statement: 4/22/15 thru 5/12/15      TRUE
#5:            depos transcript 7/10/15 for 7/8/15 depos      TRUE
#6:                  difference owed for 4/25/15-5/22/15      TRUE
#7:                            tpd 4:30:2015 - 5:22:2015      TRUE
#8:     Medical TREATMENT DATES:  6/30/2015 -  6/30/2015      TRUE
#9:                                      4/25/15-5/22/15      TRUE
#10:                                             Medical     FALSE

If we need the Dates extracted, use the same pattern with str_extract_all

check_values[(Date_Flag),  paste0("Date", 1:2) := 
                  transpose(str_extract_all(check_memo, pat))]

check_values
                                              check_memo #Date_Flag     Date1     Date2
# 1:   deficit based on wage statement 7/14/ to 7/17/2015     FALSE        NA        NA
# 2: Deficit Due: $1205.73 -$879.63= $326.10 x 70%=$228.2     FALSE        NA        NA
# 3:    Deficit Due for 12 wks pd - 7/14/15 thru 10/5/15;      TRUE   7/14/15   10/5/15
# 4:  Deficit due to wage statement: 4/22/15 thru 5/12/15      TRUE   4/22/15   5/12/15
# 5:            depos transcript 7/10/15 for 7/8/15 depos      TRUE   7/10/15    7/8/15
# 6:                  difference owed for 4/25/15-5/22/15      TRUE   4/25/15   5/22/15
# 7:                            tpd 4:30:2015 - 5:22:2015      TRUE 4:30:2015 5:22:2015
# 8:     Medical TREATMENT DATES:  6/30/2015 -  6/30/2015      TRUE 6/30/2015 6/30/2015
# 9:                                      4/25/15-5/22/15      TRUE   4/25/15   5/22/15
#10:                                              Medical     FALSE        NA        NA

Upvotes: 3

Related Questions