Leprechault

Reputation: 1823

Extract dates in a complex string

I have a problem extracting dates from file names. In my example I have the file.name object:

file.name<- c("AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif","RINCAODOSSOARES051B20210107T133231_20190518T133919_T22JSM_2021_05_19_01_18_22",
"VILAPALMA33K20181018T133231_20190518T133919_T23JCM_2020_05_19_01_18_22.tif")

I need to extract into a new object the specific dates 20190518, 20210107 and 20181018 embedded in the file names. I can't use substr because the area names have different lengths (AZAMBUJAI002A, RINCAODOSSOARES051B and VILAPALMA33K), and I can't simply strip the letters either, because the area names contain numeric IDs (002, 051 and 33). The dates at the end, separated by "_" before ".tif", are not useful information.

My desired output is:

mydates
[1] 2019-05-18
[2] 2021-01-07
[3] 2018-10-18

Is there any solution to the problem described? Thanks!!

Upvotes: 0

Views: 786

Answers (3)

Sinh Nguyen

Reputation: 4497

Here is a way to extract the dates with a regex, assuming every year starts with 20 (20xx):

library(stringr)
library(lubridate)

# First 8-digit run that looks like 20YYMMDD (month 01-12, day 01-31)
date_string <- str_extract(file.name,
  "20\\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\\d|3[01])")

date_string
#> [1] "20190518" "20210107" "20181018"

ymd(date_string)
#> [1] "2019-05-18" "2021-01-07" "2018-10-18"

Created on 2021-05-19 by the reprex package (v2.0.0)
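
For readers who prefer not to load stringr, the same first-match extraction can be sketched in base R. This is only an illustration under the same assumption as above (years start with 20); the month and day ranges in the pattern are my own choice, and regmatches silently drops elements that have no match, so the result can be shorter than the input.

```r
file.name <- c("AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif",
               "RINCAODOSSOARES051B20210107T133231_20190518T133919_T22JSM_2021_05_19_01_18_22",
               "VILAPALMA33K20181018T133231_20190518T133919_T23JCM_2020_05_19_01_18_22.tif")

# 20YY, then a month 01-12, then a day 01-31 (assumed date layout)
pattern <- "20[0-9]{2}(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])"

# regexpr() locates the first match in each string;
# regmatches() returns the matched substrings.
date_string <- regmatches(file.name, regexpr(pattern, file.name))

# Convert the yyyymmdd strings to Date objects
mydates <- as.Date(date_string, format = "%Y%m%d")
print(mydates)
```

If some file names might not contain a date, checking `regexpr(pattern, file.name) == -1` first shows which ones would be dropped.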

Upvotes: 1

PavoDive

Reputation: 6496

library(lubridate)

ymd(gsub("(^.*_)(20[0-9]{2}_)([0-9]{2}_)([0-9]{2}_)(.*$)", 
         "\\2\\3\\4", 
         file.name))

ymd is a lubridate function that identifies YYYY-MM-DD dates, almost irrespective of the separator used.

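gsub replaces the part of each string matched by the regex. Here the pattern matches the whole string, and the replacement keeps only some of the capture groups. The regex pieces: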

  • (^.*_) is the first capture group. Takes anything from the beginning to an underscore.
  • (20[0-9]{2}_) is the second capture group. It takes a string that starts with 20 and is followed by any two digits and an underscore.
  • ([0-9]{2}_) is used for both the third and fourth capture groups. Each takes two digits followed by an underscore.
  • (.*$) is the last (5th) capture group. Takes anything to the end of the string.
  • "\\2\\3\\4" is the replacement: it returns the second, third and fourth capture groups.
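
To see what each capture group actually holds, the pattern can be run through regexec and regmatches on a single file name. This is just an illustrative sketch; the variable names x, groups and kept are mine and not part of the answer.

```r
# One file name from the question, used only for illustration
x <- "AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif"
pattern <- "(^.*_)(20[0-9]{2}_)([0-9]{2}_)([0-9]{2}_)(.*$)"

# regexec() records where the full match and each capture group matched;
# regmatches() turns those positions into substrings.
groups <- regmatches(x, regexec(pattern, x))[[1]]

# Element 1 is the full match, elements 2 to 6 are capture groups 1 to 5
names(groups) <- c("full", paste0("group", 1:5))
print(groups)

# Pasting groups 2 to 4 reproduces what the "\\2\\3\\4" replacement keeps
kept <- paste0(groups[c("group2", "group3", "group4")], collapse = "")
print(kept)
```

Because the first group is greedy, it swallows everything up to the last place where the rest of the pattern can still match, which is why this version picks up the underscore-separated date near the end of the name.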

EDIT:

The explanation of the code above still holds, but to retrieve the dates that come right after the area names, the code needed is:

ymd(gsub("(^.*[A-Z])(20[0-9]{2})([0-9]{2})([0-9]{2})(.*$)",
         "\\2\\3\\4", 
         file.name))

Upvotes: 0

dario

Reputation: 6483

Solution using base R functions. Works as long as the format is always "yyyymmdd" and the relevant string appears before the first underscore:

file.name<- c("AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif",
              "RINCAODOSSOARES051B20210107T133231_20190518T133919_T22JSM_2021_05_19_01_18_22",
              "VILAPALMA33K20181018T133231_20190518T133919_T23JCM_2020_05_19_01_18_22.tif")

Using gsub twice: first (in the inner call) to drop everything after the first underscore, and then to extract the sequence of eight digits ([0-9]{8}):

dates <- gsub(".*([0-9]{8}).*", "\\1", gsub("^([^_]*)_.*", "\\1", file.name))

Finally, use as.Date to convert the strings to an R Date object (it can be re-cast to a string using format):

dates_as_actual_date <- as.Date(dates, format = "%Y%m%d")

dates_as_actual_date is an R Date object and looks like this:

[1] "2019-05-18" "2021-01-07" "2018-10-18"
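
As a small sketch of the re-casting mentioned above: format turns a Date back into a string in whatever layout is needed, and as.Date returns NA for a string that does not form a valid date, which gives a cheap sanity check. The output layout "%d/%m/%Y" and the invalid example string are arbitrary choices for illustration.

```r
# The yyyymmdd strings produced by the extraction step
dates <- c("20190518", "20210107", "20181018")
d <- as.Date(dates, format = "%Y%m%d")

# Re-cast the Date objects to strings in another layout
as_text <- format(d, "%d/%m/%Y")
print(as_text)

# A string that is eight digits but not a real date (month 13) becomes NA
bad <- as.Date("20191301", format = "%Y%m%d")
print(is.na(bad))

# Guard: stop if any extracted string failed to parse
stopifnot(!anyNA(d))
```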

Upvotes: 1
