Reputation: 4229
I have scraped HTML and now I have rows like this:
rows
1: for the Year Ended 31 March 2013
I would like to extract only the expression "31 March 2013". The text around the expression could vary. The extracted expression should then be converted to a date, preferably formatted as 31-3-2013.
How can I do this?
Upvotes: 2
Views: 75
Reputation: 81713
If there are no other numbers in your strings, you can use the following approach:
string <- "for the Year Ended 31 March 2013"
format(as.Date(sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string),
"%d %B %Y"), "%d-%m-%Y")
# [1] "31-03-2013"
Here sub() extracts the relevant substring, as.Date() creates an object of class Date, and format() changes the order of the date elements.
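To make the three steps easier to follow, here is the same call split into intermediate values. This is only a sketch; the names date_text and date_value are made up for illustration, and parsing the month name assumes an English LC_TIME locale.

```r
string <- "for the Year Ended 31 March 2013"

# Step 1: keep only the captured "day month year" substring.
date_text <- sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string)
date_text
# [1] "31 March 2013"

# Step 2: parse the text into a Date (month names depend on LC_TIME).
date_value <- as.Date(date_text, format = "%d %B %Y")
class(date_value)
# [1] "Date"

# Step 3: render the Date as day-month-year text.
formatted <- format(date_value, "%d-%m-%Y")
formatted
# [1] "31-03-2013"
```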
It also works with additional text and one-digit days:
string <- c("for the Year Ended 31 March 2013",
"1 January 2013 the Year Began",
"for the Year Ended 31 March 2013 and not now")
format(as.Date(sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string),
"%d %B %Y"), "%d-%m-%Y")
# [1] "31-03-2013" "01-01-2013" "31-03-2013"
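If the strings may contain other numbers, or no date at all, a stricter pattern could help. The following is a minimal sketch, not part of the answer above: it assumes English month names (taken from the built-in month.name), and the example rows are invented for illustration.

```r
# Hypothetical input: extra numbers in the first row, no date in the last.
rows <- c("Page 12 of 20 for the Year Ended 31 March 2013",
          "1 January 2013 the Year Began",
          "no date in this row")

# Match only "<1-2 digit day> <English month name> <4-digit year>".
months_alt <- paste(month.name, collapse = "|")
pattern <- paste0("\\b[0-9]{1,2} (", months_alt, ") [0-9]{4}\\b")

# regexpr() returns -1 for rows without a match; those stay NA.
m <- regexpr(pattern, rows, perl = TRUE)
date_text <- rep(NA_character_, length(rows))
date_text[m > 0] <- regmatches(rows, m)

# Parsing "%B" still relies on an English LC_TIME locale.
result <- format(as.Date(date_text, format = "%d %B %Y"), "%d-%m-%Y")
result
# [1] "31-03-2013" "01-01-2013" NA
```

Rows without a recognizable date come back as NA instead of being passed unchanged to as.Date().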
Upvotes: 3
Reputation: 121598
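Another option:
library(stringr)
library(lubridate)
xx <- "for the Year Ended 31 March 2013"
# the pattern expects a two-digit day and the year at the end of the string
dmy(str_extract(xx, '[0-9]{2}.*[0-9]{4}$'))
# [1] "2013-03-31 UTC"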
Upvotes: 2
Reputation: 54247
rows <- c("for the Year Ended 31 March 2013 ... 31 March 2013 ...",
"for the Year Ended 1 December 2011")
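# [A-Za-z] instead of [A-z], which would also match characters such as "[" and "_"
m <- gregexpr("[0-9]+ [A-Za-z]+ [0-9]{4}", rows)
# Sys.setlocale("LC_TIME", "english") # may be needed if your locale is not English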
(res <- lapply(regmatches(rows, m), as.Date, "%d %B %Y"))
# [[1]]
# [1] "2013-03-31" "2013-03-31"
#
# [[2]]
# [1] "2011-12-01"
lapply(res, format.Date, "%d-%m-%Y") # %m is always zero-padded
# [[1]]
# [1] "31-03-2013" "31-03-2013"
#
# [[2]]
# [1] "01-12-2011"
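The question asks for 31-3-2013, that is, without zero padding, which the format codes used above do not produce. One possible workaround, sketched here with base R only, is to convert the day and month to integers before pasting. The names dates and unpadded are made up for illustration, and dropping the padding from the day as well is an assumption.

```r
dates <- as.Date(c("2013-03-31", "2011-12-01"))

# as.integer() drops the leading zero from the day and month components.
unpadded <- paste(as.integer(format(dates, "%d")),
                  as.integer(format(dates, "%m")),
                  format(dates, "%Y"),
                  sep = "-")
unpadded
# [1] "31-3-2013" "1-12-2011"
```

The result is a character vector, not a Date, so it is best produced as the last step before output.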
Upvotes: 1