Maximilian

Reputation: 4229

Re-format scraped date in R

I have scraped HTML and now I have rows like this:

                               rows
1: for the Year Ended 31 March 2013

I would like to extract only the expression "31 March 2013". The surrounding text can vary. The extracted expression should then be converted to a date, preferably formatted as 31-3-2013.

How can I do this?

Upvotes: 2

Views: 75

Answers (3)

Sven Hohenstein

Reputation: 81713

If there are no other numbers in your strings, you can use the following approach:

string <- "for the Year Ended 31 March 2013"

format(as.Date(sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string), 
               "%d %B %Y"), "%d-%m-%Y")
# [1] "31-03-2013"

Here, sub extracts the relevant substring, as.Date converts it to a Date object, and format rearranges the date elements into the desired order.
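
To make the three steps easier to follow, here is a minimal sketch that splits the one-liner into one statement per step. The variable names are illustrative and not part of the answer, and the Sys.setlocale call is an added assumption so that %B reads English month names regardless of the session locale.

```r
# Assumption: use the "C" locale so that %B parses English month names
invisible(Sys.setlocale("LC_TIME", "C"))

string <- "for the Year Ended 31 March 2013"

# Step 1: keep only the "day month year" substring captured by the group
date_text <- sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string)

# Step 2: parse the text into a Date (day, full month name, four-digit year)
date_value <- as.Date(date_text, format = "%d %B %Y")

# Step 3: render the Date as day-month-year text
formatted <- format(date_value, "%d-%m-%Y")
formatted
# [1] "31-03-2013"
```

Inspecting date_text after step 1 is a quick way to check what the regular expression actually captured before any parsing happens.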


It also works with additional text and one-digit days:

string <- c("for the Year Ended 31 March 2013",
            "1 January 2013 the Year Began",
            "for the Year Ended 31 March 2013 and not now")
format(as.Date(sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string),
       "%d %B %Y"), "%d-%m-%Y")
# [1] "31-03-2013" "01-01-2013" "31-03-2013"
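
If the strings may contain other numbers, for example "3 of 5", the generic pattern above can capture the wrong piece. One possible way around that, sketched here under stated assumptions, is to accept only real English month names from base R's month.name constant and to look up the month number with match, which also avoids any dependence on the locale. The helper name extract_date and the sample rows are hypothetical.

```r
# Alternation of the twelve English month names from base R's month.name
month_alt <- paste(month.name, collapse = "|")
pattern <- paste0("([0-9]{1,2}) (", month_alt, ") ([0-9]{4})")

# Hypothetical helper: returns a Date vector, with NA where no date is found
extract_date <- function(x) {
  # Each list element holds: full match, day, month name, year (or is empty)
  hits <- regmatches(x, regexec(pattern, x))
  iso <- vapply(hits, function(m) {
    if (length(m) == 0) return(NA_character_)
    sprintf("%s-%02d-%02d", m[4], match(m[3], month.name), as.integer(m[2]))
  }, character(1))
  as.Date(iso, format = "%Y-%m-%d")
}

rows <- c("see 3 of 5 for the Year Ended 31 March 2013",
          "1 January 2013 the Year Began",
          "no date here")
format(extract_date(rows), "%d-%m-%Y")
# [1] "31-03-2013" "01-01-2013" NA
```

Only the first date in each string is returned; strings without a recognisable date yield NA instead of an error.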

Upvotes: 3

agstudy

Reputation: 121598

Another option, using the stringr and lubridate packages:

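library(stringr)
library(lubridate)
xx <- "for the Year Ended 31 March 2013"
# The pattern assumes a two-digit day and a four-digit year at the end of the string
dmy(str_extract(xx, '[0-9]{2}.*[0-9]{4}$'))
# [1] "2013-03-31 UTC"
# (recent lubridate versions return a Date and print "2013-03-31")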

Upvotes: 2

lukeA

Reputation: 54247

rows <- c("for the Year Ended 31 March 2013 ... 31 March 2013 ...",
          "for the Year Ended 1 December 2011")
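m <- gregexpr("[0-9]+ [A-Za-z]+ [0-9]{4}", rows)
# In a non-English locale, switch to English month names first, e.g.
# Sys.setlocale("LC_TIME", "english")  # Windows locale name; "C" works on any platform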
(res <- lapply(regmatches(rows, m), as.Date, "%d %B %Y"))
# [[1]]
# [1] "2013-03-31" "2013-03-31"
# 
# [[2]]
# [1] "2011-12-01"
lapply(res, format, "%d-%m-%Y") # %m is the zero-padded month number
# [[1]]
# [1] "31-03-2013" "31-03-2013"
# 
# [[2]]
# [1] "01-12-2011"
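
The question asked for 31-3-2013, that is, a month without a leading zero. As far as I know, format has no portable code for an unpadded month, so one possible sketch is to convert the month to an integer and paste the pieces back together. The dates vector here is illustrative.

```r
dates <- as.Date(c("2013-03-31", "2011-12-01"))

# as.integer() drops the zero padding that "%m" produces
unpadded <- paste(format(dates, "%d"),
                  as.integer(format(dates, "%m")),
                  format(dates, "%Y"),
                  sep = "-")
unpadded
# [1] "31-3-2013"  "01-12-2011"
```

The day keeps its padding here; wrapping format(dates, "%d") in as.integer() as well would remove it.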

Upvotes: 1
