Reputation: 33
I would like to extract the LAST 4 digits in a given string, but can't figure it out. The LAST 4 digits could appear as "XXXX" or "XXXX-". Ultimately, I have a list of heterogeneous entries that include single years (e.g., 2001- or 2001), lists of years (e.g., 2001, 2004-), ranges of years (e.g., 2001-2010), or a combination of these, with or without a dash ("-") at the end of the entry.
I realize that '$' anchors a match to the END of the string, and '^' anchors it to the START. I'm able to extract the FIRST 4 digits easily. Here is an example of what I'm able to do, followed by the code that is not working for the LAST 4 digits:
library(stringr)
test <- c("2009-", "2008-2015", "2001-, 2003-2010, 2012-")
str_extract_all(test, "^[[:digit:]]{4}") # Extracts FIRST 4
[[1]]
[1] "2009"
[[2]]
[1] "2008"
[[3]]
[1] "2001"
str_extract_all(test, "[[:digit:]]{4}$") # Does not extract LAST 4
[[1]]
character(0)
[[2]]
[1] "2015"
[[3]]
character(0)
str_extract_all(test, "\\d{4}$")
[[1]]
character(0)
[[2]]
[1] "2015"
[[3]]
character(0)
The result I desire is:
[1] "2009" "2015" "2012"
Upvotes: 1
Views: 4550
Reputation: 887138
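We can use sub. The leading .* is greedy, so it consumes as much of the string as it can and the capture group ends up on the last run of four digits; the trailing .* absorbs an optional "-" at the end.
sub(".*(\\d{4}).*$", "\\1", test)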
#[1] "2009" "2015" "2012"
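Since the question already loads stringr, here is a minimal sketch of two alternatives that stay within that package. The original `\\d{4}$` attempts return `character(0)` for entries ending in "-" because the dash sits between the digits and the end of the string. The first alternative uses a lookahead to allow that optional dash; the second collects every 4-digit run and keeps the last one. The names `last_year` and `last_year_alt` are only illustrative, and the sketch assumes the only character that can follow the final year is a single "-".

```r
library(stringr)

test <- c("2009-", "2008-2015", "2001-, 2003-2010, 2012-")

# Four digits that are followed by an optional dash and then the end of the string.
# The lookahead (?=-?$) checks for the dash without including it in the match.
last_year <- str_extract(test, "\\d{4}(?=-?$)")
print(last_year)
# [1] "2009" "2015" "2012"

# Alternative: extract every 4-digit run, then keep the last one per entry.
# Entries with no 4-digit run yield NA instead of raising an error.
last_year_alt <- vapply(
  str_extract_all(test, "\\d{4}"),
  function(x) if (length(x) > 0) tail(x, 1) else NA_character_,
  character(1)
)
print(last_year_alt)
# [1] "2009" "2015" "2012"
```

The lookahead version is stricter: it returns NA when the entry ends with anything other than a year or a year plus "-", whereas the second version returns the last 4-digit run regardless of what follows it.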
Upvotes: 4