Reputation: 4281
I have the following column in my data set.
TENURE
April 30, 1789 – March 3,1797
March 4, 1797 - March 3, 1801
March 4, 1841 - April 4, 1841[Died]
March 4, 1881 - September 19, 1881[Assassinated]
January 20, 1969 - August 9, 1974[Resigned]
...
...
I have loaded the dataset into a dataframe in which one of the columns is named TENURE. Now I want to create two more columns, named "Start" and "End", based on TENURE. These two new columns would then be added to my dataframe. The two columns should look like this:
Start End
1789 1797
1797 1801
1841 1841
1881 1881
1969 1974
So far I have done the following:
require(XML)
require(stringr)
urlPresidents<-"http://www.theholidayspot.com/july4/us_presidents.htm"
presidents <- readHTMLTable(urlPresidents,which = 3,
skip.rows = 1,header = TRUE,
stringsAsFactors=FALSE)
yearList <- str_split(presidents$TENURE,pattern = ",",n = 1)
I am stuck and don't know how to proceed.
Upvotes: 2
Views: 176
Reputation: 56
I think it can be done in three steps:
Divide each date-range string into two parts:
a <- c('March 4, 1797 - March 3, 1801','March 4, 1841 - April 4, 1841[Died]')
a_devided <- strsplit(a,' - ')
Convert the strings into date objects:
a_devided_dates <- lapply(a_devided, function(x) as.Date( x, '%B %d, %Y') )
Extract years from dates:
lapply(a_devided_dates, function(x) format(x, '%Y'))
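Putting the three steps together, here is a minimal sketch that ends with the Start and End columns the question asks for. The sample values are copied from the question; the names normalized, parts, to_year and result are only illustrative. It assumes an English locale (so that %B recognizes the month names) and that the only separators are a hyphen or the en dash seen in the first row.

```r
# Sample values from the question; "\u2013" is the en dash in the first row.
tenure <- c("April 30, 1789 \u2013 March 3,1797",
            "March 4, 1797 - March 3, 1801",
            "March 4, 1841 - April 4, 1841[Died]")

# Step 1: normalize the en dash to a hyphen, then split each range in two.
normalized <- gsub("\u2013", "-", tenure, fixed = TRUE)
parts <- strsplit(normalized, "\\s*-\\s*")

# Steps 2 and 3: parse each half as a date and keep only the year.
# as.Date() ignores trailing text such as "[Died]" once the format is matched.
to_year <- function(x) as.integer(format(as.Date(x, "%B %d, %Y"), "%Y"))
years <- t(sapply(parts, to_year))  # one row per tenure, two columns

result <- data.frame(Start = years[, 1], End = years[, 2])
```

With the real data, the same two columns could be assigned to presidents$Start and presidents$End instead of a separate data frame.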
Upvotes: 2
Reputation: 18437
You can use str_extract_all to match every four-digit number in each string; for this data that is enough.
r <- str_extract_all(presidents$TENURE, "\\d{4}")
df <- data.frame(start = sapply(r, "[", 1), end = sapply(r, "[", 2))
head(df)
## start end
## 1 1789 1797
## 2 1797 1801
## 3 1801 1809
## 4 1809 1817
## 5 1817 1825
## 6 1825 1829
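To attach the years to the original dataframe as integers rather than building a separate one, something like the following sketch should work. The small presidents data frame here is a hypothetical stand-in for the scraped table, with values copied from the question, and it assumes every TENURE value contains exactly two four-digit numbers.

```r
library(stringr)

# Hypothetical stand-in for the scraped table (values taken from the question).
presidents <- data.frame(
  TENURE = c("April 30, 1789 \u2013 March 3,1797",
             "March 4, 1841 - April 4, 1841[Died]",
             "January 20, 1969 - August 9, 1974[Resigned]"),
  stringsAsFactors = FALSE
)

# simplify = TRUE returns a character matrix with one row per input string.
years <- str_extract_all(presidents$TENURE, "\\d{4}", simplify = TRUE)

# Add the new columns to the existing data frame as integers.
presidents$Start <- as.integer(years[, 1])
presidents$End   <- as.integer(years[, 2])
```

Because the pattern only looks for digits, the separator (hyphen or en dash) and suffixes like [Died] do not matter.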
Upvotes: 2