Edison

Reputation: 4281

String Split in R

I have the following column in my data set.

TENURE

April 30, 1789 – March 3,1797
March 4, 1797 - March 3, 1801
March 4, 1841 - April 4, 1841[Died]
March 4, 1881 - September 19, 1881[Assassinated]
January 20, 1969 - August 9, 1974[Resigned]
...
...

I have loaded the dataset into a data frame in which one column is named TENURE. Now I want to create two more columns named "Start" and "End" based on TENURE. These two new columns would then be added to my data frame. The result for the two columns would look like this:

Start   End
1789    1797
1797    1801
1841    1841
1881    1881
1969    1974

So far I have done the following:

require(XML)
require(stringr)
urlPresidents<-"http://www.theholidayspot.com/july4/us_presidents.htm"
presidents <- readHTMLTable(urlPresidents,which = 3,
                            skip.rows = 1,header = TRUE,
                            stringsAsFactors=FALSE)
yearList <- str_split(presidents$TENURE,pattern = ",",n = 1)

I am stuck and don't know how to proceed.

Upvotes: 2

Views: 176

Answers (2)

cito

Reputation: 56

I think it can be done in three steps:

  1. Divide each date-range string into two parts:

    a <- c('March 4, 1797 - March 3, 1801','March 4, 1841 - April 4, 1841[Died]')
    a_devided <- strsplit(a,' - ')
    
  2. Convert the strings into Date objects (as.Date ignores trailing text such as "[Died]"):

    a_devided_dates <- lapply(a_devided, function(x) as.Date( x, '%B %d, %Y') )
    
  3. Extract years from dates:

    lapply(a_devided_dates, function(x) format(x, '%Y'))
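
The three steps above can be combined into one pass that yields the two columns the question asks for. This is only a sketch under a few assumptions: the sample vector `tenure` and the result name `result` are made up for illustration, every range is separated by a plain " - " (the first row in the question uses an en dash, which would need normalising first), and the time locale is switched to "C" so that `%B` recognises English month names.

```r
# Assumption: force the C locale so %B matches English month names.
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical sample values taken from the question.
tenure <- c("March 4, 1797 - March 3, 1801",
            "March 4, 1841 - April 4, 1841[Died]")

# Step 1: split each range into its start and end strings.
parts <- strsplit(tenure, " - ", fixed = TRUE)

# Steps 2 and 3: parse both halves as dates and keep only the year.
# as.Date ignores trailing characters such as "[Died]".
years <- t(sapply(parts, function(x) format(as.Date(x, "%B %d, %Y"), "%Y")))

# One row per input string, one column per end of the range.
result <- data.frame(Start = as.integer(years[, 1]),
                     End   = as.integer(years[, 2]))
print(result)
```

Going through Date objects is more work than pulling out the digits directly, but it keeps the full dates available if the month or day is needed later.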
    

Upvotes: 2

dickoa

Reputation: 18437

You can use str_extract_all (from stringr) to match every four-digit number; in this case that is enough, because the years are the only four-digit runs in each string.

r <- str_extract_all(presidents$TENURE, "\\d{4}")
df <- data.frame(start = sapply(r, "[", 1), end = sapply(r, "[", 2))
head(df)
##   start  end
## 1  1789 1797
## 2  1797 1801
## 3  1801 1809
## 4  1809 1817
## 5  1817 1825
## 6  1825 1829
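
To get the columns into the original data frame, as the question asks, the same idea can be written in base R without stringr. The following is a minimal sketch, not the answerer's code: it assumes a small stand-in `presidents` data frame built from the question's sample rows (instead of the scraped table) and assumes every TENURE value contains exactly two four-digit years.

```r
# Stand-in for the scraped table; rows copied from the question.
presidents <- data.frame(
  TENURE = c("April 30, 1789 \u2013 March 3,1797",
             "March 4, 1797 - March 3, 1801",
             "March 4, 1841 - April 4, 1841[Died]",
             "January 20, 1969 - August 9, 1974[Resigned]"),
  stringsAsFactors = FALSE
)

# Find every run of four digits in each string (base R equivalent
# of str_extract_all with the pattern "\\d{4}").
years <- regmatches(presidents$TENURE,
                    gregexpr("[0-9]{4}", presidents$TENURE))

# The first match is the start year and the second is the end year.
presidents$Start <- as.integer(sapply(years, `[`, 1))
presidents$End   <- as.integer(sapply(years, `[`, 2))

print(presidents[, c("Start", "End")])
```

Because the pattern only looks at digits, it is unaffected by the dash style, the missing space in "3,1797", or suffixes such as "[Died]".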

Upvotes: 2
