I have a dataset of county executives and their year of inauguration. I need to work out the year in which each executive was inaugurated.
The problem is that the notation in the "year" variable is inconsistent.
For instance, let's say I start with this:
df <- data.frame(year = c(2000, "from 2001 to 2002", "01-feb-2003", 2000, "01-jan-2002", "from 2004 to 2005"),
                 executive.name = c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
                 district = rep(c(1001, 1002), each = 3))
I want it to look like this, keeping only the first year that appears in each entry:
df.neat <- data.frame(year = c(2000, 2001, 2003, 2000, 2002, 2004),
                      executive.name = c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
                      district = rep(c(1001, 1002), each = 3))
Note that the inauguration cycles do not always align across districts (2000, 2001, and 2003 for district 1001; 2000, 2002, and 2004 for district 1002).
Upvotes: 1
Reputation: 783
A solution in base R:
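# within() returns a modified copy, so assign the result back to df
df <- within(df, {
  match_ind <- regexpr("\\d{4}", year)  # start position of the first four-digit run
  year <- as.numeric(substr(year, match_ind, match_ind + 3))  # keep those four digits as a number
  rm(match_ind)  # drop the helper so it does not become a column
})
df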
# output
  year executive.name district
1 2000        Johnson     1001
2 2001          Smith     1001
3 2003      Alleghany     1001
4 2000        Roberts     1002
5 2002         Clarke     1002
6 2004        Tollson     1002
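One caveat: when an entry contains no four-digit run, regexpr() returns -1 and substr() then returns unrelated characters instead of a missing value. A minimal sketch of a guard against that, assuming you would rather get NA for such entries (the helper name extract_year is hypothetical, not part of the code above):

```r
# Hypothetical helper: first four-digit run as a number, NA when there is none
extract_year <- function(x) {
  x <- as.character(x)
  m <- regexpr("[0-9]{4}", x)          # -1 where no four-digit run is found
  hit <- !is.na(m) & m != -1           # entries that actually matched
  out <- rep(NA_character_, length(x))
  out[hit] <- substr(x[hit], m[hit], m[hit] + 3)
  as.numeric(out)
}

df$year <- extract_year(df$year)
```

Entries without a year then show up as NA, which makes them easy to find with is.na(df$year).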
Upvotes: 1
Reputation: 18632
library(dplyr)
library(stringr)
df |>
  mutate(year = as.numeric(str_extract(year, "\\d{4}")))
#   year executive.name district
# 1 2000        Johnson     1001
# 2 2001          Smith     1001
# 3 2003      Alleghany     1001
# 4 2000        Roberts     1002
# 5 2002         Clarke     1002
# 6 2004        Tollson     1002
Upvotes: 2