YouLocalRUser

Reputation: 457

dataframe breakdown by year

I have a dataset on county executives and their year of inauguration. I need to extract the year in which each executive was inaugurated.

The problem is that the notation under the "year" variable is inconsistent.

For instance, let's say I start with this:

df <- data.frame(year= c(2000, "from 2001 to 2002", "01-feb-2003", 2000, "01-jan-2002", "from 2004 to 2005"),
                  executive.name= c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
                  district= rep(c(1001, 1002), each=3))

I want it to look like this:

df.neat <- data.frame(year= c(2000, 2001, 2003, 2000, 2002, 2004),
                  executive.name= c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
                  district= rep(c(1001, 1002), each=3))

Note that the inauguration cycles do not always align across districts (2000, 2001, and 2003 for district 1001; 2000, 2002, and 2004 for district 1002).

Upvotes: 1

Views: 57

Answers (2)

nightstand

Reputation: 783

A solution in base R:

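within(df, {
    # position of the first run of four digits in each entry
    match_ind <- regexpr("\\d{4}", year)
    # keep those four characters and convert to numeric, as in df.neat
    year <- as.numeric(substr(year, match_ind, match_ind + 3))
    # drop the helper so it does not become a column
    rm(match_ind)
})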

# output
  year executive.name district
1 2000        Johnson     1001
2 2001          Smith     1001
3 2003      Alleghany     1001
4 2000        Roberts     1002
5 2002         Clarke     1002
6 2004        Tollson     1002
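
If some entries might contain no four-digit year at all, regexpr() returns -1 for them and the substr() approach would produce an empty or meaningless string. A minimal sketch of a more defensive variant, still in base R, is below. The helper name extract_first_year is hypothetical (it is not from the question or from any package), and returning NA for non-matching entries is an assumption about what you would want in that case.

```r
# Sample data from the question
df <- data.frame(
  year = c(2000, "from 2001 to 2002", "01-feb-2003", 2000, "01-jan-2002", "from 2004 to 2005"),
  executive.name = c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
  district = rep(c(1001, 1002), each = 3)
)

# Hypothetical helper: first run of four digits as a number, NA if there is none
extract_first_year <- function(x) {
  x <- as.character(x)
  m <- regexpr("[0-9]{4}", x)
  hit <- !is.na(m) & m > 0           # TRUE where a match was found
  out <- rep(NA_real_, length(x))
  # regmatches() returns only the matched pieces, one per TRUE in 'hit'
  out[hit] <- as.numeric(regmatches(x, m))
  out
}

df$year <- extract_first_year(df$year)
```

With the sample data every entry matches, so the result should agree with df.neat; the NA branch only matters for messier inputs.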

Upvotes: 1

LMc

Reputation: 18632

library(dplyr)
library(stringr)

df |>
  mutate(year = as.numeric(str_extract(year, "\\d{4}")))
#   year executive.name district
# 1 2000        Johnson     1001
# 2 2001          Smith     1001
# 3 2003      Alleghany     1001
# 4 2000        Roberts     1002
# 5 2002         Clarke     1002
# 6 2004        Tollson     1002
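
One caveat worth knowing: "\\d{4}" matches the first four digits of any longer digit run, so an entry such as an eight-digit date or an ID number would yield a wrong "year". The sample data has no such entries, so the code above is fine for it. The sketch below compares the plain pattern with a word-boundary version; the input vector x is made up for illustration and is not from the question.

```r
library(stringr)

# Hypothetical inputs (not from the question); the last two contain longer digit runs
x <- c("from 2001 to 2002", "01-feb-2003", "ref 20010315", "id 123456 in 2004")

# Plain pattern: takes the first four consecutive digits, wherever they occur
loose <- as.numeric(str_extract(x, "\\d{4}"))

# Word boundaries: only accepts a run of exactly four digits standing on its own
strict <- as.numeric(str_extract(x, "\\b\\d{4}\\b"))

data.frame(x, loose, strict)
```

Whether the stricter pattern is the better choice depends on what else can appear in the column; if years can be glued to other digits or letters, the boundaries would need adjusting.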

Upvotes: 2
