Reputation: 345
I have a data frame that has years in it (data type chr):
Years:
5 yrs
10 yrs
20 yrs
4 yrs
I want to keep only the integers to get a data frame like this (data type num):
Years:
5
10
20
4
How do I do this in R?
Upvotes: 4
Views: 695
Reputation: 5788
Base R solution:
clean_years <- as.numeric(gsub("\\D", "", Years))
Data:
Years <- c("5 yrs",
           "10 yrs",
           "20 yrs",
           "4 yrs",
           "5 yrs")
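As a rough sketch of how this applies to a data frame column rather than a bare vector (the data frame name df is an assumption, since the question does not give one), the same call can overwrite the column in place. One caveat: "\\D" deletes every non-digit, so digits on either side of a separator get glued together.

```r
# Hypothetical data frame; the name `df` is assumed for illustration.
df <- data.frame(Years = c("5 yrs", "10 yrs", "20 yrs", "4 yrs"),
                 stringsAsFactors = FALSE)

# Remove every non-digit character, then convert the result to numeric.
df$Years <- as.numeric(gsub("\\D", "", df$Years))

str(df$Years)
#>  num [1:4] 5 10 20 4

# Caveat: a range such as "4-5 yrs" collapses to 45, not 4 or 4.5.
range_value <- as.numeric(gsub("\\D", "", "4-5 yrs"))
range_value
#> [1] 45
```

If ranges like that can occur in the data, a pattern that extracts digit runs separately is probably the safer choice.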
Upvotes: 1
Reputation: 3923
Per your additional requirements, here is a more general-purpose solution, though it has limits too. The nice thing about the more complicated Years3 solution is that it deals more gracefully with unexpected but quite possible answers.
library(dplyr)
library(stringr)
library(purrr)
Years <- c("5 yrs",
           "10 yrs",
           "20 yrs",
           "4 yrs",
           "4-5 yrs",
           "75 to 100 YEARS old",
           ">1 yearsmispelled or whatever")
df <- data.frame(Years)
# just the numbers but loses the -5 in 4-5
df$Years1 <- as.numeric(sub("(\\d{1,4}).*", "\\1", df$Years))
#> Warning: NAs introduced by coercion
# first run of digits via str_extract; also loses the -5 in 4-5, and the result is character, not numeric
df$Years2 <- str_extract(df$Years, "[0-9]+")
# a lot more needed to account for averaging
df$Years3 <- str_extract_all(df$Years, "[0-9]+") %>%
  purrr::map( ~ ifelse(length(.x) == 1,
                       as.numeric(.x),
                       mean(unlist(as.numeric(.x)))))
df
#>                           Years Years1 Years2 Years3
#> 1                         5 yrs      5      5      5
#> 2                        10 yrs     10     10     10
#> 3                        20 yrs     20     20     20
#> 4                         4 yrs      4      4      4
#> 5                       4-5 yrs      4      4    4.5
#> 6           75 to 100 YEARS old     75     75   87.5
#> 7 >1 yearsmispelled or whatever     NA      1      1
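For comparison, the averaging idea can be sketched in base R without stringr or purrr. This is a simplified variant, not the code above: the input vector is made up for illustration, strings with no digits are mapped to NA by choice, and like the original it treats the "-" in "4-5" as a separator rather than a minus sign.

```r
# Made-up inputs for illustration, including one string with no digits.
Years <- c("5 yrs", "4-5 yrs", "75 to 100 YEARS old", "no digits here")

# gregexpr() locates every run of digits; regmatches() pulls them out,
# giving one character vector per input string.
digit_runs <- regmatches(Years, gregexpr("[0-9]+", Years))

# Average the numbers found in each string; no digits at all gives NA.
avg_years <- vapply(
  digit_runs,
  function(x) if (length(x) == 0) NA_real_ else mean(as.numeric(x)),
  numeric(1)
)

avg_years
#> [1]  5.0  4.5 87.5   NA
```

Because vapply() returns a plain numeric vector, the result can be assigned to a data frame column without creating a list column.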
Upvotes: 1
Reputation: 4358
You need to extract the numbers and convert them to numeric (this assumes the column is named year and every value ends in " yrs"):
df$year <- as.numeric(sub(" yrs", "", df$year))
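A small sketch of where this approach stops working, using a made-up vector x: removing the literal text " yrs" only yields a number when nothing else is left in the string, so anything with a different suffix or a range comes back as NA.

```r
# Made-up inputs: two well-formed values, one range, one different suffix.
x <- c("5 yrs", "10 yrs", "4-5 yrs", "75 YEARS")

# Strip the literal " yrs"; suppressWarnings() hides the
# "NAs introduced by coercion" warning for the values that do not parse.
parsed <- suppressWarnings(as.numeric(sub(" yrs", "", x, fixed = TRUE)))

parsed
#> [1]  5 10 NA NA
```

Checking anyNA(parsed) after the conversion is a cheap way to spot values that did not fit the expected format.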
Upvotes: 4