Reputation: 3923
Related questions here R/dplyr: How to only keep integers in a data frame? and here R extract first number from string but this attempts to capture a relatively common case in some of the social sciences.
Imagine data as in the example below that are a mixture of text and numerals in one field. Perhaps the result of some survey process. You want to extract all the numbers reliably and ignore the text but if there are multiple numbers, perhaps a range of values you want to take the mean
or median
or some other numeric function to give you one "best" answer.
library(dplyr)
library(stringr)
library(purrr)
Years <- c("0 to 1 year",
"5 yrs",
"About 10 yrs",
"20 yrs",
"4 yrs",
"4-5 yrs",
"75 to 100 YEARS old",
">1 yearsmispelled or whatever")
df <- data.frame(Years)
Upvotes: 2
Views: 657
Reputation: 388907
You don't need to check for the length of the extracted number because mean
/median
/sum
of 1 number is the same number.
sapply(stringr::str_extract_all(df$Years, '\\d+'),function(x) mean(as.numeric(x)))
#[1] 0.5 5.0 10.0 20.0 4.0 4.5 87.5 1.0
To do this using pipe
:
library(dplyr)
library(stringr)
library(purrr)
df %>%
mutate(NewYear = str_extract_all(Years, '\\d+'),
NewYear = map_dbl(NewYear, ~mean(as.numeric(.x))))
# Years NewYear
#1 0 to 1 year 0.5
#2 5 yrs 5.0
#3 About 10 yrs 10.0
#4 20 yrs 20.0
#5 4 yrs 4.0
#6 4-5 yrs 4.5
#7 75 to 100 YEARS old 87.5
#8 >1 yearsmispelled or whatever 1.0
Upvotes: 2
Reputation: 3923
Although I know solutions are available using other regular expressions and with nothing more than base
tools, I will present a tidyverse
solution. Trivial to substitute in many other numeric operators such as median
, and this solution will work no matter how many numbers are found.
library(dplyr)
library(stringr)
df <- df %>%
mutate(
NewYears = str_extract_all(Years, "[0-9]+") %>%
purrr::map(
~ifelse(
length(.x) == 1,
as.numeric(.x),
mean(unlist(as.numeric(.x)))
)
)
)
df
#> Years NewYears
#> 1 0 to 1 year 0.5
#> 2 5 yrs 5
#> 3 About 10 yrs 10
#> 4 20 yrs 20
#> 5 4 yrs 4
#> 6 4-5 yrs 4.5
#> 7 75 to 100 YEARS old 87.5
#> 8 >1 yearsmispelled or whatever 1
Upvotes: 4