Chuck P
Chuck P

Reputation: 3923

Extract numbers from a string and then manipulate as numbers (optionally) before creating new column

Related questions here R/dplyr: How to only keep integers in a data frame? and here R extract first number from string but this attempts to capture a relatively common case in some of the social sciences.

Imagine data as in the example below that are a mixture of text and numerals in one field. Perhaps the result of some survey process. You want to extract all the numbers reliably and ignore the text but if there are multiple numbers, perhaps a range of values you want to take the mean or median or some other numeric function to give you one "best" answer.

library(dplyr)
library(stringr)
library(purrr)
Years <- c("0 to 1 year",
           "5 yrs",
           "About 10 yrs",
           "20 yrs",
           "4 yrs",
           "4-5 yrs",
           "75 to 100 YEARS old",
           ">1 yearsmispelled or whatever")
df <- data.frame(Years)


Upvotes: 2

Views: 657

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388907

You don't need to check for the length of the extracted number because mean/median/sum of 1 number is the same number.

sapply(stringr::str_extract_all(df$Years, '\\d+'),function(x) mean(as.numeric(x)))
#[1]  0.5  5.0 10.0 20.0  4.0  4.5 87.5  1.0

To do this using pipe :

library(dplyr)
library(stringr)
library(purrr)

df %>%
  mutate(NewYear = str_extract_all(Years, '\\d+'), 
         NewYear = map_dbl(NewYear, ~mean(as.numeric(.x))))


#                          Years NewYear
#1                   0 to 1 year     0.5
#2                         5 yrs     5.0
#3                  About 10 yrs    10.0
#4                        20 yrs    20.0
#5                         4 yrs     4.0
#6                       4-5 yrs     4.5
#7           75 to 100 YEARS old    87.5
#8 >1 yearsmispelled or whatever     1.0

Upvotes: 2

Chuck P
Chuck P

Reputation: 3923

Although I know solutions are available using other regular expressions and with nothing more than base tools, I will present a tidyverse solution. Trivial to substitute in many other numeric operators such as median, and this solution will work no matter how many numbers are found.

library(dplyr)
library(stringr)

df <- df %>% 
  mutate(
    NewYears = str_extract_all(Years, "[0-9]+") %>%
      purrr::map( 
        ~ifelse(
           length(.x) == 1, 
           as.numeric(.x), 
           mean(unlist(as.numeric(.x)))
        )
      )
  )

df
#>                           Years NewYears
#> 1                   0 to 1 year      0.5
#> 2                         5 yrs        5
#> 3                  About 10 yrs       10
#> 4                        20 yrs       20
#> 5                         4 yrs        4
#> 6                       4-5 yrs      4.5
#> 7           75 to 100 YEARS old     87.5
#> 8 >1 yearsmispelled or whatever        1

Upvotes: 4

Related Questions