Trent

Reputation: 813

Extract all matches to a new column using regex in R

In my data, I have a column of open text field data that resembles the following sample:

library(tidyverse)  # provides tribble(), %>%, mutate() and str_extract()

d <- tribble(
  ~x,
  "i am 10 and she is 50",
  "he is 32 and i am 22",
  "he may be 70 and she may be 99",
)

I would like to use regex to extract all two-digit numbers to a new column called y. The following code works, but it extracts only the first match:

d %>%
  mutate(y = str_extract(x, "([0-9]{2})"))

# A tibble: 3 x 2
  x                              y    
  <chr>                          <chr>
1 i am 10 and she is 50          10   
2 he is 32 and i am 22           32   
3 he may be 70 and she may be 99 70 

Is there a way to extract both two-digit numbers into the same column, joined by a standard separator (e.g., a comma)?

Upvotes: 3

Views: 2286

Answers (2)

acylam

Reputation: 18701

We can also use extract and unite from tidyr. Note that this pattern has two capture groups, so it captures exactly two numbers per row:

library(dplyr)
library(tidyr)

d %>%
  extract(x, c('y', 'z'), regex = "(\\d+)[^\\d]+(\\d+)", remove = FALSE) 

Output:

# A tibble: 3 x 3
  x                              y     z    
  <chr>                          <chr> <chr>
1 i am 10 and she is 50          10    50   
2 he is 32 and i am 22           32    22   
3 he may be 70 and she may be 99 70    99 

Return single column:

d %>%
  extract(x, c('y', 'z'), regex = "(\\d+)[^\\d]+(\\d+)", remove = FALSE) %>%
  unite('y', y, z, sep = ', ')

Output:

# A tibble: 3 x 2
  x                              y     
  <chr>                          <chr> 
1 i am 10 and she is 50          10, 50
2 he is 32 and i am 22           32, 22
3 he may be 70 and she may be 99 70, 99
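
One caveat worth checking, as a minimal sketch: because the pattern needs two numbers to match, a row with only one number should not match at all, and extract then fills both new columns with NA. The tibble d2 below is a hypothetical example, not part of the original data.

```r
library(tibble)
library(tidyr)

# Hypothetical data: the second row contains only one number
d2 <- tribble(
  ~x,
  "i am 10 and she is 50",
  "she is 50"
)

# Same pattern as above; rows that do not match get NA in y and z
res <- extract(d2, x, c("y", "z"),
               regex = "(\\d+)[^\\d]+(\\d+)", remove = FALSE)

res$y        # first number per row, NA where the pattern did not match
is.na(res$z) # FALSE for the first row, TRUE for the second
```

If rows can hold a varying number of matches, an approach that collects all matches per row (such as str_extract_all) is likely a better fit.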

Upvotes: 4

akrun

Reputation: 887971

We can use str_extract_all instead of str_extract: str_extract returns only the first match, whereas str_extract_all returns every match, as a list with one element per input string. That list column can then be spread into two columns with unnest_wider:

library(dplyr)
library(tidyr)
library(stringr)
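d %>%
    mutate(out = str_extract_all(x, "\\d{2}")) %>%
    # names_sep names the unnamed list elements (out_1, out_2);
    # newer tidyr versions may refuse to unnest unnamed elements without it
    unnest_wider(out, names_sep = "_") %>%
    rename_at(-1, ~ c('y', 'z')) %>%
    type.convert(as.is = TRUE)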
# A tibble: 3 x 3
# x                                  y     z
#  <chr>                          <int> <int>
#1 i am 10 and she is 50             10    50
#2 he is 32 and i am 22              32    22
#3 he may be 70 and she may be 99    70    99
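
To make the difference between the two functions concrete, here is a small sketch on a plain character vector (the vector x and the names first and all_m are made up for illustration). It shows that str_extract returns a character vector, while str_extract_all returns a list whose elements can have different lengths.

```r
library(stringr)

# Illustrative input: the second string has only one two-digit number
x <- c("i am 10 and she is 50", "he is 32")

# First match per string, returned as a character vector
first <- str_extract(x, "\\d{2}")
first
# [1] "10" "32"

# All matches per string, returned as a list of character vectors
all_m <- str_extract_all(x, "\\d{2}")
lengths(all_m)
# [1] 2 1
```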

If we instead need a single string column with ", " as the separator, extract the matches into a list, loop over that list with map_chr, and collapse each element into one string with toString (a wrapper for paste(x, collapse = ", ")):

library(purrr)
d %>%
   mutate(y = str_extract_all(x, "\\b\\d{2}\\b") %>%
                 map_chr(toString))
# A tibble: 3 x 2
#  x                              y     
#  <chr>                          <chr> 
#1 i am 10 and she is 50          10, 50
#2 he is 32 and i am 22           32, 22
#3 he may be 70 and she may be 99 70, 99
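
The two snippets above use slightly different patterns, "\\d{2}" and "\\b\\d{2}\\b". A short sketch of how they differ, using a made-up string s that contains a three-digit number: without word boundaries, the first two digits of a longer number are also matched.

```r
library(stringr)

# Illustrative input containing a three-digit and a two-digit number
s <- "room 123 holds 45 people"

# Without boundaries: also picks up "12" from "123"
loose <- str_extract_all(s, "\\d{2}")[[1]]
loose
# [1] "12" "45"

# With \\b on both sides: only standalone two-digit numbers
strict <- str_extract_all(s, "\\b\\d{2}\\b")[[1]]
strict
# [1] "45"
```

For the sample data in the question both patterns give the same result, since every number there has exactly two digits.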

Upvotes: 3
