Chris Ruehlemann
Chris Ruehlemann

Reputation: 21400

How to separate rows into columns based on variable number of pattern matches per row

I have a dataframe like this:

df <- data.frame(
  id = c("A","B"),
  date = c("31/07/2019", "31/07/2020"),
  x = c('random stuff "A":88876, more stuff',
         'something, "A":1234, more "A":456, random "A":32078, more'),
  stringsAsFactors = F
)

I'd like to create as many new columns as there are matches to a pattern; the pattern is (?<="A":)\\d+(?=,), i.e., "match the number if you see the string "A":on the left and the comma ,on the right.

The problems: (i) the number of matches may vary from row to row and (ii) the maximum number of new columns is not known in advance.

What I've done so far is this:

df[paste("A", 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))), sep = "")] <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')

While 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))) may solve the problem of unknown number of new columns, I get a warning:

`Warning message:
In `[<-.data.frame`(`*tmp*`, paste("A", 1:max(lengths(str_extract_all(df$x,  :
  replacement element 2 has 3 rows to replace 2 rows`

and the assignment of the values clearly incorrect:

df
  id       date                                                         x    A1   A2    A3
1  A 31/07/2019                        random stuff "A":88876, more stuff 88876 1234 88876
2  B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more 88876  456 88876

The correct output would be this:

df
  id       date                                                         x    A1   A2    A3
1  A 31/07/2019                        random stuff "A":88876, more stuff 88876   NA    NA
2  B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more  1234  456 32078

Any idea?

Upvotes: 2

Views: 58

Answers (1)

Allan Cameron
Allan Cameron

Reputation: 173813

Here's a somewhat pedestrian stringr solution:

library(stringr)
library(dplyr)

matches <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')
ncols   <- max(sapply(matches, length))

matches %>%
  lapply(function(y)  c(y, rep(NA, ncols - length(y)))) %>%
  do.call(rbind, .) %>%
  data.frame() %>%
  setNames(paste0("A", seq(ncols))) %>%
  cbind(df, .) %>%
  tibble()
#> # A tibble: 2 x 6
#>   id    date      x                                            A1    A2    A3   
#>   <chr> <chr>     <chr>                                        <chr> <chr> <chr>
#> 1 A     31/07/20~ "random stuff \"A\":88876, more stuff"       88876 <NA>  <NA> 
#> 2 B     31/07/20~ "something, \"A\":1234, more \"A\":456, ran~ 1234  456   32078

Created on 2020-07-06 by the reprex package (v0.3.0)

Upvotes: 1

Related Questions