Reputation: 21400
I have a dataframe like this:
df <- data.frame(
id = c("A","B"),
date = c("31/07/2019", "31/07/2020"),
x = c('random stuff "A":88876, more stuff',
'something, "A":1234, more "A":456, random "A":32078, more'),
stringsAsFactors = F
)
I'd like to create as many new columns as there are matches to a pattern; the pattern is (?<="A":)\\d+(?=,)
, i.e., "match the number if you see the string "A":
on the left and the comma ,
on the right.
The problems: (i) the number of matches may vary from row to row and (ii) the maximum number of new columns is not known in advance.
What I've done so far is this:
df[paste("A", 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))), sep = "")] <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')
While 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)')))
may solve the problem of unknown number of new columns, I get a warning:
`Warning message:
In `[<-.data.frame`(`*tmp*`, paste("A", 1:max(lengths(str_extract_all(df$x, :
replacement element 2 has 3 rows to replace 2 rows`
and the assignment of the values clearly incorrect:
df
id date x A1 A2 A3
1 A 31/07/2019 random stuff "A":88876, more stuff 88876 1234 88876
2 B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more 88876 456 88876
The correct output would be this:
df
id date x A1 A2 A3
1 A 31/07/2019 random stuff "A":88876, more stuff 88876 NA NA
2 B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more 1234 456 32078
Any idea?
Upvotes: 2
Views: 58
Reputation: 173813
Here's a somewhat pedestrian stringr
solution:
library(stringr)
library(dplyr)
matches <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')
ncols <- max(sapply(matches, length))
matches %>%
lapply(function(y) c(y, rep(NA, ncols - length(y)))) %>%
do.call(rbind, .) %>%
data.frame() %>%
setNames(paste0("A", seq(ncols))) %>%
cbind(df, .) %>%
tibble()
#> # A tibble: 2 x 6
#> id date x A1 A2 A3
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 A 31/07/20~ "random stuff \"A\":88876, more stuff" 88876 <NA> <NA>
#> 2 B 31/07/20~ "something, \"A\":1234, more \"A\":456, ran~ 1234 456 32078
Created on 2020-07-06 by the reprex package (v0.3.0)
Upvotes: 1