Reputation: 145
I am trying to use extract
from the tidyr package to split a column, which contains a single string, into 3 separate columns. A minimum working example of this problem is given as:
# A tibble: 3 x 1
question_codes
<chr>
1 QBL10, QF14, QM20
2 QBL1, QF4
3 QM10
Where question codes QBL, QF and QM correspond to baseline, fortnightly and monthly questions. I would like to separate these each into separate columns, filling with NA where no code exists. The result should look like:
# A tibble: 3 x 4
question_codes Base Fortnightly Monthly
<chr> <chr> <chr> <chr>
1 QBL10, QF14, QM20 QBL10 QF14 QM20
2 QBL1, QF4 QBL1 QF4 NA
3 QM10 NA NA QM10
Unfortunately I am unable to get the regex to work correctly for this. I have tried using
df %>% extract(question_codes, into = c("Base", "Fortnightly", "Monthly"), "^(QBL[1-9].*)?, (QF[1-9].*)?, (QM[1-9].*)?", remove = FALSE)
but this doesn't correctly specify optional groups and results in:
# A tibble: 3 x 4
question_codes Base Fortnightly Monthly
<chr> <chr> <chr> <chr>
1 QBL10, QF14, QM20 QBL10 QF14 QM20
2 QBL1, QF4 NA NA NA
3 QM10 NA NA NA
I am very much a regex novice so any help is much appreciated!
Upvotes: 2
Views: 209
Reputation: 11584
Another approach using str_extract
.
df %>% mutate(Base = str_extract(question_codes, 'QBL\\d+'), Fortnightly = str_extract(question_codes, 'QF\\d+'), Monthly = str_extract(question_codes, 'QM\\d+'))
question_codes Base Fortnightly Monthly
1 QBL10, QF14, QM20 QBL10 QF14 QM20
2 QBL1, QF4 QBL1 QF4 <NA>
3 QM10 <NA> <NA> QM10
Upvotes: 1
Reputation: 79288
use the pattern:
pat = '(?:(QBL\\d+),)?\\s*(?:(QF\\d+),)?\\s*(QM\\d+)?'
Here all the groups are optional. ie we use ?
on each group so as to either have a 0 or 1 match
df %>%
extract(question_codes, c('a', 'b', 'c'), regex = pat,remove = FALSE)
question_codes B F M
1 QBL10, QF14, QM20 QBL10 QF14 QM20
2 QBL1, QF4 QBL1
3 QM10 QM10
Upvotes: 4