Oska Fentem
Oska Fentem

Reputation: 145

Tidyr separate with optional groups

I am trying to use extract from the tidyr package to split a column, which contains a single string, into 3 separate columns. A minimum working example of this problem is given as:

# A tibble: 3 x 1
  question_codes   
  <chr>            
1 QBL10, QF14, QM20
2 QBL1, QF4        
3 QM10  

Where question codes QBL, QF and QM correspond to baseline, fortnightly and monthly questions. I would like to separate these each into separate columns, filling with NA where no code exists. The result should look like:

# A tibble: 3 x 4
  question_codes    Base  Fortnightly Monthly
  <chr>             <chr> <chr>       <chr>  
1 QBL10, QF14, QM20 QBL10 QF14        QM20   
2 QBL1, QF4         QBL1  QF4         NA     
3 QM10              NA    NA          QM10 

Unfortunately I am unable to get the regex to work correctly for this. I have tried using df %>% extract(question_codes, into = c("Base", "Fortnightly", "Monthly"), "^(QBL[1-9].*)?, (QF[1-9].*)?, (QM[1-9].*)?", remove = FALSE)

but this doesn't correctly specify optional groups and results in:

# A tibble: 3 x 4
  question_codes    Base  Fortnightly Monthly
  <chr>             <chr> <chr>       <chr>  
1 QBL10, QF14, QM20 QBL10 QF14        QM20   
2 QBL1, QF4         NA    NA          NA     
3 QM10              NA    NA          NA 

I am very much a regex novice so any help is much appreciated!

Upvotes: 2

Views: 209

Answers (2)

Karthik S
Karthik S

Reputation: 11584

Another approach using str_extract.

df %>% mutate(Base = str_extract(question_codes, 'QBL\\d+'), Fortnightly = str_extract(question_codes, 'QF\\d+'), Monthly = str_extract(question_codes, 'QM\\d+'))
     question_codes  Base Fortnightly Monthly
1 QBL10, QF14, QM20 QBL10        QF14    QM20
2         QBL1, QF4  QBL1         QF4    <NA>
3              QM10  <NA>        <NA>    QM10

Upvotes: 1

Onyambu
Onyambu

Reputation: 79288

use the pattern:

 pat = '(?:(QBL\\d+),)?\\s*(?:(QF\\d+),)?\\s*(QM\\d+)?'

Here all the groups are optional. ie we use ? on each group so as to either have a 0 or 1 match


df %>%
     extract(question_codes, c('a', 'b', 'c'), regex = pat,remove = FALSE)

    question_codes     B    F    M
1 QBL10, QF14, QM20 QBL10 QF14 QM20
2         QBL1, QF4  QBL1          
3              QM10            QM10

Upvotes: 4

Related Questions