S MA
S MA

Reputation: 25

R: Separate a column into rows with tricky separators

I am looking to separate a column containing text data into 2 columns, but the separator management is quite tricky and I am convinced there is a regex solution, but not well versed in it to find a way. The dataset sample is:

Obs           Message
1       "a : 3 b : 5"
2       "c : 4 a : 2 d : 9"
3       ""
4       "b : 3"

Data chunks are separated by spaces, and variables / values are separated by " : "

my attempt at doing this:

library (tidyr)
data %>%  separate(Message, sep= " : ", into = c("variable","value"))

>
Obs variable value
1      1        a   3 b
2      2        c   4 a
3      3           <NA>
4      4        b     3

needs extra steps as the variable length of the message throws off the logic.

If someone please take a look and let me know if any regex (or other approach) would help. Appreciate your input on this.

edit: adding expected output:

Obs Variable Value

1    "a"      3    
1    "b"      5    
2    "c"      4    
2    "a"      2    
2    "d"      9    
3    ""       ""   
4    "b"      3

Upvotes: 2

Views: 57

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389325

You can use separate_rows + separate.

library(dplyr)
library(tidyr)

df %>%
  separate_rows(Message, sep = '\\s(?=[a-z])') %>%
  separate(Message, c('variable', 'value'), sep = ' : ', fill = 'right', convert = TRUE)

#    Obs variable value
#  <int> <chr>    <int>
#1     1 "a"          3
#2     1 "b"          5
#3     2 "c"          4
#4     2 "a"          2
#5     2 "d"          9
#6     3 ""          NA
#7     4 "b"          3

data

df <- structure(list(Obs = 1:4, Message = c("a : 3 b : 5", "c : 4 a : 2 d : 9", 
"", "b : 3")), class = "data.frame", row.names = c(NA, -4L))

Upvotes: 3

Related Questions