Split sequence dataframe in R

I have a data frame of sequences like this:

dput(df)
structure(list(val = structure(c(3L, 2L, 4L, 1L, 5L, 6L), .Label = c("{36415},{36415}", 
                           "{36415},{85610}", "{36415},{9904}", "{85025,36415}", "{85610},{36415}", 
                           "{8872},{36415}"), class = "factor")), .Names = "val", row.names = c(NA, 
                                                                                                -6L), class = "data.frame")

df
              val
1  {36415},{9904}
2 {36415},{85610}
3   {85025,36415}
4 {36415},{36415}
5 {85610},{36415}
6  {8872},{36415}

Note the 3rd row above. The 1st row describes a sequence in which item 1 is followed by item 2, each in a different row. The 3rd row says that items 1 and 2 belong to the same row of the sequence.

I want to break this data frame into columns like this:

col1        col2
36415       9904
36415       85610
85025,36415 NA
36415       36415
...

Notice that the 3rd row keeps both items together in col1, with NA in col2.

Is there any way to achieve this?

Upvotes: 1

Answers (3)

acylam

Reputation: 18691

Here's a one-liner with extract from tidyr. This uses capture groups to specify the column patterns:

library(tidyr)

extract(df, "val", c("col1", "col2"), regex = "\\{([\\d,]+)(?:\\},\\{)?([\\d,]+)?\\}")

Or with str_match from stringr, which uses the exact same regex:

library(stringr)

data.frame(str_match(df$val, "\\{([\\d,]+)(?:\\},\\{)?([\\d,]+)?\\}")[,-1])

Results (extract first, then str_match):

         col1  col2
1       36415  9904
2       36415 85610
3 85025,36415  <NA>
4       36415 36415
5       85610 36415
6        8872 36415

           X1    X2
1       36415  9904
2       36415 85610
3 85025,36415  <NA>
4       36415 36415
5       85610 36415
6        8872 36415
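
To see what each capture group picks up, here is a minimal base R sketch that assembles the same regex piece by piece and applies it to two sample strings from the question. The names x, pattern, m and groups are illustrative only, and regexec/regmatches stand in for the tidyr and stringr calls above:

```r
# Two representative inputs from the question
x <- c("{36415},{9904}", "{85025,36415}")

# Same pattern as above, assembled piece by piece
pattern <- paste0(
  "\\{",           # opening brace of the first item
  "([\\d,]+)",     # group 1: digits and commas inside the first braces
  "(?:\\},\\{)?",  # optional, non-capturing "},{" separator
  "([\\d,]+)?",    # group 2 (optional): contents of the second braces
  "\\}"            # final closing brace
)

# regexec() reports the full match followed by each capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Keep only the two capture groups, one row per input string
groups <- t(sapply(m, function(g) g[2:3]))
colnames(groups) <- c("col1", "col2")

# An optional group that did not take part in the match comes back empty
groups[!nzchar(groups)] <- NA
groups
```

For the second string, group 1 greedily consumes "85025,36415" because commas are allowed inside the character class, so there is nothing left for group 2 and it ends up as NA.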

Upvotes: 1

www

Reputation: 39174

A solution using dplyr and tidyr. We can separate the column on the },{ delimiter and then remove any remaining { or }.

library(dplyr)
library(tidyr)
df2 <- df %>%
  separate(val, into = c("col1", "col2"), sep = "\\},\\{", fill = "right") %>%
  mutate(across(everything(), ~ gsub("\\{|\\}", "", .x)))  # funs() is deprecated in current dplyr
df2
#          col1  col2
# 1       36415  9904
# 2       36415 85610
# 3 85025,36415  <NA>
# 4       36415 36415
# 5       85610 36415
# 6        8872 36415
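
The same split-then-strip idea can also be sketched in base R without any packages. This is only an illustration under the assumption that each value holds at most two brace-delimited items; val, parts, padded and out are hypothetical names, and the input is a plain character vector rather than the factor column from the question:

```r
# Sample values from the question, as a plain character vector
val <- c("{36415},{9904}", "{85025,36415}", "{8872},{36415}")

# Step 1: split each string on the literal "},{" separator
parts <- strsplit(val, "},{", fixed = TRUE)

# Step 2: force every result to length 2; a missing second piece becomes NA
padded <- lapply(parts, function(p) p[1:2])

# Step 3: bind the pieces into a two-column data frame
out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(out) <- c("col1", "col2")

# Step 4: strip the leftover braces (gsub leaves NA values as NA)
out[] <- lapply(out, gsub, pattern = "[{}]", replacement = "")
out
```

With the factor column from the question, df$val would need as.character() first, since strsplit only accepts character input.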

Upvotes: 1

Gregor Thomas

Reputation: 146070

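library(tidyr)
# Split on the "},{" delimiter; rows without it get NA in col2
df = separate(df, col = val, into = c("col1", "col2"), sep = "\\},\\{", fill = "right")
# Strip the remaining braces from every column, keeping the data frame structure
df[] = lapply(df, gsub, pattern = "\\{|\\}", replacement = "")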
df
#          col1  col2
# 1       36415  9904
# 2       36415 85610
# 3 85025,36415  <NA>
# 4       36415 36415
# 5       85610 36415
# 6        8872 36415

Upvotes: 2
