user1701545
user1701545

Reputation: 6200

Splitting data frame rows with concatenated values

I have a data.frame that looks like this:

df <- data.frame(col1=c("a","b","c","d"), col2=c("1","1;2;3","5","3;2;5;5;3"), col3=c("0","1;1;0","0","0;0;1;1;0"))

#   col1      col2      col3
# 1    a         1         0
# 2    b     1;2;3     1;1;0
# 3    c         5         0
# 4    d 3;2;5;5;3 0;0;1;1;0

In words, some rows have columns with values concatenated by ";". Before reading in the data.frame I don't know which columns will contain the concatenated values but I do know that they will be the same ones for all those rows that have that. I also know that for a rows with columns with concatenated values the number of concatenated values is identical across all such columns (row 2 has 3 values in both col2 and col3, and row 4 has 5 values in these columns)

I would like to create a new data.frame where these concatenated values are split to separate rows. For those rows, the values in the columns that do not have concatenated values should be replicated by the number of concatenated values.

The resulting data.frame will be:

df <- data.frame(col1=c("a","b","b","b","c","d","d","d","d","d"), col2=c("1","1","2","3","5","3","2","5","5","3"), col3=c("0","1","1","0","0","0","0","1","1","0"))

#    col1 col2 col3
# 1     a    1    0
# 2     b    1    1
# 3     b    2    1
# 4     b    3    0
# 5     c    5    0
# 6     d    3    0
# 7     d    2    0
# 8     d    5    1
# 9     d    5    1
# 10    d    3    0

Upvotes: 2

Views: 76

Answers (3)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193527

This is the kind of data that I wrote my "splitstackshape" package for. You can use cSplit, like this:

library(splitstackshape)
cSplit(df, c("col2", "col3"), ";", "long")
#     col1 col2 col3
#  1:    a    1    0
#  2:    b    1    1
#  3:    b    2    1
#  4:    b    3    0
#  5:    c    5    0
#  6:    d    3    0
#  7:    d    2    0
#  8:    d    5    1
#  9:    d    5    1
# 10:    d    3    0

Upvotes: 1

Henry
Henry

Reputation: 6784

Not as sophisticated as rawr's answer, but perhaps easier to see what is happening

df1 <- data.frame(col1=c("a","b","c","d"), 
                  col2=c("1","1;2;3","5","3;2;5;5;3"), 
                  col3=c("0","1;1;0","0","0;0;1;1;0"),
                  stringsAsFactors=FALSE)

df1_rows   <- nrow(df1)
col1_split <- strsplit(df1$col1,";") 
col2_split <- strsplit(df1$col2,";") 
col3_split <- strsplit(df1$col3,";") 

df2 <- data.frame(col1=character(), 
                  col2=character(), 
                  col3=character(),
                  stringsAsFactors=FALSE) 

for (n in 1:df1_rows){ df2 <- rbind(df2, 
       data.frame(col1=col1_split[[n]],
                  col2=col2_split[[n]],
                  col3=col3_split[[n]], 
                  stringsAsFactors=FALSE))}

which gives

> df2 
   col1 col2 col3
1     a    1    0
2     b    1    1
3     b    2    1
4     b    3    0
5     c    5    0
6     d    3    0
7     d    2    0
8     d    5    1
9     d    5    1
10    d    3    0

Upvotes: 0

rawr
rawr

Reputation: 20811

here's one option

df <- data.frame(col1=c("a","b","c","d"), col2=c("1","1;2;3","5","3;2;5;5;3"), col3=c("0","1;1;0","0","0;0;1;1;0"))

df2 <- data.frame(col1=c("a","b","b","b","c","d","d","d","d","d"), col2=c("1","1","2","3","5","3","2","5","5","3"), col3=c("0","1","1","0","0","0","0","1","1","0"))


## reshape `col1` to make it look like the others
v <- Vectorize(gsub)
df$col1 <- v('\\b\\d\\b', df$col1, df$col2)

#        col1      col2      col3
# 1         a         1         0
# 2     b;b;b     1;2;3     1;1;0
# 3         c         5         0
# 4 d;d;d;d;d 3;2;5;5;3 0;0;1;1;0


## split on white space or `;` and coerce back into a data frame
data.frame(do.call('cbind', lapply(df, function(x)
  unlist(strsplit(as.character(x), '[\\s;]')))))

#    col1 col2 col3
# 1     a    1    0
# 2     b    1    1
# 3     b    2    1
# 4     b    3    0
# 5     c    5    0
# 6     d    3    0
# 7     d    2    0
# 8     d    5    1
# 9     d    5    1
# 10    d    3    0

Upvotes: 2

Related Questions