How to Duplicate Rows Based on Character in a String of Multiple Columns

Question

I have a data frame like the one below, in which columns x and y contain comma-separated values:

df <- data.frame(var1 = letters[1:5], var2 = letters[6:10], var3 = 1:5,
                 x = c('apple', 'orange,apple', 'grape', 'apple,orange,grape', 'cherry,peach'),
                 y = c('wine', 'wine', 'juice', 'wine,beer,juice', 'beer,juice'))

df
  var1 var2 var3                  x               y
1    a    f    1              apple            wine
2    b    g    2       orange,apple            wine
3    c    h    3              grape           juice
4    d    i    4 apple,orange,grape wine,beer,juice
5    e    j    5       cherry,peach      beer,juice

What is the simplest way to get it to look like this:

dfnew                   
    var1    var2    var3    x       y
    a       f       1       apple   wine
    b       g       2       orange  wine
    b       g       2       apple   NA
    c       h       3       grape   juice
    d       i       4       apple   wine
    d       i       4       orange  beer
    d       i       4       grape   juice
    e       j       5       cherry  beer
    e       j       5       peach   juice

I have seen similar questions. However, while I am using 3 other columns in my example, my real data has many. I need something that replicates every column except x & y and splits the comma-separated values into separate rows, as in my desired outcome.

Jaap · Accepted Answer

A solution in base R:

# split the 'x' & 'y' columns into lists of character vectors
xl <- strsplit(as.character(df$x), ',')
yl <- strsplit(as.character(df$y), ',')

# get, for each row, the larger of the two element counts
reps <- pmax(lengths(xl), lengths(yl))

# replicate each row of 'df' as many times as given in 'reps' (columns 1 to 3 only)
df2 <- df[rep(1:nrow(df), reps), 1:3]

# pad each split vector with NA values when it has fewer elements than
# that row's maximum (which is stored in the 'reps'-vector),
# then unlist & add to 'df2'
df2$x <- unlist(mapply(function(x,y) c(x, rep(NA, y)), xl, reps - lengths(xl)))
df2$y <- unlist(mapply(function(x,y) c(x, rep(NA, y)), yl, reps - lengths(yl)))

which gives:

> df2
    var1 var2 var3      x     y
1      a    f    1  apple  wine
2      b    g    2 orange  wine
2.1    b    g    2  apple  <NA>
3      c    h    3  grape juice
4      d    i    4  apple  wine
4.1    d    i    4 orange  beer
4.2    d    i    4  grape juice
5      e    j    5 cherry  beer
5.1    e    j    5  peach juice
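
Since the question mentions that the real data has many columns, the same steps can be wrapped so that the columns to keep are not hard-coded as 1:3. The following is only a sketch of that idea: the function name split_rows and its arguments cols and sep are made up for illustration and are not part of the answer above. It assumes the listed columns hold delimiter-separated text.

```r
# Hypothetical helper (name and arguments are illustrative only):
# expand rows for any set of delimited columns, keeping all other columns.
split_rows <- function(df, cols, sep = ",") {
  # split each target column into a list of character vectors
  parts <- lapply(df[cols], function(col) {
    strsplit(as.character(col), sep, fixed = TRUE)
  })

  # per-row maximum number of pieces across the target columns
  reps <- do.call(pmax, unname(lapply(parts, lengths)))

  # replicate the remaining columns; drop = FALSE keeps a data frame
  # even when only one column is left
  keep <- setdiff(names(df), cols)
  out <- df[rep(seq_len(nrow(df)), reps), keep, drop = FALSE]

  # pad each piece vector with NA up to the row maximum, then flatten
  for (nm in cols) {
    padded <- Map(function(p, n) c(p, rep(NA_character_, n - length(p))),
                  parts[[nm]], reps)
    out[[nm]] <- unlist(padded, use.names = FALSE)
  }
  out
}

# usage with the data from the question
df <- data.frame(var1 = letters[1:5], var2 = letters[6:10], var3 = 1:5,
                 x = c('apple', 'orange,apple', 'grape', 'apple,orange,grape', 'cherry,peach'),
                 y = c('wine', 'wine', 'juice', 'wine,beer,juice', 'beer,juice'))
res <- split_rows(df, c("x", "y"))
```

Selecting the columns to keep with setdiff() means any number of other columns is carried along unchanged. One caveat of this sketch: a row in which every listed column is an empty string splits into zero pieces and is therefore dropped.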
