alexb523
alexb523

Reputation: 728

How to Duplicate Rows Based on Character in a String of Multiple Columns

I have a data frame like the below which contains commas in columns x & y:

df <- data.frame(var1=letters[1:5], var2=letters[6:10], var3=1:5, x=c('apple','orange,apple', 'grape','apple,orange,grape','cherry,peach'), y=c('wine', 'wine', 'juice', 'wine,beer,juice', 'beer,juice'))

df
  var1 var2 var3                  x               y
1    a    f    1              apple            wine
2    b    g    2       orange,apple            wine
3    c    h    3              grape           juice
4    d    i    4 apple,orange,grape wine,beer,juice
5    e    j    5       cherry,peach      beer,juice

What is the simplest way to get it to look like this:

dfnew                   
    var1    var2    var3    x       y
    a       f       1       apple   wine
    b       g       2       orange  wine
    b       g       2       apple   NA
    c       h       3       grape   juice
    d       i       4       apple   wine
    d       i       4       orange  beer
    d       i       4       grape   juice
    e       j       5       cherry  beer
    e       j       5       peach   juice

I have seen similar questions, however, while i am using 3 columns in my example, my real data has many. I need something that will take all the columns but x & y and replicate and then put the "," in tabular form like my desired outcome.

Upvotes: 2

Views: 875

Answers (2)

Uwe
Uwe

Reputation: 42544

In the original data.frame, there was a 1:1 relationship between the list elements in x and y in the same rows. So, after splitting there was the same number of elements in x and y. This "symmetric" structure allows us to split both columns simultaneously:

# original data.frame, "symmetric" data
df1 <- data.frame(var1=letters[1:5], var2=letters[6:10], var3=1:5, 
                  x=c('apple','orange,apple', 'grape','apple,orange,grape','cherry,peach'), 
                  y=c('wine', 'wine,beer', 'juice', 'wine,beer,juice', 'beer,juice'))

library(data.table)   # CRAN version 1.10.4 used
# define columns to be splitted
sp_col <- c("x", "y")
# define id columns
id_col <- paste0("var", 1:3)
# coerce to class data.table, 
# convert sp_col from factor to character which is required by strsplit(),
# then split up all columns _not_ used for grouping,
# turn the result into vectors, but for each column separately. 
setDT(df1)[, (sp_col) := lapply(.SD, as.character), .SDcols = sp_col][
  , unlist(lapply(.SD, strsplit, split = ",", fixed = TRUE), recursive = FALSE), by = id_col]

which yields

   var1 var2 var3      x     y
1:    a    f    1  apple  wine
2:    b    g    2 orange  wine
3:    b    g    2  apple  beer
4:    c    h    3  grape juice
5:    d    i    4  apple  wine
6:    d    i    4 orange  beer
7:    d    i    4  grape juice
8:    e    j    5 cherry  beer
9:    e    j    5  peach juice

Edit: With the edited data.frame, the OP has requested to fill up missing positions by NA which requires a different approach. For this melt() and dcast() are used.

# data.frame updated by OP, "unsymmetric" data
df2 <- data.frame(var1=letters[1:5], var2=letters[6:10], var3=1:5, 
                  x=c('apple','orange,apple', 'grape','apple,orange,grape','cherry,peach'), 
                  y=c('wine', 'wine', 'juice', 'wine,beer,juice', 'beer,juice'))

Please note the change in row 2 of column y.

library(data.table)   # CRAN version 1.10.4 used
# define columns to be splitted
sp_col <- c("x", "y")
# coerce to class data.table, add column with row numbers
# reshape from wide to long format
long <- melt(setDT(df2)[, rn := .I], measure.vars = sp_col)
# split value column, grouped by all other columns
# reshape from long to wide format where the rows are formed by
# an individual count by row number and variable + all other id cols,
# finally remove the row numbers as this is no longer needed
dcast(long[, strsplit(value, ",", fixed = TRUE), by = setdiff(names(long), "value")], 
      ... + rowid(rn, variable) ~ variable , value.var = "V1")[
        , rn := NULL][]

(thanks to @Jaap for suggesting improvements)

produces NAs requested:

   var1 var2 var3      x     y
1:    a    f    1  apple  wine
2:    b    g    2 orange  wine
3:    b    g    2  apple    NA
4:    c    h    3  grape juice
5:    d    i    4  apple  wine
6:    d    i    4 orange  beer
7:    d    i    4  grape juice
8:    e    j    5 cherry  beer
9:    e    j    5  peach juice

Upvotes: 3

Jaap
Jaap

Reputation: 83215

A solution in base R:

# split the 'x' & 'y' columns in lists
xl <- strsplit(as.character(df$x), ',')
yl <- strsplit(as.character(df$y), ',')

# get the maximum length of the strings for each row
reps <- pmax(lengths(xl), lengths(yl))

# replicate the rows of 'df' by the vector of maximum string lengths
df2 <- df[rep(1:nrow(df), reps), 1:3]

# add NA-values for when the length of the strings in 'df' is shorter than
# the maximum length (which is stored in the 'reps'-vector)
# unlist & add to 'df2'
df2$x <- unlist(mapply(function(x,y) c(x, rep(NA, y)), xl, reps - lengths(xl)))
df2$y <- unlist(mapply(function(x,y) c(x, rep(NA, y)), yl, reps - lengths(yl)))

which gives:

> df2
    var1 var2 var3      x     y
1      a    f    1  apple  wine
2      b    g    2 orange  wine
2.1    b    g    2  apple  <NA>
3      c    h    3  grape juice
4      d    i    4  apple  wine
4.1    d    i    4 orange  beer
4.2    d    i    4  grape juice
5      e    j    5 cherry  beer
5.1    e    j    5  peach juice

Upvotes: 2

Related Questions