Reputation: 728
I have a data frame like the below which contains commas in columns x
& y
:
df <- data.frame(var1=letters[1:5], var2=letters[6:10], var3=1:5, x=c('apple','orange,apple', 'grape','apple,orange,grape','cherry,peach'), y=c('wine', 'wine', 'juice', 'wine,beer,juice', 'beer,juice'))
df
var1 var2 var3 x y
1 a f 1 apple wine
2 b g 2 orange,apple wine
3 c h 3 grape juice
4 d i 4 apple,orange,grape wine,beer,juice
5 e j 5 cherry,peach beer,juice
What is the simplest way to get it to look like this:
dfnew
var1 var2 var3 x y
a f 1 apple wine
b g 2 orange wine
b g 2 apple NA
c h 3 grape juice
d i 4 apple wine
d i 4 orange beer
d i 4 grape juice
e j 5 cherry beer
e j 5 peach juice
I have seen similar questions, however, while i am using 3 columns in my example, my real data has many. I need something that will take all the columns but x
& y
and replicate and then put the "," in tabular form like my desired outcome.
Upvotes: 2
Views: 875
Reputation: 42544
In the original data.frame, there was a 1:1 relationship between the list elements in x
and y
in the same rows. So, after splitting there was the same number of elements in x
and y
. This "symmetric" structure allows us to split both columns simultaneously:
# original data.frame, "symmetric" data
df1 <- data.frame(var1=letters[1:5], var2=letters[6:10], var3=1:5,
x=c('apple','orange,apple', 'grape','apple,orange,grape','cherry,peach'),
y=c('wine', 'wine,beer', 'juice', 'wine,beer,juice', 'beer,juice'))
library(data.table) # CRAN version 1.10.4 used
# define columns to be splitted
sp_col <- c("x", "y")
# define id columns
id_col <- paste0("var", 1:3)
# coerce to class data.table,
# convert sp_col from factor to character which is required by strsplit(),
# then split up all columns _not_ used for grouping,
# turn the result into vectors, but for each column separately.
setDT(df1)[, (sp_col) := lapply(.SD, as.character), .SDcols = sp_col][
, unlist(lapply(.SD, strsplit, split = ",", fixed = TRUE), recursive = FALSE), by = id_col]
which yields
var1 var2 var3 x y
1: a f 1 apple wine
2: b g 2 orange wine
3: b g 2 apple beer
4: c h 3 grape juice
5: d i 4 apple wine
6: d i 4 orange beer
7: d i 4 grape juice
8: e j 5 cherry beer
9: e j 5 peach juice
Edit: With the edited data.frame, the OP has requested to fill up missing positions by NA which requires a different approach. For this melt()
and dcast()
are used.
# data.frame updated by OP, "unsymmetric" data
df2 <- data.frame(var1=letters[1:5], var2=letters[6:10], var3=1:5,
x=c('apple','orange,apple', 'grape','apple,orange,grape','cherry,peach'),
y=c('wine', 'wine', 'juice', 'wine,beer,juice', 'beer,juice'))
Please note the change in row 2 of column y
.
library(data.table) # CRAN version 1.10.4 used
# define columns to be splitted
sp_col <- c("x", "y")
# coerce to class data.table, add column with row numbers
# reshape from wide to long format
long <- melt(setDT(df2)[, rn := .I], measure.vars = sp_col)
# split value column, grouped by all other columns
# reshape from long to wide format where the rows are formed by
# an individual count by row number and variable + all other id cols,
# finally remove the row numbers as this is no longer needed
dcast(long[, strsplit(value, ",", fixed = TRUE), by = setdiff(names(long), "value")],
... + rowid(rn, variable) ~ variable , value.var = "V1")[
, rn := NULL][]
(thanks to @Jaap for suggesting improvements)
produces NAs requested:
var1 var2 var3 x y
1: a f 1 apple wine
2: b g 2 orange wine
3: b g 2 apple NA
4: c h 3 grape juice
5: d i 4 apple wine
6: d i 4 orange beer
7: d i 4 grape juice
8: e j 5 cherry beer
9: e j 5 peach juice
Upvotes: 3
Reputation: 83215
A solution in base R:
# split the 'x' & 'y' columns in lists
xl <- strsplit(as.character(df$x), ',')
yl <- strsplit(as.character(df$y), ',')
# get the maximum length of the strings for each row
reps <- pmax(lengths(xl), lengths(yl))
# replicate the rows of 'df' by the vector of maximum string lengths
df2 <- df[rep(1:nrow(df), reps), 1:3]
# add NA-values for when the length of the strings in 'df' is shorter than
# the maximum length (which is stored in the 'reps'-vector)
# unlist & add to 'df2'
df2$x <- unlist(mapply(function(x,y) c(x, rep(NA, y)), xl, reps - lengths(xl)))
df2$y <- unlist(mapply(function(x,y) c(x, rep(NA, y)), yl, reps - lengths(yl)))
which gives:
> df2 var1 var2 var3 x y 1 a f 1 apple wine 2 b g 2 orange wine 2.1 b g 2 apple <NA> 3 c h 3 grape juice 4 d i 4 apple wine 4.1 d i 4 orange beer 4.2 d i 4 grape juice 5 e j 5 cherry beer 5.1 e j 5 peach juice
Upvotes: 2