Reputation: 7832
I am looking for a solution to expand a large data frame in R into more columns and more rows, given the values of a given column.
Right now I am doing this using a for-loop approach but I am sure there are more crantastic/efficient ways to achieve the same results...
The example will make the question more clear I think. Let us imagine we have a data frame containing student's information about their grades at three different stages in life. The student IDs are s1, s2 and s3; and we have measurements of their grades in three different times in their life, m1, m2, and m3; and then at each stage we have a column called more.info with their grades in their courses, encoded as class#topic#grade across all classes taken.
library(stringr)
options(stringsAsFactors=FALSE)
example.df = data.frame(measure.id = c("m1", "m2", "m3", "m2", "m2", "m3", "m1", "m1", "m3"),
student.id = c("s1", "s1", "s1", "s2", "s3", "s3", "s2", "s3", "s2"),
more.info = c("draw#drawing#4.5;music#singing#5.6;dance#ballet#6.7", "bio#biology#5.6;math#algebra#4.5", "calculus#univariate#6.2; physics#quantum#4.5;chemistry#organic#4.5",
"bio#biology#5.6;math#algebra#4.5", "bio#biology#3.6;math#algebra#3.5", "calculus#univariate#5.2; physics#quantum#5.2;chemistry#organic#4", "draw#drawing#5;music#singing#5.6;dance#ballet#5.7",
"draw#drawing#2.5;music#singing#3.6;dance#ballet#4", "calculus#univariate#5.2; physics#quantum#6.5;chemistry#organic#5"))
measure.ids = unique(example.df$measure.id)
Then, I'd like to create a new data frame that splits the more.info information and creates a new data frame with more rows and more columns as follows,
new.df=data.frame()
splitit <- function(x){
strsplit(x, '#')
}
for(i in 1:length(measure.ids)){
measure.id = measure.ids[i]
tmp = example.df[example.df==measure.id,]
more.info = tmp$more.info
more.info = strsplit(more.info,";")
student.ids = tmp$student.id
for(j in 1:length(more.info))
{
info = more.info[[j]]
a = sapply(info, splitit)
b = sapply(a, "[[", 1)
d = sapply(a, "[[", 2)
e = sapply(a, "[[", 3)
new.df = rbind(new.df,
data.frame(measure.id = rep(measure.id, length(info)),
student.id = rep(tmp$student.id[j], length(info)),
class = b,
topic = d,
grade = e)
)
}
}
What'd be the most efficient way to achieve this in R? I am open to apply functions, map/reduce approaches, mclapply for using more cores, etc...
Upvotes: 2
Views: 255
Reputation: 193497
Perhaps you can try out two functions that I'm written, concat.split.DT
and cSplit
. Both are presently available as GitHub Gists which can easily be loaded with the "devtools" package.
library(devtools)
source_gist(6873058) # for concat.split.DT
source_gist(11380733) # for cSplit
concat.split.DT(cSplit(example.df, splitCols="more.info", sep=";", direction="long"),
splitcols="more.info", sep="#")
# measure.id student.id more.info_1 more.info_2 more.info_3
# 1: m1 s1 draw drawing 4.5
# 2: m1 s1 music singing 5.6
# 3: m1 s1 dance ballet 6.7
# 4: m2 s1 bio biology 5.6
# 5: m2 s1 math algebra 4.5
# 6: m3 s1 calculus univariate 6.2
# 7: m3 s1 physics quantum 4.5
# 8: m3 s1 chemistry organic 4.5
# 9: m2 s2 bio biology 5.6
# 10: m2 s2 math algebra 4.5
# 11: m2 s3 bio biology 3.6
# 12: m2 s3 math algebra 3.5
# 13: m3 s3 calculus univariate 5.2
# 14: m3 s3 physics quantum 5.2
# 15: m3 s3 chemistry organic 4.0
# 16: m1 s2 draw drawing 5.0
# 17: m1 s2 music singing 5.6
# 18: m1 s2 dance ballet 5.7
# 19: m1 s3 draw drawing 2.5
# 20: m1 s3 music singing 3.6
# 21: m1 s3 dance ballet 4.0
# 22: m3 s2 calculus univariate 5.2
# 23: m3 s2 physics quantum 6.5
# 24: m3 s2 chemistry organic 5.0
# measure.id student.id more.info_1 more.info_2 more.info_3
The column names
can be easily changed later on with setnames
.
Upvotes: 1
Reputation: 8021
Here is another approach using data.table
.
Basically, I put the entire data transformation procedure in one line.
# Load R package
library(data.table)
# Convert to data.table object
example.dt <- as.data.table(example.df)
# Transform the data
final.dt <- example.dt[, data.table(do.call(rbind, unlist(lapply(strsplit(x=more.info, split=";"), strsplit, "#"), recursive=FALSE))), by=c("measure.id", "student.id")]
# Rename variables
setnames(final.dt, old=c("V1", "V2", "V3"), new=c("class", "topic", "grade"))
# > final.dt
# measure.id student.id class topic grade
# 1: m1 s1 draw drawing 4.5
# 2: m1 s1 music singing 5.6
# 3: m1 s1 dance ballet 6.7
# 4: m2 s1 dance drawing 5.6
# 5: m2 s1 draw ballet 4.5
# 6: m3 s1 draw singing 5.6
# 7: m3 s1 dance drawing 4.5
# 8: m3 s1 music ballet 4.5
# 9: m2 s2 dance drawing 5.6
# 10: m2 s2 draw ballet 4.5
# 11: m2 s3 dance drawing 5.6
# 12: m2 s3 draw ballet 4.5
# 13: m3 s3 draw singing 5.6
# 14: m3 s3 dance drawing 5.6
# 15: m3 s3 music ballet 4.5
# 16: m1 s2 draw drawing 4.5
# 17: m1 s2 music singing 5.6
# 18: m1 s2 dance ballet 6.7
# 19: m1 s3 draw drawing 4.5
# 20: m1 s3 music singing 5.6
# 21: m1 s3 dance ballet 6.7
# 22: m3 s2 draw singing 5.6
# 23: m3 s2 dance drawing 6.7
# 24: m3 s2 music ballet 4.5
# measure.id student.id class topic grade
Upvotes: 2
Reputation: 4615
This answer has some approaches that might be useful for speeding up your code (e.g mclapply
and the data.table
package).
require("data.table")
require("parallel")
require("plyr")
#Note the mclapply function. If you
#are running Mac or Linux, this should be more efficient for you
list.of.dfs <- mclapply(strsplit(example.df$more.info, "; |#|;"),FUN=function(x) as.data.frame(t(x)),mc.cores=1)
combined.df <- rbind.fill(list.of.dfs)
#Use data.table for speed and efficiency.
#example.df <- data.table(cbind(example.df,combined.df))
example.df <- data.table(example.df)
example.df[,paste0(c("class","topic","grade"),
c(rep(1,3),rep(2,3),rep(3,3))):=lapply(combined.df,I)]
#delete unnecessary column
example.df[,more.info:=NULL]
#rbindlist final table (efficient way to rbind)
table1 <- example.df[,list(measure.id,student.id,class=class1,topic=topic1,grade=grade1)]
table2 <- example.df[,list(measure.id,student.id,class=class2,topic=topic2,grade=grade2)]
table3 <- example.df[,list(measure.id,student.id,class=class3,topic=topic3,grade=grade3)]
#final results
final.table <- rbindlist(list(table1,table2,table3))[!is.na(class)]
final.table
Upvotes: 1
Reputation: 4123
Solution with base functions:
# split column by all available separators
a <- strsplit(example.df$more.info, "; |#|;")
# represent each result as a matrix with 3 columns
a <- lapply(a, function(v) matrix(v, ncol=3, byrow=TRUE))
# combine all matrixes in one big matrix
aa <- do.call(rbind, a)
# create indices of rows of initial data.frame which corresponds to the created big matrix
b <- unlist(sapply(seq_along(a), function(i) rep(i, nrow(a[[i]]))))
# combine initial data.frame and created big matrix
df <- cbind(example.df[b,], aa)
# remove unnecessary columns and rename remaining ones
df <- df[,-3]
colnames(df)[3:5] <- c("class", "topic", "grade")
To increase the speed you may replace all functions of apply
family in my code with mclapply
.
I cannot compare the speed since your dataset is very small.
Upvotes: 3