Reputation: 133
There are few issues I have encountered need help:
so I have a tsvfile like this
intermediate.tsv
experiment control par1 par2 par3
1 a_control 1 11 21
1 b_control 5 12 21
2 a_control 2 11 50
2 b_control 3 13 31
3 a_control 4 11 35
3 b_control 2 11 35
I read the tsvfile like this:
tsvfiles<- read.csv2('/path_location/intermediate.tsv', header=T, sep = '\t', stringsAsFactors = F)
convert the value to numeric
for(i in 3:ncol(tsvfiles)) {
tsvfiles[,i] <- as.numeric(tsvfiles[, i])
}
I realize when I am calling tsvfiles$control, I am only getting 'a', or 'b', which is not big deal but if someone can help me correcting that will be great
Since I just want to call par* column header, I set up a col variable
cols <- names(tsvfiles)[!names(tsvfiles) %in% c('experiment', 'control')]
Then, I want my first column to be like this:
final_data<- c('Description', 'a_control Mean', 'a_control sd', 'b_control Mean', 'b_control sd', 'plot')
And create a function
generate_table<-function(data, col){
a_mean = mean(data[[col]][tsvfiles$control == "a"])
b_mean = mean(data[[col]][tsvfiles$control == "b"])
a_sd = sd(data[[col]][tsvfiles$control == "a"])
b_sd = sd(data[[col]][tsvfiles$control == "b"])
p1 <- ggplot(data, aes(x=control, y= !!sym(col), color = control)) +
geom_violin() + geom_boxplot(width = 0.1) +
geom_jitter(shape = 16, colour = "black", alpha = 0.5, width = 0.2) +
scale_x_discrete(limits = rev(levels(as.factor(data$control)))) +
coord_flip()
column <- c(col, a_mean, a_sd, b_mean, b_sd)
return(column)
}
when I do lapply
lapply(cols, generate_table, data=tsvfiles)
I get each of this
"par1" "1.15285714285714" "0.0543270519302177" "1.2055" "0.0730879066964102"
"par2" "11.15285714285714" "1.0543270519302177" "12.2055" "3.0730879066964102"
"par3" "31.15285714285714" "5.0543270519302177" "21.2055" "2.0730879066964102"
Now here is the real question
how do I apply the output from lapply and add it to the final_data like this
Description par1 par2 par3
a_mean 1.15285714285714 11.15285714285714 31.15285714285714
a_sd 0.0543270519302177 1.0543270519302177 5.0543270519302177
b_mean 1.2055 12.2055 21.2055
b_sd 0.0730879066964102 3.0730879066964102 2.0730879066964102
This is before without even having a plot row I tried
final_data = cbind(final_data , lapply(cols, generate_table, data=tsvfiles))
and I can't get get final_data to have all the column from the function,
Thank you
Upvotes: 1
Views: 54
Reputation: 107567
Avoid looping with lapply
and consider aggregate
after reshaping your wide data to long format with reshape
. Long or tidy data is usually the preferred format in most data analytics facilitating aggregation, merge, append processes and plotting and modeling needs. Also, avoiding iteratively running plots and consider facet_wrap
on needed dimension.
# RESHAPE WIDE TO LONG
tsv_long_df <- reshape(tsvfiles, varying = names(tsvfiles)[3:ncol(tsvfiles)],
times = names(tsvfiles)[3:ncol(tsvfiles)],
v.names = "value", timevar = "par", ids = NULL,
new.row.names = 1:1E4, direction = "long")
head(tsv_long_df)
# experiment control par value
# 1 1 a_control par1 1
# 2 1 b_control par1 5
# 3 2 a_control par1 2
# 4 2 b_control par1 3
# 5 3 a_control par1 4
# 6 3 b_control par1 2
# AGGREGATE WITH MERGE FOR BOTH CONTROLS
agg_raw <- merge(aggregate(value ~ par, subset(tsv_long_df, control == "a_control"),
FUN=function(x) c(mean=mean(x), sd=sd(x))),
aggregate(value ~ par, subset(tsv_long_df, control == "b_control"),
FUN=function(x) c(mean=mean(x), sd=sd(x))),
by="par", suffixes=c("_a", "_b"))
agg_df <- do.call(data.frame, agg_raw)
Output
# DATA FRAME
agg_df <- setNames(agg_df, gsub("value_", "", names(agg_df)))
agg_df
# par a.mean a.sd b.mean b.sd
# 1 par1 2.333333 1.527525 3.333333 1.527525
# 2 par2 11.000000 0.000000 12.000000 1.000000
# 3 par3 35.333333 14.502873 29.000000 7.211103
# MATRIX (TRANSPOSED FROM ABOVE)
agg_mat <- setNames(data.frame(t(agg_df[-1])), agg_df$par)
agg_mat
# par1 par2 par3
# a.mean 2.333333 11 35.333333
# a.sd 1.527525 0 14.502873
# b.mean 3.333333 12 29.000000
# b.sd 1.527525 1 7.211103
Online Demo (using posted data)
Plot (one call using long data with facet_wrap
)
ggplot(tsv_long_df, aes(x=control, y=value, color = control)) +
geom_violin() + geom_boxplot(width = 0.1) +
geom_jitter(shape = 16, colour = "black", alpha = 0.5, width = 0.2) +
coord_flip() + facet_wrap(~par, ncol=1)
Upvotes: 2