Norman Kuo
Norman Kuo

Reputation: 133

How to add columns to a dataframe through lapply

There are few issues I have encountered need help:

so I have a tsvfile like this

intermediate.tsv
experiment control   par1 par2 par3
1          a_control  1    11   21 
1          b_control  5    12   21
2          a_control  2    11   50 
2          b_control  3    13   31
3          a_control  4    11   35
3          b_control  2    11   35

I read the tsvfile like this:

tsvfiles<- read.csv2('/path_location/intermediate.tsv', header=T, sep = '\t', stringsAsFactors = F)

convert the value to numeric

for(i in 3:ncol(tsvfiles)) {
  tsvfiles[,i] <- as.numeric(tsvfiles[, i])
}

I realize when I am calling tsvfiles$control, I am only getting 'a', or 'b', which is not big deal but if someone can help me correcting that will be great

Since I just want to call par* column header, I set up a col variable

cols <- names(tsvfiles)[!names(tsvfiles) %in% c('experiment', 'control')]

Then, I want my first column to be like this:

final_data<- c('Description', 'a_control Mean', 'a_control sd',  'b_control Mean', 'b_control sd', 'plot')

And create a function

generate_table<-function(data, col){

  a_mean = mean(data[[col]][tsvfiles$control == "a"])
  b_mean = mean(data[[col]][tsvfiles$control == "b"])
  a_sd = sd(data[[col]][tsvfiles$control == "a"])
  b_sd = sd(data[[col]][tsvfiles$control == "b"])

  p1 <- ggplot(data, aes(x=control, y= !!sym(col), color = control)) + 
    geom_violin() + geom_boxplot(width = 0.1)  +
    geom_jitter(shape = 16, colour = "black", alpha = 0.5, width = 0.2) +
    scale_x_discrete(limits = rev(levels(as.factor(data$control)))) +
    coord_flip()
  column <- c(col, a_mean, a_sd, b_mean, b_sd)
  return(column)
}

when I do lapply

lapply(cols, generate_table, data=tsvfiles)

I get each of this

"par1"            "1.15285714285714"   "0.0543270519302177" "1.2055"             "0.0730879066964102" 
"par2"            "11.15285714285714"   "1.0543270519302177" "12.2055"             "3.0730879066964102" 
"par3"            "31.15285714285714"   "5.0543270519302177" "21.2055"             "2.0730879066964102" 

Now here is the real question

how do I apply the output from lapply and add it to the final_data like this

Description     par1                par2               par3
a_mean     1.15285714285714   11.15285714285714   31.15285714285714
a_sd       0.0543270519302177 1.0543270519302177  5.0543270519302177
b_mean     1.2055             12.2055             21.2055
b_sd       0.0730879066964102 3.0730879066964102  2.0730879066964102

This is before without even having a plot row I tried

final_data = cbind(final_data , lapply(cols, generate_table, data=tsvfiles))

and I can't get get final_data to have all the column from the function,

Thank you

Upvotes: 1

Views: 54

Answers (1)

Parfait
Parfait

Reputation: 107567

Avoid looping with lapply and consider aggregate after reshaping your wide data to long format with reshape. Long or tidy data is usually the preferred format in most data analytics facilitating aggregation, merge, append processes and plotting and modeling needs. Also, avoiding iteratively running plots and consider facet_wrap on needed dimension.

# RESHAPE WIDE TO LONG
tsv_long_df <- reshape(tsvfiles, varying = names(tsvfiles)[3:ncol(tsvfiles)],
                       times = names(tsvfiles)[3:ncol(tsvfiles)],
                       v.names = "value", timevar = "par", ids = NULL,
                       new.row.names = 1:1E4, direction = "long")    
head(tsv_long_df)
#   experiment   control  par value
# 1          1 a_control par1     1
# 2          1 b_control par1     5
# 3          2 a_control par1     2
# 4          2 b_control par1     3
# 5          3 a_control par1     4
# 6          3 b_control par1     2

# AGGREGATE WITH MERGE FOR BOTH CONTROLS
agg_raw <- merge(aggregate(value ~ par, subset(tsv_long_df, control == "a_control"), 
                           FUN=function(x) c(mean=mean(x), sd=sd(x))),
                 aggregate(value ~ par, subset(tsv_long_df, control == "b_control"), 
                           FUN=function(x) c(mean=mean(x), sd=sd(x))),
                 by="par", suffixes=c("_a", "_b"))

agg_df <- do.call(data.frame, agg_raw)   

Output

# DATA FRAME
agg_df <- setNames(agg_df, gsub("value_", "", names(agg_df)))
agg_df                         
#    par    a.mean      a.sd    b.mean      b.sd
# 1 par1  2.333333  1.527525  3.333333  1.527525
# 2 par2 11.000000  0.000000 12.000000  1.000000
# 3 par3 35.333333 14.502873 29.000000  7.211103

# MATRIX (TRANSPOSED FROM ABOVE)
agg_mat <- setNames(data.frame(t(agg_df[-1])), agg_df$par)
agg_mat
#            par1 par2      par3
# a.mean 2.333333   11 35.333333
# a.sd   1.527525    0 14.502873
# b.mean 3.333333   12 29.000000
# b.sd   1.527525    1  7.211103

Online Demo (using posted data)


Plot (one call using long data with facet_wrap)

ggplot(tsv_long_df, aes(x=control, y=value, color = control)) + 
  geom_violin() + geom_boxplot(width = 0.1)  +
  geom_jitter(shape = 16, colour = "black", alpha = 0.5, width = 0.2) +
  coord_flip() + facet_wrap(~par, ncol=1)

Plot Output

Upvotes: 2

Related Questions