user3354212
user3354212

Reputation: 1112

how to merge columns by group to create a new data frame in r

I would like to ask experts to help me to combine different number of columns by group (sample). Each sample may have 1~3 replicates. I need to merge data from replicates into samples. For each row, Within each sample if the character values across all replicates are the same, just take one; if different, ignoring "NC" or "AB" when "AA" or "BB" exist; if only "NC" and "AB" exist in the replicates, take "AB" ignoring "NC"; if "AA" and "BB" exist in different replicates, give "BA" to that data point. the data frame is:

df = read.table(text="ID    SP01_1  SP01_2  SP02_1  SP02_2  SP03_1  SP03_2  SP03_3
T001    AA  AA  AA  AB  AA  AB  AA
T002    AA  AB  BB  AA  AA  AA  AA
T003    AB  AB  BB  BB  AA  BB  BB
T004    NC  NC  NC  BB  BB  NC  BB
T005    BB  AB  BB  NC  AA  BB  AB
T006    AB  BB  BB  BB  NC  AB  BB
T007    AA  NC  AA  BB  AA  AA  BB
T008    AB  BB  BB  AB  BB  BB  BB", header=T, stringsAsFactors=F)

the result is expected as:

ID  SP01    SP02    SP03
T001    AA  AA  AA
T002    AA  BA  AA
T003    AB  BB  BA
T004    NC  BB  BB
T005    BB  BB  BA
T006    BB  BB  BB
T007    AA  BA  BA
T008    BB  BB  BB

Upvotes: 2

Views: 1252

Answers (1)

kdauria
kdauria

Reputation: 6711

There are multiple ways to do this. I like using the reshape package.

# Rearrange the data
library(reshape)
mdf = melt(df,id.var="ID")
mdf$variable = substr(mdf$variable,1,4)
mdf$value = as.character(mdf$value)
head(mdf)
#     ID variable value
# 1 T001     SP01    AA
# 2 T002     SP01    AA
# 3 T003     SP01    AB
# 4 T004     SP01    NC
# 5 T005     SP01    BB
# 6 T006     SP01    AB

# The function to aggregate the replicates
foo = function(reps) {
  if( "AA" %in% reps && "BB" %in% reps ) {
    return("BA")
  } else {
    matches = c("AA","BB","AB","NC")
    return( matches[min(match(reps,matches))] )
  }
}

All the hard work is now done.

# A function to "cast" the melted data back into a data frame
cast(mdf, ID ~ variable, fun.aggregate=foo )
#     ID SP01 SP02 SP03
# 1 T001   AA   AA   AA
# 2 T002   AA   BA   AA
# 3 T003   AB   BB   BA
# 4 T004   NC   BB   BB
# 5 T005   BB   BB   BA
# 6 T006   BB   BB   BB
# 7 T007   AA   BA   BA
# 8 T008   BB   BB   BB

Upvotes: 1

Related Questions