mm8

Reputation: 13

Collapse doubles in variable names

I have a matrix of gene names with expression values in different tissues. However, the analyses were performed independently and not all genes are present in all tissues. The gene lists for each tissue were simply pasted below each other. Right now it looks like this:

 GeneName  Tissue A  Tissue B
 Gene A    1         -
 Gene B    1         -
 Gene C    2         -
 Gene A    -         3
 Gene D    -         2

I would like to collapse the duplicated gene names so that I get a matrix like the following:

 GeneName  Tissue A  Tissue B
 Gene A    1         3
 Gene B    1         -
 Gene C    2         -
 Gene D    -         2

Edit: Thanks for the answer. However, I forgot to mention that the gene names are in a column of their own, while the row names are simply the numbers 1 to n. I tried to set the name column as the row names with row.names(mydataframe) <- mydataframe$GeneName, but got the following error message:

 Error in `row.names<-.data.frame`(`*tmp*`, value = c(578L, 510L, 1707L, : duplicate 'row.names' are not allowed
 In addition: Warning message: non-unique values when setting 'row.names':

As I understand it, I can't use a column with non-unique values as row names, which seems to put me in a catch-22 if I need to name the rows after the gene name column in order to collapse the matrix?

Upvotes: 1

Views: 112

Answers (1)

akrun

Reputation: 887291

Assuming that the missing values are 'NA' and the 'Tissue.B' value in the output for 'Gene D' is 2, you may use

 res <- rowsum(m1, row.names(m1), na.rm=TRUE)
 is.na(res) <- res==0
 res
 #       Tissue.A Tissue.B
 #Gene A        1        3
 #Gene B        1       NA
 #Gene C        2       NA
 #Gene D       NA        2
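One caveat with the step above: `res == 0` also blanks any gene whose observed values genuinely sum to zero. A minimal sketch of one way around that, assuming real zeros can occur in the data, is to count the non-missing entries per gene and set only the cells with no observations to NA. The name `n_obs` is introduced here for illustration and is not part of the original answer.

```r
# Same matrix as in the 'data' section of the answer.
m1 <- matrix(c(1L, 1L, 2L, NA, NA, NA, NA, NA, 3L, 2L), ncol = 2,
             dimnames = list(c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
                             c("Tissue.A", "Tissue.B")))

# Sum per gene, with NA treated as 0.
res <- rowsum(m1, row.names(m1), na.rm = TRUE)

# Number of non-missing entries per gene and tissue (hypothetical helper matrix).
n_obs <- rowsum((!is.na(m1)) * 1L, row.names(m1))

# Blank only the cells that had no observation at all.
res[n_obs == 0] <- NA
res
```

Both calls to `rowsum` use the same grouping vector, so `res` and `n_obs` have the same row order and can be compared cell by cell.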

If it is a data.frame with 'GeneName' as a column, there is no need to set row names; group by that column instead

 library(dplyr)
 df1 %>%
    group_by(GeneName) %>%
    # sum every tissue column within each gene, ignoring NA
    summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
 #    GeneName Tissue.A Tissue.B
 #1   Gene A        1        3
 #2   Gene B        1        0
 #3   Gene C        2        0
 #4   Gene D        0        2

and we can replace the 0 with NA as before.
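For a data.frame, the replacement has to skip the 'GeneName' column. The following is a sketch in base R; the summarised result is recreated by hand so the snippet runs on its own, and `res_df` and `num_cols` are names assumed here for illustration.

```r
# Hand-built stand-in for the summarised result shown above (hypothetical name).
res_df <- data.frame(GeneName = c("Gene A", "Gene B", "Gene C", "Gene D"),
                     Tissue.A = c(1L, 1L, 2L, 0L),
                     Tissue.B = c(3L, 0L, 0L, 2L))

# Restrict the replacement to the numeric columns so GeneName is untouched.
num_cols <- setdiff(names(res_df), "GeneName")
res_df[num_cols][res_df[num_cols] == 0] <- NA
res_df
```

As with the matrix version, this treats every 0 as missing, so it only suits data where a real sum of zero cannot occur.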

Or using aggregate from base R

  aggregate(.~GeneName, df1, sum, na.rm=TRUE, na.action=NULL)
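If setting row names is the sticking point, it may help to note that `rowsum` takes the grouping vector as a separate argument and also accepts a data.frame of numeric columns, so the 'GeneName' column can be passed in directly. This is a sketch under the assumption that all columns other than 'GeneName' are numeric; `tissue_cols` is a name introduced here.

```r
df1 <- data.frame(GeneName = c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
                  Tissue.A = c(1L, 1L, 2L, NA, NA),
                  Tissue.B = c(NA, NA, NA, 3L, 2L))

# Numeric columns to be summed (assumed: everything except GeneName).
tissue_cols <- setdiff(names(df1), "GeneName")

# Group by the GeneName column; duplicated names are fine because they are
# used as a grouping vector, not as row names of the input.
res <- rowsum(df1[tissue_cols], group = df1$GeneName, na.rm = TRUE)
res
```

The result has one row per unique gene, with the gene names as row names. Genes with no observations in a tissue come out as 0, which can then be turned into NA with the `is.na(res) <- res == 0` step used for the matrix.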

data

 m1 <- structure(c(1L, 1L, 2L, NA, NA, NA, NA, NA, 3L, 2L),
                 .Dim = c(5L, 2L),
                 .Dimnames = list(c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
                                  c("Tissue.A", "Tissue.B")))

 df1 <- structure(list(GeneName = c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
                       Tissue.A = c(1L, 1L, 2L, NA, NA),
                       Tissue.B = c(NA, NA, NA, 3L, 2L)),
                  .Names = c("GeneName", "Tissue.A", "Tissue.B"),
                  class = "data.frame", row.names = c(NA, -5L))

Upvotes: 3
