Data rows, variables and values get reordered during computations

Question

I have a problem with R when adding my data to a matrix prior to drawing:

> resFile <- read.csv("file.csv")
> print(resFile)
  Gene  Virus   Expression Percentage
1   ga 1Virus 2.738598e-02  38.590745
2   ga 2Virus 3.247252e-02  64.331929
3   ga    PIC 4.235604e-02 114.348940
4   ga   MOCK 1.976032e-02   0.000000
> samples <- unique(resFile$Virus)
> genes <- unique(resFile$Gene)
> numGene <- length(genes)
> numSmpl <- length(samples)

> mat <- matrix(ncol=numGene,nrow=numSmpl,dimnames=list(samples,genes))
> mat[samples,genes]<-resFile$Percentage
> print(mat)
              ga 
1Virus  38.59074
2Virus  64.33193
PIC      0.00000
MOCK   114.34894

As you can see, the percentage values are switched between my PIC and MOCK samples. This happens to whole columns as well, and it looks like the values are assigned in alphabetical order of the names instead of in the order they appear in the data.

Why is this happening and how can I get around it?

Miff · Accepted Answer

You've got a couple of problems here with the line:

mat[samples,genes] <- resFile$Percentage

First, if samples and genes are taken from resFile, they are probably factors rather than character strings. Indexing with a factor uses its integer codes, not its labels, so if the order of samples or genes differs from the order of the factor levels (which are sorted alphabetically by default), the rows or columns get shuffled.
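
To see the factor effect in isolation, here is a minimal sketch. The virus vector is a hypothetical stand-in for resFile$Virus, built explicitly as a factor, and the values are the rounded percentages from the question:

```r
# Hypothetical stand-in for the Virus column, built explicitly as a factor
virus <- factor(c("1Virus", "2Virus", "PIC", "MOCK"))
levels(virus)      # levels are sorted, so "MOCK" comes before "PIC"
as.integer(virus)  # integer codes behind the labels

mat <- matrix(NA_real_, nrow = 4, ncol = 1,
              dimnames = list(as.character(virus), "ga"))

# Indexing with a factor uses its integer codes, so this behaves like
# mat[as.integer(virus), "ga"] and not like indexing by row name
mat[virus, "ga"] <- c(38.59, 64.33, 114.34, 0)
mat
```

The PIC and MOCK rows end up swapped, which matches the symptom described in the question.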

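Second, indexing with two vectors selects every combination of the rows named in samples and the columns named in genes (a rectangular block of mat) and fills that block column by column with resFile$Percentage, rather than matching each value to its own (sample, gene) pair.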
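
The difference between the two indexing styles is easier to see with more than one gene. This is only a sketch: the second gene "gb" and the values 1 to 4 are invented for illustration.

```r
# Hypothetical 2 x 2 example; "gb" is an invented second gene
mat <- matrix(NA_real_, nrow = 2, ncol = 2,
              dimnames = list(c("PIC", "MOCK"), c("ga", "gb")))
rows <- c("PIC", "MOCK", "PIC", "MOCK")
cols <- c("ga", "ga", "gb", "gb")
vals <- c(1, 2, 3, 4)

# Two index vectors select the whole block rows x cols (every combination),
# so four row names and four column names give a 4 x 4 selection
dim(mat[rows, cols])

# A two-column matrix selects one cell per row of the index matrix:
# (PIC, ga), (MOCK, ga), (PIC, gb), (MOCK, gb)
mat[cbind(rows, cols)] <- vals
mat
```

With the two-column matrix, each value lands in exactly the cell named by its own row/column pair.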

To get around these problems, try the following (I'm making assumptions about how you generated samples and genes):

resFile <- data.frame(Gene="ga",
                      Virus=c("1Virus","2Virus","PIC","MOCK"),
                      Percentage=c(38.59,64.33,114.34,0))
samples <- unique(resFile$Virus)
genes <- unique(resFile$Gene)
numGene <- length(genes)
numSmpl <- length(samples)
mat <- matrix(ncol=numGene,nrow=numSmpl,dimnames=list(samples,genes))

# Index with a two-column character matrix: one (row name, column name) pair per value
mat[cbind(as.character(resFile$Virus), as.character(resFile$Gene))] <- resFile$Percentage
mat
#            ga
# 1Virus  38.59
# 2Virus  64.33
# PIC    114.34
# MOCK     0.00

The key differences are that I've converted the factor variables to character, and indexed using a single two-column matrix (one row name and one column name per value) rather than two separate vectors - see ?'[' for a better explanation of indexing by arrays than I can manage.
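
Another option, sketched here as an assumption about your setup rather than something tested against your file, is to stop the columns becoming factors when the data is read. The inline csv string below stands in for file.csv, with values copied from the question. As far as I know, stringsAsFactors = FALSE has been the default for read.csv since R 4.0.0, so passing it explicitly mainly matters on older versions.

```r
# Inline CSV standing in for file.csv (values copied from the question)
csv <- "Gene,Virus,Percentage
ga,1Virus,38.590745
ga,2Virus,64.331929
ga,PIC,114.348940
ga,MOCK,0.000000"

# stringsAsFactors = FALSE keeps Gene and Virus as character vectors
resFile <- read.csv(text = csv, stringsAsFactors = FALSE)
samples <- unique(resFile$Virus)
genes <- unique(resFile$Gene)

mat <- matrix(NA_real_, nrow = length(samples), ncol = length(genes),
              dimnames = list(samples, genes))

# The columns are already character, so no as.character() is needed
mat[cbind(resFile$Virus, resFile$Gene)] <- resFile$Percentage
mat
```

The matrix index is still needed here; reading the columns as character only removes the factor reordering, not the pairwise-assignment issue.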
