Viktor Ek

Reputation: 353

Data rows, variables and values get reordered during computations

I have a problem with R when adding my data to a matrix prior to drawing:

> resFile <- read.csv("file.csv")
> print(resFile)
     Gene Virus  Expression    Percentage
1    ga   1Virus 2.738598e-02  38.590745
2    ga   2Virus 3.247252e-02  64.331929
3    ga   PIC    4.235604e-02  114.348940
4    ga   MOCK   1.976032e-02  0.000000        
> samples <- unique(resFile$Virus)
> genes <- unique(resFile$Gene)
> numGene <- length(genes)
> numSmpl <- length(samples)

> mat <- matrix(ncol=numGene,nrow=numSmpl,dimnames=list(samples,genes))
> mat[samples,genes]<-resFile$Percentage
> print(mat)
              ga 
1Virus  38.59074
2Virus  64.33193
PIC      0.00000
MOCK   114.34894

As you can see, the percentage values are switched between my PIC and MOCK samples. This happens to whole columns as well, and it looks like the values are added in alphabetical order rather than in index order.

Why is this happening and how can I get around it?

Upvotes: 1

Views: 61

Answers (2)

smci

Reputation: 33960

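You only got this weirdness because you read in the CSV with the irritating default stringsAsFactors = TRUE (the default before R 4.0.0). Hence all your string columns become factors, and their levels are sorted alphabetically by default rather than kept in order of appearance. Note that ordered=T would not help here, since it does not change the level order. You could read them in as strings, then convert with factor(..., levels = unique(...)) if you want a factor whose levels follow the file order.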
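A minimal sketch of reading the columns as strings and, if a factor is still wanted, fixing its level order to the file order. The inline CSV text and the name virus_f are made up for illustration; only the column names and values come from the question:

```r
# Inline copy of the question's data (hypothetical stand-in for file.csv)
csv_text <- "Gene,Virus,Percentage
ga,1Virus,38.590745
ga,2Virus,64.331929
ga,PIC,114.348940
ga,MOCK,0.000000"

# Keep string columns as character vectors instead of factors
resFile <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# If a factor is needed, set the levels explicitly in order of appearance
virus_f <- factor(resFile$Virus, levels = unique(resFile$Virus))

levels(virus_f)      # level order now follows the file, not the alphabet
as.integer(virus_f)  # integer codes now run in row order
```

With the levels set this way, the integer codes and the row order agree, so indexing by the factor and indexing by its labels pick the same rows.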

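So whenever you see someone construct a matrix or vector from unique(df$factorCol), as opposed to levels(df$factorCol), you run into that ordering issue again, unless the factor's levels happen to be in order of appearance. The reason is that indexing with a factor uses its integer codes, not its labels.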
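To see the effect in isolation, here is a small sketch that rebuilds the Virus column as a factor by hand. The names virus, pct, by_factor and by_name are made up for illustration, and the percentages are rounded. It relies on the documented behaviour that a factor used as an index is treated as its integer codes:

```r
# Hypothetical stand-in for the Virus column as read with stringsAsFactors = TRUE
virus <- factor(c("1Virus", "2Virus", "PIC", "MOCK"))

levels(virus)      # levels are sorted, so MOCK comes before PIC
as.integer(virus)  # codes follow the level order, not the row order

pct <- c(38.59, 64.33, 114.35, 0)
mat <- matrix(NA_real_, nrow = 4, ncol = 1,
              dimnames = list(as.character(virus), "ga"))

# Indexing by the factor uses its integer codes as row numbers
by_factor <- mat
by_factor[virus, "ga"] <- pct

# Indexing by character matches against the row names
by_name <- mat
by_name[as.character(virus), "ga"] <- pct
```

Here by_factor ends up with the PIC and MOCK values swapped, as in the question, while by_name keeps each value on its own row.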

In your case you don't even need to create a matrix; you could get your x,y-series directly from a dataframe slice, resFile[, c('Virus','Percentage')]:

resFile <- read.csv("file.csv", stringsAsFactors=FALSE)

resFile[, c('Virus','Percentage')]
   Virus Percentage
1 1Virus   38.59074
2 2Virus   64.33193
3    PIC  114.34894
4   MOCK    0.00000

> as.matrix(resFile[, c('Virus','Percentage')])
     Virus    Percentage 
[1,] "1Virus" " 38.59074"
[2,] "2Virus" " 64.33193"
[3,] "PIC"    "114.34894"
[4,] "MOCK"   "  0.00000"
# Creating a matrix from slices of dataframe isn't desirable, not just for the row-ordering, but also because all entries are coerced to string. So just don't do it.

Now if you wanted in general to get a group for each Gene, then select just the Virus, Percentage columns, use dplyr:

> require(dplyr)
> ga_slice <- resFile %>% group_by(Gene) %>% select(Virus,Percentage) %>% ungroup() %>% select(-Gene)
> ga_slice
Source: local data frame [4 x 2]

   Virus Percentage
1 1Virus   38.59074
2 2Virus   64.33193
3    PIC  114.34894
4   MOCK    0.00000

Upvotes: 1

Miff

Reputation: 7941

You've got a couple of problems here with the line:

mat[samples,genes] <- resFile$Percentage

Firstly, if samples and genes are taken from resFile, they'll probably be factors rather than strings. Indexing with a factor uses its integer codes rather than its labels, so if the order of samples or genes differs from the order of the factor levels you'll get the rows or columns shuffled.

Secondly, this assigns resFile$Percentage to the whole block of mat whose row names are in samples and whose column names are in genes, rather than matching each value to its own (row, column) pair.

To get round these problems, try the following (I'm making assumptions about how you generated samples and genes):

resFile <- data.frame(Gene="ga",
                      Virus=c("1Virus","2Virus","PIC","MOCK"),          
                      Percentage=c(38.59,64.33,114.34,0))
samples <- unique(resFile$Virus)
genes <- unique(resFile$Gene)
numGene <- length(genes)
numSmpl <- length(samples)
mat <- matrix(ncol=numGene,nrow=numSmpl,dimnames=list(samples,genes))

mat[cbind(as.character(resFile$Virus), as.character(resFile$Gene))] <- resFile$Percentage
mat
#            ga
# 1Virus  38.59
# 2Virus  64.33
# PIC    114.34
# MOCK     0.00

The key differences are that I've converted the factor variables to character, and indexed using a matrix rather than two vectors - see ?'[' for a better explanation of indexing by arrays than I can manage.
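The difference between the two indexing forms only shows up once there is more than one gene. A rough sketch, assuming a hypothetical second gene gb and made-up percentages (none of these values come from the question):

```r
# Hypothetical long-format data: two genes x two samples, made-up values
df <- data.frame(Gene  = c("ga", "ga", "gb", "gb"),
                 Virus = c("PIC", "MOCK", "MOCK", "PIC"),
                 Percentage = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)

empty <- matrix(NA_real_, nrow = 2, ncol = 2,
                dimnames = list(c("PIC", "MOCK"), c("ga", "gb")))

# Two index vectors address every (row, column) combination, so the
# four values are recycled over that block instead of being paired up
block <- empty
block[df$Virus, df$Gene] <- df$Percentage

# A two-column character matrix addresses one cell per row of the index:
# (PIC, ga), (MOCK, ga), (MOCK, gb), (PIC, gb)
pairwise <- empty
pairwise[cbind(df$Virus, df$Gene)] <- df$Percentage

pairwise
```

Only pairwise puts each percentage in the cell named by its own Virus and Gene; block ends up with whatever value was written last to each cell.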

Upvotes: 2
