Reputation: 353
I have a problem with R when adding my data to a matrix prior to drawing:
> resFile <- read.csv("file.csv")
> print(resFile)
Gene Virus Expression Percentage
1 ga 1Virus 2.738598e-02 38.590745
2 ga 2Virus 3.247252e-02 64.331929
3 ga PIC 4.235604e-02 114.348940
4 ga MOCK 1.976032e-02 0.000000
> samples <- unique(resFile$Virus)
> genes <- unique(resFile$Gene)
> numGene <- length(genes)
> numSmpl <- length(samples)
> mat <- matrix(ncol=numGene,nrow=numSmpl,dimnames=list(samples,genes))
> mat[samples,genes]<-resFile$Percentage
> print(mat)
ga
1Virus 38.59074
2Virus 64.33193
PIC 0.00000
MOCK 114.34894
As you can see, the percentage values are switched between my PIC and MOCK samples. This happens to whole columns as well, and it looks like the values are being added in alphabetical order rather than in index order.
Why is this happening and how can I get around it?
Upvotes: 1
Views: 61
Reputation: 33960
You only got this weirdness because you read in the CSV with the irritating default stringsAsFactors = TRUE (the default before R 4.0.0). Hence all your string columns become factors, with levels sorted alphabetically, and when a factor is used as an index R uses its integer codes rather than its labels. You could read them in as strings, then convert with factor(x, levels = unique(x)) if you wanted the levels to follow the order in the file.
Then, whenever you see someone construct a matrix/vector from unique(df$factorCol), as opposed to levels(), you run into that ordering issue again, unless the factor's levels already follow the order in which the values appear.
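To see the mechanism concretely, here is a minimal sketch that rebuilds the Virus column as a factor by hand (so it behaves the same on any R version) and compares indexing by the factor with indexing by its character labels. The rounded percentages and the variable names virus, pct and mat2 are just for illustration.

```r
# Rebuild the Virus column as a factor, the way read.csv() did by default before R 4.0.0
virus <- factor(c("1Virus", "2Virus", "PIC", "MOCK"))
pct   <- c(38.59, 64.33, 114.35, 0)

levels(virus)      # levels are sorted, so "MOCK" comes before "PIC"
as.integer(virus)  # underlying integer codes, one per element

samples <- unique(virus)
mat <- matrix(NA_real_, nrow = 4, ncol = 1,
              dimnames = list(as.character(samples), "ga"))

# Indexing with a factor uses its integer codes (row positions), not its labels
mat[samples, "ga"] <- pct
mat   # PIC and MOCK end up swapped

# Indexing with the labels matches against the row names instead
mat2 <- mat
mat2[as.character(samples), "ga"] <- pct
mat2  # values line up with their row names
```

The only difference between the two assignments is as.character(); that is what switches R from positional lookup to lookup by row name.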
In your case you don't even need to create a matrix; you can get your x,y series directly from a data frame slice, resFile[, c('Virus','Percentage')]:
> resFile <- read.csv("res.csv", stringsAsFactors=F)
> resFile[, c('Virus','Percentage')]
Virus Percentage
1 1Virus 38.59074
2 2Virus 64.33193
3 PIC 114.34894
4 MOCK 0.00000
> as.matrix(resFile[, c('Virus','Percentage')])
Virus Percentage
[1,] "1Virus" " 38.59074"
[2,] "2Virus" " 64.33193"
[3,] "PIC" "114.34894"
[4,] "MOCK" " 0.00000"
# Creating a matrix from a slice of a data frame that mixes column types isn't desirable, not just because of the row ordering, but also because every entry is coerced to a string. So just don't do it.
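If a matrix really is needed, for example for a plotting function that expects one, a possible workaround is to convert only the numeric column and attach the sample names as row names. This is a sketch that rebuilds the question's data inline rather than reading the CSV; the name m is hypothetical.

```r
# Recreate the question's data; strings are kept as character, not factor
resFile <- data.frame(
  Gene       = "ga",
  Virus      = c("1Virus", "2Virus", "PIC", "MOCK"),
  Percentage = c(38.590745, 64.331929, 114.348940, 0),
  stringsAsFactors = FALSE
)

# Convert only the numeric column, so nothing is coerced to string
m <- as.matrix(resFile["Percentage"])

# Carry the sample names along as row names, in file order
rownames(m) <- resFile$Virus

is.numeric(m)             # the matrix stays numeric
m["PIC", "Percentage"]    # look a value up by sample name
```

Because the row names come straight from the character column, the rows stay in file order and each value keeps its own label.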
Now if you wanted in general to get a group for each Gene, then select just the Virus, Percentage columns, use dplyr:
> require(dplyr)
> ga_slice <- resFile %>% group_by(Gene) %>% select(Virus,Percentage) %>% ungroup() %>% select(-Gene)
> ga_slice
Source: local data frame [4 x 2]
Virus Percentage
1 1Virus 38.59074
2 2Virus 64.33193
3 PIC 114.34894
4 MOCK 0.00000
Upvotes: 1
Reputation: 7941
You've got a couple of problems here with the line:
mat[samples,genes] <- resFile$Percentage
Firstly, if samples and genes are taken from resFile, they'll probably be factors rather than strings, so if the order of samples or genes differs from the order of the factor levels you'll get the rows or columns shuffled.
Secondly, this assigns resFile$Percentage to the whole block of mat whose row names are in samples and whose column names are in genes, rather than matching each value to its own (row, column) pair.
To get round these problems, try the following (I'm making assumptions about how you generated samples and genes):
resFile <- data.frame(Gene="ga",
Virus=c("1Virus","2Virus","PIC","MOCK"),
Percentage=c(38.59,64.33,114.34,0))
samples <- unique(resFile$Virus)
genes <- unique(resFile$Gene)
numGene <- length(genes)
numSmpl <- length(samples)
mat <- matrix(ncol=numGene,nrow=numSmpl,dimnames=list(samples,genes))
mat[cbind(as.character(resFile$Virus), as.character(resFile$Gene))] <- resFile$Percentage
mat
# ga
# 1Virus 38.59
# 2Virus 64.33
# PIC 114.34
# MOCK 0.00
The key differences are that I've converted the factor variables to character, and indexed using a matrix rather than two vectors; see ?'[' for a better explanation of indexing by arrays than I can manage.
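Pairwise indexing matters most once there is more than one gene. The sketch below assumes a second, made-up gene "gb" with invented percentages, listed in a different order from the first, and fills the matrix with a two-column character index (one row per cell); res and idx are illustrative names.

```r
# Hypothetical long-format data: "gb" and its values are invented for this example
res <- data.frame(
  Gene       = c("ga", "ga", "gb", "gb"),
  Virus      = c("PIC", "MOCK", "MOCK", "PIC"),
  Percentage = c(114.35, 0, 10, 50),
  stringsAsFactors = FALSE
)

samples <- unique(res$Virus)
genes   <- unique(res$Gene)
mat <- matrix(NA_real_, nrow = length(samples), ncol = length(genes),
              dimnames = list(samples, genes))

# Each row of idx names one cell: (row name, column name)
idx <- cbind(res$Virus, res$Gene)
mat[idx] <- res$Percentage
mat
```

Each value lands in the cell named by its own row of idx, so the order of rows in res does not matter.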
Upvotes: 2