user1613628

Reputation: 45

Selecting values above a threshold from a matrix gives wrong results. Can someone explain why?

I am new to R and I am using the Shapiro-Wilk test to test a set of data for normality. My problem is not in using the test but in generating a results table to identify the rows of results where the p-value is greater than 0.05. To illustrate my question, I am using the golub dataset which gives a series of gene expression values from "ALL" and "AML" patients.

What I have done is as follows:

library (multtest)
data (golub)
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))

# the golub dataset has the expression values for 3051 genes so I've decided to use only the first 10 genes from the dataset to make it easier to work with
ALL10 <- golub[1:10, gol.fac=="ALL"]

# calculate Shapiro-Wilk test for normality
sh10 <- apply (ALL10, 1, function(x) shapiro.test(x)$p.value)

# get the names of the first 10 genes from the golub.gnames matrix
ALL10names <- golub.gnames[1:10,2]

# combine gene names with normality p-value scores 
list10 <- cbind(ALL10names,sh10)

# find those that have normal distribution
normdist<- list10[,2]>0.05

# print a list of those with normal distribution
list10[which(normdist),]

The result I get is:

              ALL10names                                sh10                  
[1,] "AFFX-HUMISGF3A/M97935_MA_at (endogenous control)" "2.97359627770755e-07"
[2,] "AFFX-HUMISGF3A/M97935_3_at (endogenous control)"  "0.299103621399385"   
[3,] "AFFX-HUMGAPDH/M33197_5_at (endogenous control)"   "6.60564216346286e-07"
[4,] "AFFX-HUMGAPDH/M33197_M_at (endogenous control)"   "6.81945800629973e-07"
[5,] "AFFX-HSAC07/X00351_5_at (endogenous control)"     "3.3088559810058e-06" 
[6,] "AFFX-HSAC07/X00351_M_at (endogenous control)"     "1.30227973255158e-08"

As you can see, this is wrong! Several of these values are actually < 0.05, and only one is actually > 0.05 (which is what I want).

If I do:

which(normdist)
[1]  1  3  7  8  9 10

but

which (sh10 > 0.05)
[1] 3

so obviously the error occurred at

normdist<- list10[,2]>0.05

My question is: why did this happen? I want every row whose value in column 2 of list10 is > 0.05. The code looks right, yet I am getting a wrong result. As I said, I am learning R, so I want to understand what went wrong so that I don't repeat my mistake. Thanks in advance!

Upvotes: 1

Views: 339

Answers (1)

mnel

Reputation: 115392

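Your problem is that cbind(ALL10names,sh10) creates a matrix, and a matrix can hold only one type, so everything is coerced to character. The comparison list10[,2]>0.05 is then a string comparison (0.05 is converted to "0.05"), which is why a value such as "2.97359627770755e-07" counts as greater than 0.05.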
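
To see the coercion in isolation, here is a minimal sketch with toy data. The names gene_names and p_values and their values are made up for illustration; they are not taken from the golub dataset.

```r
# Hypothetical stand-ins for the gene names and the Shapiro-Wilk p-values
gene_names <- c("geneA", "geneB", "geneC")
p_values   <- c(2.97e-07, 0.299, 0.0012)

# cbind() on a character and a numeric vector yields a character matrix
m <- cbind(gene_names, p_values)
is.character(m)            # TRUE: the p-values are now strings

# 0.05 is converted to "0.05" and the strings are compared character by character
m[, 2] > 0.05              # TRUE TRUE FALSE ("2.97e-07" sorts after "0.05")

# Converting back to numeric restores the intended comparison
as.numeric(m[, 2]) > 0.05  # FALSE TRUE FALSE
```

Converting with as.numeric() works here, but it is a patch on top of the coercion rather than a way of avoiding it.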

Create a data.frame instead (its columns can have different types):

list10 <- data.frame(ALL10names,sh10)

and everything will work as you want.
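
As a rough, self-contained sketch of the data.frame approach, again with made-up toy data (gene_names, p_values, and the column names gene and p are hypothetical, not from the golub dataset):

```r
# Hypothetical stand-ins for the gene names and the Shapiro-Wilk p-values
gene_names <- c("geneA", "geneB", "geneC")
p_values   <- c(2.97e-07, 0.299, 0.0012)

# A data.frame keeps each column's own type, so p stays numeric
df <- data.frame(gene = gene_names, p = p_values)
is.numeric(df$p)                    # TRUE

# Numeric comparison now selects only the rows with p > 0.05
normal <- df[df$p > 0.05, ]
normal                              # only the geneB row remains

# Alternative: keep the character matrix for display, but index it
# with the original numeric vector instead of its character column
m <- cbind(gene_names, p_values)
m[p_values > 0.05, , drop = FALSE]
```

In the question's terms, the second variant corresponds to indexing list10 with sh10 > 0.05 rather than with list10[,2] > 0.05.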

Upvotes: 5
