Reputation: 247
I have a file with about 17000 rows and I preformed a simple linear regression on
Gene_id expA expB
GeneA 5.462109 5.006181
GeneB 2.667692 4.208152
GeneC 4.796976 4.122660
GeneD 3.127125 3.676322
GeneE 4.500583 4.104575
GeneF 4.598430 4.853717
And I preformed the regression analysis using
plot(log2(data$expA)~log2(data$expB))
regression <- lm(log2(moved.data$expA)~log2(moved.data$expB))
abline(regression)
I am interested in what Genes which are outliers are from my regression analysis.
I tried using the identify(log2(data$expA)~log2(data$expB), row.names(data))
function but I have a lot of points in my graph so to go click over them one by one doesn't seem feasible to me.
I also looked here: Outliers with robust regression in R
but that doesn't tell me how to figure out the Gene names of the outliers
Upvotes: 2
Views: 12419
Reputation: 11158
Aren't you really interested in the residuals (how much the results for each gene differs from the regression)? Residuals tell you how closely your data fits your model. In this case, you have a simple linear model. Having many large residuals would indicate that you need a better model.
I would never remove outliers if at all possible. If you do remove them, please be honest and report that they were removed and the criteria for their removal. To be completely honest and transparent, make your data and R script available.
Anyway, keeping what I just said in mind, here is a way to name the "outliers":
# Create sample data
Gene_id <- c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF")
expA <- c(5.462109, 2.667692, 4.796976, 3.127125, 4.500583, 4.598430)
expB <- c(5.006181, 4.208152, 4.122660, 3.676322, 4.104575, 4.853717)
# Calculate log base two for results from each experiment
log2_A <- log2(expA)
log2_B <- log2(expB)
# Plot data
plot(log2_A~log2_B)
# Calculate and display the regression line
regression <- lm(log2_A~log2_B)
abline(regression)
# Show regression formula
print(regression)
# Create data frame for sample data
data <- data.frame(Gene_id, expA, expB)
# Calculate residuals
data$residuals <- residuals(regression)
# Choose a threshhold
outlier_threshold <- 0.3
# Print only names of outliers
outliers <- data[ abs(data$residuals) > outlier_threshold, ]
print(outliers$Gene_id)
Upvotes: 2