user3816990
user3816990

Reputation: 247

Identify all outliers from regression analysis

I have a file with about 17000 rows and I preformed a simple linear regression on

    Gene_id expA expB
    GeneA 5.462109    5.006181
    GeneB 2.667692    4.208152
    GeneC 4.796976    4.122660
    GeneD 3.127125    3.676322
    GeneE 4.500583    4.104575
    GeneF 4.598430    4.853717

And I preformed the regression analysis using

plot(log2(data$expA)~log2(data$expB))
regression <- lm(log2(moved.data$expA)~log2(moved.data$expB))
abline(regression)

I am interested in what Genes which are outliers are from my regression analysis.

I tried using the identify(log2(data$expA)~log2(data$expB), row.names(data)) function but I have a lot of points in my graph so to go click over them one by one doesn't seem feasible to me.

I also looked here: Outliers with robust regression in R

but that doesn't tell me how to figure out the Gene names of the outliers

Upvotes: 2

Views: 12419

Answers (1)

Christopher Bottoms
Christopher Bottoms

Reputation: 11158

Aren't you really interested in the residuals (how much the results for each gene differs from the regression)? Residuals tell you how closely your data fits your model. In this case, you have a simple linear model. Having many large residuals would indicate that you need a better model.

I would never remove outliers if at all possible. If you do remove them, please be honest and report that they were removed and the criteria for their removal. To be completely honest and transparent, make your data and R script available.

Anyway, keeping what I just said in mind, here is a way to name the "outliers":

# Create sample data
Gene_id <- c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF")
expA    <- c(5.462109, 2.667692, 4.796976, 3.127125, 4.500583, 4.598430)
expB    <- c(5.006181, 4.208152, 4.122660, 3.676322, 4.104575, 4.853717)

# Calculate log base two for results from each experiment
log2_A <- log2(expA)
log2_B <- log2(expB)

# Plot data
plot(log2_A~log2_B)

# Calculate and display the regression line
regression <- lm(log2_A~log2_B)
abline(regression)

# Show regression formula
print(regression)

# Create data frame for sample data
data <- data.frame(Gene_id, expA, expB)

# Calculate residuals
data$residuals <- residuals(regression)

# Choose a threshhold
outlier_threshold <- 0.3

# Print only names of outliers
outliers <- data[ abs(data$residuals) > outlier_threshold, ]
print(outliers$Gene_id)

Upvotes: 2

Related Questions