HungryA

Reputation: 1

Finding extreme values in a normal distribution

I want to find extreme values (anything more than three standard deviations above or below the mean) after generating a set of random numbers using:

num = rnorm(1000)

My code looks like the following:

extreme = function(varname) {
  for(i in varname) {
    count = 0
    m     = mean(varname)
    sd    = 3*sd(varname)
    if(i<(m-sd) || i>(m+sd)) {
      count = count + 1
    }
  }
  if(count>0) {
    print(paste("There are ", count, " extreme values found.", sep = ""))
  } else print("There are no extreme values.")
}

I'm always getting "There are no extreme values." I'm a beginner in R, so are there truly no extreme values in any randomly generated set of numbers following a normal distribution?

Upvotes: 0

Views: 2391

Answers (2)

EngrStudent

Reputation: 2022

You need an outlier first. All the measurements are well behaved and properly drawn from the same distribution. They are all the legitimate children of a normal distribution. Outlier detection looks for the mutants/half-breeds/aliens. You need to have an alien in the mix.

Let's say, for discussion's sake, that you are measuring the coplanarity of solder balls on a chip (to ground this in the concrete). Let's say there are 1000 solder balls per part. Let's say that the manufacturing technician who puts solder balls in the hopper spilled in some of the wrong size (too small) and didn't tell anyone. Let's say that 10% of the balls are bad.

What this means, physically speaking, is that there are two clusters. The smaller balls are going to have center positions closer to the substrate, and they are going to have smaller naturally occurring variance.

Let's say that the POR (process of record) solder balls are 12 mils +/- 1.2 mils, and the wrong-size ones are 10 mils +/- 1.0 mils. You would simulate this as two normal components.

N  <- 1000        # total solder balls
n1 <- N * 0.90    # good solder balls
n2 <- N * 0.10    # bad solder balls

mu1 <- (12 - 10) * 0.10     # good cluster: 0.2 mils above the mixture mean
mu2 <- -(12 - 10) * 0.90    # bad cluster: 1.8 mils below the mixture mean
# (offsets weighted by the other fraction, so the overall mean stays at 0)

sig1 <- 1.2    # spread of the good balls
sig2 <- 1.0    # spread of the bad balls

num <- c(rnorm(n1, mean = mu1, sd = sig1),
         rnorm(n2, mean = mu2, sd = sig2))

Yay code.

You don't want to start throwing it against formulas that you aren't sure about. Like a bucket brigade for information, you want to keep as much water in the buckets as you pass them from person to person. Reducing your data to a summary statistic means you are keeping one number and losing 999: it can be an information-lossy operation.

The human brain is the best computer known. It makes Deep Blue or Tianhe look like an abacus. Use it first. Let it look at everything you have. Beat up your data with the tools that are best for the job. Stand on the shoulders of giants.

My suggestion: EDA (exploratory data analysis). NIST, the National Institute of Standards and Technology, has made good tools here. Their Einsteins are smart, and they made tools to enable people like you and me, mere mortals. So here is the EDA link: http://www.itl.nist.gov/div898/handbook/eda/eda.htm

Make some plots. Good plots that tell you about your data. The textbook starting point is the 4-plot. If you don't 4-plot your data, then you don't know it. Don't write an equation until you have made a few human-readable, understandable plots. It will protect you and make your results sound.

Aside: I don't think there is a good R library for making these plots, and I wish there were. It would be nice to go "4plot(mydata)" and get a 4-plot.
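
Here is a minimal sketch of such a helper, using only base graphics (the name four_plot and its layout are my own, not a NIST or CRAN function):

four_plot <- function(x) {
  op <- par(mfrow = c(2, 2))   # 2x2 panel layout
  on.exit(par(op))             # restore graphics settings when done
  plot(x, type = "l", main = "Run sequence")    # drift and shifts over index
  plot(head(x, -1), tail(x, -1),
       xlab = "x[i]", ylab = "x[i+1]", main = "Lag plot")   # serial structure
  hist(x, probability = TRUE, breaks = 20, main = "Histogram")
  qqnorm(x); qqline(x)         # normal probability plot
}

four_plot(num)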

So let's make a trend plot, a lag plot, a histogram, and a normal probability plot. These are ways of feeding the data into the mind to train it.

Here, you can make graphs with this:

plot(num, type="l")
lines(lowess(num,f=1/10),col="Red")
lag.plot(num)
hist(num,probability=TRUE, breaks=20)
qqnorm(num)
grid()

Given the nature of the problem, JEDEC says "compute the range" or "compute the max distance from the mean to the tails". These are two different metrics, and they start being able to detect outliers at roughly 2.8 to 3.0 standard deviations when you are looking for "Tiffanies" (a single exotic outlier). It is a fundamentally different proposition if you are looking for 1 outlier than for 100 (or even 2).

Personally, for the "Tiffany problem", I have metrics that trigger reliably at 1.55 standard deviations from the mean. Your best start, though, is with JEDEC and its 2.8 to 3.0 SD onset of detection.
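
As a rough sketch of those two metrics in R (my own reading of them; the 3-SD trigger comes from the question, not from a JEDEC document):

z_range <- (max(num) - min(num)) / sd(num)       # range of the data, in SD units
z_tail  <- max(abs(num - mean(num))) / sd(num)   # max distance from the mean, in SD units

z_range       # large values suggest more spread than N normal draws should give
z_tail > 3    # the question's 3-SD rule applied to the farthest point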

Best of luck.

Upvotes: 0

Harvey Motulsky

Reputation: 201

Setting aside the programming questions, this question also brings up a statistical one.

If your sample size is huge, then the sample SD computed from your values will be close to the population SD, and it may make sense to ask about values more than 3 SD from the mean.

But if your sample is small, any outlier will inflate the sample SD you compute from it. This means no value may ever get to 3 SDs from the mean.

Define Z as the distance from the mean in SD units: Z = (x - mean) / SD.

With a sample of N observations, Z can never be larger than (N - 1) / sqrt(N). Accordingly, N must be 11 or larger for there to be any possibility of an observation being more than 3 SD from the mean. Grubbs' outlier test is based on this idea, and so it has its own table of how many SDs from the mean define an outlier for a set value of alpha.
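
A quick numeric check of that bound in R (a sketch; the cutoff of 3 is the one from the question):

N <- 2:15
bound <- (N - 1) / sqrt(N)   # largest possible Z for a sample of size N
data.frame(N, bound, can_exceed_3 = bound > 3)
# bound first exceeds 3 at N = 11, where 10 / sqrt(11) is about 3.015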

Grubbs, F. E. Procedures for detecting outlying observations in samples. Technometrics 11, 1–21 (1969).

Upvotes: 2
