chance.will

Reputation: 41

What is the easiest way to find the smallest interval that contains 90% of the values in an array using R?

I'm given arrays of numbers between 1 and 4, but usually they don't differ more than .5 between the min and max. The difference between each element is no smaller than .1. I want to find the smallest margin that contains at least 90% (or some other specified rate) of the elements.

That is, given the array

c(1, 1.9, 2, 2, 2, 2, 2.1, 2.2, 2.3, 2.3)

I want my function to return .4, because 2.3 - 1.9 = .4 is smaller than 2.3 - 1 = 1.3.

I tried to build the function a few times, but it keeps growing overly complicated, and I'm wondering if there's a simple way to do this that I haven't considered.

Edit: it has to handle skewed distributions. I don't have any finished examples of my code since I keep rewriting it, but I'll put something together and post it.

Edit2: I can't provide the actual arrays I want to feed into the function, but here's a line that generates similar values. It doesn't matter that the results fall outside the 1-to-4 range, as long as the function works.

x = round(rbeta(20,5,2)*100)/10

Upvotes: 3

Views: 1096

Answers (3)

Frank

Reputation: 66819

Here's one way (same as @Aaron's except head/tail instead of x[i]):

x = c(1, 1.9, 2, 2, 2, 2, 2.1, 2.2, 2.3, 2.3)
xn = length(x)

# number of elements to drop
n = round(0.1*xn) 

# achievable ranges
v = tail(x, n+1) - head(x, n+1)

min(v)
# [1] 0.4

Confirmation that a subvector of x dropping n elements really has this range:

n_up = which.min(v) - 1
n_dn = n-n_up

xs = x[(1 + n_up):(xn - n_dn)]

diff(range(xs))
# [1] 0.4
length(x) - length(xs) == n
# [1] TRUE

Testing on new example:

set.seed(1)
x0 = round(rbeta(20,5,2)*100)/10
x = sort(x0)
xn = length(x)

n = round(0.1*xn)
v = tail(x, n+1) - head(x, n+1)

min(v)
# [1] 4.1

# confirm...
n_up = which.min(v) - 1
n_dn = n-n_up    
xs = x[(1 + n_up):(xn - n_dn)]

diff(range(xs))
# [1] 4.1
length(x) - length(xs) == n
# [1] TRUE

Partial sorting might be sufficient (just to get the top and bottom values on the ends); see ?sort.
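For large inputs, a full sort is unnecessary: only the `n+1` smallest and `n+1` largest values need to be in their correct positions. A sketch of that idea using the `partial` argument of `sort` (variable names are mine):

```r
x0 <- c(2.2, 1, 2, 2.3, 2, 1.9, 2, 2.3, 2.1, 2)  # unsorted copy of the example
xn <- length(x0)
n  <- round(0.1 * xn)                 # number of elements to drop
idx <- c(seq_len(n + 1), (xn - n):xn) # positions we need correct
xp <- sort(x0, partial = idx)         # only positions in idx are guaranteed sorted
min(tail(xp, n + 1) - head(xp, n + 1))
# [1] 0.4
```

Per `?sort`, when `partial` is supplied only the values at those indices are guaranteed to match the fully sorted result, which is all the head/tail computation needs.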

Upvotes: 4

Aaron - mostly inactive
Aaron - mostly inactive

Reputation: 37764

The easiest way is brute force: test every window that contains 90% of the values. Figure out how many terms that is, determine which indices such a window can start at, compute the width of each window, and take the minimum.

x <- c(1, 1.9, 2, 2, 2, 2, 2.1, 2.2, 2.3, 2.3)
n <- ceiling(length(x)*0.9)   # get the number of terms needed to include 90%
k <- 1 : (length(x) - n + 1)  # get the possible indices the range can start at
x <- sort(x)                  # need them sorted...
d <- x[k + n - 1] - x[k]      # get the difference starting at each range
min(d)                        # get the smallest difference
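Wrapped up as a reusable function (the name `smallest_interval` and the `rate` argument are my additions), the same steps read:

```r
smallest_interval <- function(x, rate = 0.9) {
  x <- sort(x)                        # windows only make sense on sorted data
  n <- ceiling(length(x) * rate)      # terms needed to cover `rate` of x
  k <- 1:(length(x) - n + 1)          # possible window start indices
  min(x[k + n - 1] - x[k])            # narrowest window width
}

smallest_interval(c(1, 1.9, 2, 2, 2, 2, 2.1, 2.2, 2.3, 2.3))
# [1] 0.4
```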

Upvotes: 5

Rui Barradas

Reputation: 76470

This can be solved with quantile.

  1. Compute the 0.05 and 0.95 quantiles.
  2. Get the values of x that fall within those limits; call this vector in_90.
  3. Return the difference between the minimum and the maximum of in_90.

The sequence of instructions would be this.

qq <- quantile(x, c(0.05, 0.95))
in_90 <- x[qq[1] <= x & x <= qq[2]]
diff(range(in_90))
#[1] 0.4

As a function:

amplitude <- function(x, conf = 0.9){
  quants <- c((1 - conf)/2, 1 - (1 - conf)/2)
  qq <- quantile(x, quants)
  inside <- x[qq[1] <= x & x <= qq[2]]
  diff(range(inside))
}

amplitude(x)
#[1] 0.4

Data.

x <- c(1, 1.9, 2, 2, 2, 2, 2.1, 2.2, 2.3, 2.3)

Upvotes: 1
