A5C1D2H2I1M1N2O1R2T1

Reputation: 193527

Function for median similar to "which.max" and "which.min" / Extracting median rows from a data.frame

I occasionally need to extract specific rows from a data.frame based on values from one of the variables. R has built-in functions for maximum (which.max()) and minimum (which.min()) that allow me to easily extract those rows.

Is there an equivalent for median? Or is my best bet to just write my own function?

Here's an example data.frame and how I would use which.max() and which.min():

set.seed(1) # so you can reproduce this example
dat = data.frame(V1 = 1:10, V2 = rnorm(10), V3 = rnorm(10), 
                 V4 = sample(1:20, 10, replace=T))

# To return the first row, which contains the max value in V4
dat[which.max(dat$V4), ]
# To return the seventh row, which contains the min value in V4
dat[which.min(dat$V4), ]

For this particular example, since there is an even number of observations, I would need two rows returned for the median: in this case, rows 2 and 10.

Update

It would seem that there is not a built-in function for this. As such, using the reply from Sacha as a starting point, I wrote this function:

which.median = function(x) {
  if (length(x) %% 2 != 0) {
    which(x == median(x))
  } else if (length(x) %% 2 == 0) {
    a = sort(x)[c(length(x)/2, length(x)/2+1)]
    c(which(x == a[1]), which(x == a[2]))
  }
}

I'm able to use it as follows:

# make one data.frame with an odd number of rows
dat2 = dat[-10, ]
# Median rows from 'dat' (even number of rows) and 'dat2' (odd number of rows)
dat[which.median(dat$V4), ]
dat2[which.median(dat2$V4), ]

Are there any suggestions to improve this?

Upvotes: 8

Views: 11856

Answers (7)

Yun

Reputation: 195

A simple function to do this:

which.median <- function(x) {
  ordering <- order(x)
  if ((len <- length(x)) == 0L) {
    integer()
  } else if (len %% 2L == 0) {
    ordering[len / 2L + 0:1]
  } else {
    ordering[(len + 1L) / 2L]
  }
}
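To see what an order()-based approach like this returns, here is a small self-contained check. The helper name middle_idx and the test vectors are made up for illustration; it is a condensed restatement of the same idea, not a replacement for the function above.

```r
# Hypothetical condensed helper: positions of the middle element(s) of x
middle_idx <- function(x) {
  n <- length(x)
  if (n == 0L) return(integer())
  # floor and ceiling of (n + 1) / 2 coincide for odd n and differ by 1 for even n
  order(x)[unique(c(floor((n + 1) / 2), ceiling((n + 1) / 2)))]
}

odd  <- c(30, 10, 50, 20, 40)      # median is 30, at position 1
even <- c(30, 10, 50, 20, 40, 60)  # middle values are 30 and 40

middle_idx(odd)   # 1
middle_idx(even)  # 1 5
```

Because the positions come from order(), ties are broken by position and the result always has one index for odd lengths and two for even lengths.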

Upvotes: 0

Feiming Chen

Reputation: 79

We only need a function that returns the locations of values by matching approximately:

match.approx <- function(x, y) {
    ## Purpose: Match Approximately for Numerical Data
    ## Arguments:
    ##   "x":  a vector of numeric values.
    ##   "y":  a vector of numeric values. 
    ## RETURN:
    ##   For each "x" value, the index of the closest value in "y".
    ## ________________________________________________
    
    sapply(x, function(x0) which.min(abs(x0 - y)))
}
if (FALSE) {  # example call, not run when the file is sourced
  match.approx(c(4.2, 1.2, 15), 1:10)                #  4  1 10
}

Here is an example of finding the locations of quantiles:

set.seed(1)
a <- rnorm(100)
match.approx(quantile(a), a)
# 0%  25%  50%  75% 100% 
# 14   29   23   63   61

Upvotes: 0

aprstar

Reputation: 101

Building on the answers given by Sacha and cbeleites, here is a function to get inclusive quantile indices. One difference from previous answers is that the type argument is exposed and will produce slightly different quantile results (see ?quantile). If performance is an issue, one could replace the sapply with a version from the parallel package - something like unlist(mclapply(...)).

# Extract indices corresponding to inclusive quantiles
# EXAMPLE:
#
#   x <- c(2.34, 5.83, NA, 9.34, 8.53, 6.42, NA, 8.07, NA, 0.77)
#   probs <- c(0, .23, .5, .6, 1)
#   which.quantile(x, probs, na.rm = TRUE)
#
# OUTPUT: 10  1  6  8  4
#
#   x[ which.quantile(x, probs, na.rm = TRUE) ]
#
# OUTPUT: 0.77 2.34 6.42 8.07 9.34
#
#   x <- c(2, 1, 3)
#   p <- c(0.5)
#   x[ which.quantile(x, p) ]
#
# OUTPUT: 2
which.quantile <- function (x,
                            probs,
                            na.rm = FALSE,
                            type = 7) {
  stopifnot(all(probs >= 0.0))
  stopifnot(all(probs <= 1.0))
  quants = quantile(x,
                    probs = probs,
                    na.rm = na.rm,
                    type = type)
  which.nearest <- function(quant) {
    return(which.min(abs(x - quant)))
  }
  return(sapply(X = quants, FUN = which.nearest))
}

Upvotes: 1

Yimai

Reputation: 87

Suppose the vector from which you want to get the median is x.

The expression min(x[x >= median(x)]) gives the median if length(x) = 2*n+1, or the larger of the two middle values if length(x) = 2*n; which(x == min(x[x >= median(x)])) then gives its position in x. You can tweak it slightly if you want the smaller of the two middle values.
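A minimal sketch of this idea on a made-up vector; the variable names upper and lower are illustrative. The values are mapped back to positions in x with which(), since an index computed on the subset x[x >= median(x)] refers to the subset rather than to x.

```r
x <- c(7, 3, 9, 1, 5, 11)        # even length; the two middle values are 5 and 7

upper <- min(x[x >= median(x)])  # larger of the two middle values: 7
lower <- max(x[x <= median(x)])  # smaller of the two middle values: 5

which(x == upper)  # 1
which(x == lower)  # 5
```

For an odd-length vector both expressions return the median itself, so the two which() calls give the same position.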

Upvotes: 2

cbeleites

Reputation: 14093

While Sacha's solution is quite general, the median (like other quantiles) is an order statistic, so you can calculate the corresponding indices from order (x) (instead of using sort (x) to get the quantile values).

Looking into quantile, types 1 or 3 could be used; all others lead to (weighted) averages of two values in certain cases.
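The difference between the quantile types can be seen on a small made-up vector. This is only an illustration of the point about averaging, not part of the function:

```r
x <- c(10, 20, 30, 40)

quantile(x, 0.5, type = 7)  # 25, an average that is not an element of x
quantile(x, 0.5, type = 1)  # 20
quantile(x, 0.5, type = 3)  # 20

# Type 3 returns observed values, so every result can be matched to a position in x
all(quantile(x, c(0.25, 0.5, 0.75), type = 3) %in% x)  # TRUE
```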

I chose type 3, and a bit of copy & paste from quantile leads to:

which.quantile <- function (x, probs, na.rm = FALSE){
  if (! na.rm & any (is.na (x)))
    return (rep (NA_integer_, length (probs)))

  o <- order (x)
  n <- sum (! is.na (x))
  o <- o [seq_len (n)]

  nppm <- n * probs - 0.5
  j <- floor(nppm)
  h <- ifelse((nppm == j) & ((j%%2L) == 0L), 0, 1)
  j <- j + h

  j [j == 0] <- 1
  o[j]
}

A little test:

> x <-c (2.34, 5.83, NA, 9.34, 8.53, 6.42, NA, 8.07, NA, 0.77)
> probs <- c (0, .23, .5, .6, 1)
> which.quantile (x, probs, na.rm = TRUE)
[1] 10  1  6  6  4
> x [which.quantile (x, probs, na.rm = TRUE)] == quantile (x, probs, na.rm = TRUE, type = 3)

  0%  23%  50%  60% 100% 
TRUE TRUE TRUE TRUE TRUE 

Here's your example:

> dat [which.quantile (dat$V4, c (0, .5, 1)),]
  V1         V2          V3 V4
7  7  0.4874291 -0.01619026  1
2  2  0.1836433  0.38984324 13
1  1 -0.6264538  1.51178117 17

Upvotes: 16

A5C1D2H2I1M1N2O1R2T1

Reputation: 193527

I've written a more comprehensive function that serves my needs:

row.extractor = function(data, extract.by, what) {
# data = your data.frame
# extract.by = the variable that you are extracting by, either
#              as its index number or by name
# what = either "min", "max", "median", or "all", with quotes
  # Convert a column name to its index; a numeric index is used as-is
  if (!is.numeric(extract.by)) {
    extract.by = which(colnames(data) %in% extract.by)
  }
  which.median = function(data, extract.by) {
    a = data[, extract.by]
    if (length(a) %% 2 != 0) {
      which(a == median(a))
    } else if (length(a) %% 2 == 0) {
      b = sort(a)[c(length(a)/2, length(a)/2+1)]
      c(max(which(a == b[1])), min(which(a == b[2])))
    }
  }
  X1 = data[which(data[extract.by] == min(data[extract.by])), ] 
  X2 = data[which(data[extract.by] == max(data[extract.by])), ]
  X3 = data[which.median(data, extract.by), ]
  if (what == "min") {
    X1
  } else if (what == "max") {
    X2
  } else if (what == "median") {
    X3
  } else if (what == "all") {
    rbind(X1, X3, X2)
  }
}

Some example usage:

> row.extractor(dat, "V4", "max")
  V1         V2       V3 V4
1  1 -0.6264538 1.511781 17
> row.extractor(dat, 4, "min")
  V1        V2          V3 V4
7  7 0.4874291 -0.01619026  1
> row.extractor(dat, "V4", "all")
   V1         V2          V3 V4
7   7  0.4874291 -0.01619026  1
2   2  0.1836433  0.38984324 13
10 10 -0.3053884  0.59390132 14
4   1 -0.6264538  1.51178117 17

Upvotes: 2

Sacha Epskamp

Reputation: 47562

I think just:

which(dat$V4 == median(dat$V4))

But be careful there since the median takes the mean of two numbers if there isn't a single middle number. E.g. median(1:4) gives 2.5 which doesn't match any of the elements.
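A short illustration of this caveat, using made-up vectors named even and odd:

```r
even <- c(4, 1, 3, 2)
median(even)                 # 2.5
which(even == median(even))  # integer(0): no element equals 2.5

odd <- c(4, 1, 3, 2, 5)
which(odd == median(odd))    # 3
```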

Edit

Here is a function that gives you either the index of the median element or, when the median is the mean of the two middle values, the index of the first element closest to that mean. This is similar to how which.min() returns only the first element equal to the minimum:

whichmedian <- function(x) which.min(abs(x - median(x)))

For example:

> whichmedian(1:4)
[1] 2

Upvotes: 9
