Reputation: 1365

What is the most efficient way to return ranks of a vector within levels of a factor, as a vector having the same order/length as the original vector?

One further requirement: the resulting vector must be in the same order as the original.

I have a very basic function that converts a vector to percentiles, and it works just the way I want:

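ptile <- function(x) {
  # scale ranks to [0, 1] using the number of non-missing values
  p <- (rank(x) - 1) / (length(which(!is.na(x))) - 1)
  # anything above 1 can only come from a missing value, so blank it out
  p[p > 1] <- NA
  p
}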

data <- c(1, 2, 3, 100, 200, 300)

For example, ptile(data) generates:

[1] 0.0 0.2 0.4 0.6 0.8 1.0
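
To illustrate how the function treats missing values, here is a small sketch on a made-up vector (with_na and res_na are hypothetical names, not part of the real data). The missing value comes back as NA, and the remaining values are scaled over the three non-missing observations:

```r
# ptile as defined in the question
ptile <- function(x) {
  p <- (rank(x) - 1) / (length(which(!is.na(x))) - 1)
  p[p > 1] <- NA
  p
}

# Made-up input with one missing value
with_na <- c(10, NA, 30, 20)
res_na <- ptile(with_na)
print(res_na)  # 0.0 NA 1.0 0.5
```

Note that a group with only one non-missing value would divide by zero, so that case needs separate handling if it can occur.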

What I'd really like to be able to do is use this same function (ptile) and have it work within levels of a factor. So suppose I have a "factor" f as follows:

f <- as.factor(c("a", "a", "b", "a", "b", "b"))

I'd like to be able to transform "data" into a vector that tells me, for each observation, what its corresponding percentile is relative to other observations within its same level, like this:

0.0 0.5 0.0 1.0 0.5 1.0

As a shot in the dark, I tried:

tapply(data,f,ptile)

and see that it does, in fact, do the ranking/percentiling, but it returns the results grouped by level, so I can't tell which values correspond to which positions in the original vector:

> f
[1] a a b a b b
Levels: a b
> tapply(data,f,ptile)
$a
[1] 0.0 0.5 1.0

$b
[1] 0.0 0.5 1.0

This matters because the actual data I'm working with can have 1000-3000 observations (stocks) and 10-55 levels (things like sectors, groupings by other stock characteristics, etc), and I need the resulting vector to be in the same order as the way it went in, in order for everything to line up, row by row in my matrix.

Is there some "apply" variant that would do what I am seeking? Or a few quick lines that would do the trick? I've written this functionality in C# and F# with a lot more lines of code, but had figured that in R there must be some really direct, elegant solution. Is there?

Thanks in advance!

Upvotes: 4

Answers (3)

J. Win.

Reputation: 6771

When you call tapply() with INDEX=f, the data are split by f and the results come back as a list in order of the levels of f. To reverse that process, unlist the result and index it with order(order(f)), which is the inverse of the permutation order(f):

unlist(tapply(data, f, ptile))[order(order(f))]

Your example data vector happened to be in numeric order already, but this works even if the data are in random order:

ptile <- function(x) {
  p <- (rank(x) - 1)/(length(which(!is.na(x))) - 1)
  p[p > 1] <- NA
  # concatenated with the original data to make the match clear
  paste(round(p * 100, 2), x, sep="% ") 
}

data <- sample(c(1:5, (1:5)*100), 10)
f <- sample(letters[1:2], 10, replace=TRUE)
result <- unlist(tapply(data, f, ptile))[order(order(f))]

data.frame(result, data, f)
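
An alternative that avoids the double order() is split() followed by unsplit(), which puts each group's results back in their original positions. A minimal sketch using the question's example data and its original numeric ptile; the names by_level, result_unsplit and result_order are only illustrative, and this is not a benchmarked recommendation:

```r
# Numeric ptile from the question
ptile <- function(x) {
  p <- (rank(x) - 1) / (length(which(!is.na(x))) - 1)
  p[p > 1] <- NA
  p
}

data <- c(1, 2, 3, 100, 200, 300)
f <- as.factor(c("a", "a", "b", "a", "b", "b"))

# Apply ptile within each level, then reassemble in the original order
by_level <- lapply(split(data, f), ptile)
result_unsplit <- unsplit(by_level, f)

# Same values as the order(order(f)) approach (names dropped for comparison)
result_order <- unname(unlist(tapply(data, f, ptile))[order(order(f))])

print(result_unsplit)  # 0.0 0.5 0.0 1.0 0.5 1.0
```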

Upvotes: 2

IRTFM

Reputation: 263362

The ave function is very useful here. The main gotcha is that you must always name the function argument with FUN=, because ave treats any unnamed argument after the first as a grouping variable:

> dt <- data.frame(data, f)
> dt$rank <- with(dt, ave(data, list(f), FUN=rank))
> dt
  data f rank
1    1 a    1
2    2 a    2
3    3 b    1
4  100 a    3
5  200 b    2
6  300 b    3

Edit: I thought I was answering the question in the title but have been asked to include the code that uses the "ptile" function:

> dt$ptile <-  with(dt, ave(data, list(f), FUN=ptile))
> dt
  data f rank ptile
1    1 a    1   0.0
2    2 a    2   0.5
3    3 b    1   0.0
4  100 a    3   1.0
5  200 b    2   0.5
6  300 b    3   1.0
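
If you only need the vector and not a data frame, the same call should work directly on the vectors from the question. A small sketch (pct is just an illustrative name):

```r
# ptile, data and f as given in the question
ptile <- function(x) {
  p <- (rank(x) - 1) / (length(which(!is.na(x))) - 1)
  p[p > 1] <- NA
  p
}

data <- c(1, 2, 3, 100, 200, 300)
f <- as.factor(c("a", "a", "b", "a", "b", "b"))

# FUN must be passed by name; the result has the same length and order as data
pct <- ave(data, f, FUN = ptile)
print(pct)  # 0.0 0.5 0.0 1.0 0.5 1.0
```

Because ave writes each group's results back into the positions they came from, no reordering step is needed afterwards.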

Upvotes: 11

Prasad Chalasani

Reputation: 20282

For what you are trying to do, I would first put the stock, sector, and value as columns in a data frame. E.g., with some made-up data:

> set.seed(1)
> df <- data.frame(stock = 1:10,
+                  sector = sample(letters[1:2], 10, replace = TRUE),
+                  val = sample(1:10))
> df
   stock sector val
1      1      a   3
2      2      a   2
3      3      b   6
4      4      b  10
5      5      a   5
6      6      b   7
7      7      b   8
8      8      b   4
9      9      b   1
10    10      a   9

Then you can use the ddply function from the plyr package to compute the "sectorwise" percentile (there are other ways, but I find plyr very useful and would recommend you take a look at it):

library(plyr)
df.p <- ddply(df, .(sector), transform, pct = ptile(val))

Now, of course, the rows of df.p will be arranged by the factor (i.e. sector), but it's a simple matter to restore them to the original order, e.g.:

> df.p[ order(df.p$stock),]
   stock sector val       pct
1      1      a   3 0.3333333
2      2      a   2 0.0000000
5      3      b   6 0.4000000
6      4      b  10 1.0000000
3      5      a   5 0.6666667
7      6      b   7 0.6000000
8      7      b   8 0.8000000
9      8      b   4 0.2000000
10     9      b   1 0.0000000
4     10      a   9 1.0000000

In particular, the pct column is the vector you were seeking in your original question.

Upvotes: 2
