Reputation: 1365
With one more requirement - that the resulting vector is in the same order as the original.
I have a very basic function that percentiles a vector, and works just the way I want it to do:
ptile <- function(x) {
p <- (rank(x) - 1)/(length(which(!is.na(x))) - 1)
p[p > 1] <- NA
p
}
data <- c(1, 2, 3, 100, 200, 300)
For example, ptile(data)
generates:
[1] 0.0 0.2 0.4 0.6 0.8 1.0
What I'd really like to be able to do is use this same function (ptile) and have it work within levels of a factor. So suppose I have a "factor" f as follows:
f <- as.factor(c("a", "a", "b", "a", "b", "b"))
I'd like to be able to transform "data" into a vector that tells me, for each observation, what its corresponding percentile is relative to other observations within its same level, like this:
0.0 0.5 0.0 1.0 0.5 1.0
As a shot in the dark, I tried:
tapply(data,f,ptile)
and see that it does, in fact, succeed at doing the ranking/percentiling, but does so in a way that I have no idea which observations match up to their indices in the original vector:
[1] a a b a b b
Levels: a b
> tapply(data,f,ptile)
$a
[1] 0.0 0.5 1.0
$b
[1] 0.0 0.5 1.0
This matters because the actual data I'm working with can have 1000-3000 observations (stocks) and 10-55 levels (things like sectors, groupings by other stock characteristics, etc), and I need the resulting vector to be in the same order as the way it went in, in order for everything to line up, row by row in my matrix.
Is there some "apply" variant that would do what I am seeking? Or a few quick lines that would do the trick? I've written this functionality in C# and F# with a lot more lines of code, but had figured that in R there must be some really direct, elegant solution. Is there?
Thanks in advance!
Upvotes: 4
Views: 1009
Reputation: 6771
When you call tapply()
with INDEX=f
you get a result that is subsetted by f
and broken into a list in order of the levels of f
. To reverse that process, simply:
unlist(tapply(data, f, ptile))[order(order(f))]
Your example data
vector happened to be in numeric order already, but this works even if the data is in random order...
ptile <- function(x) {
p <- (rank(x) - 1)/(length(which(!is.na(x))) - 1)
p[p > 1] <- NA
# concatenated with the original data to make the match clear
paste(round(p * 100, 2), x, sep="% ")
}
data <- sample(c(1:5, (1:5)*100), 10)
f <- sample(letters[1:2], 10, replace=TRUE)
result <- unlist(tapply(data, f, ptile))[order(order(f))]
data.frame(result, data, f)
Upvotes: 2
Reputation: 263362
The ave function is very useful. The main gotcha is to remember that you always need to name the function with FUN=
:
dt <- data.frame(data, f)
dt$rank <- with(dt, ave(data, list(f), FUN=rank))
dt
#---
data f rank
1 1 a 1
2 2 a 2
3 3 b 1
4 100 a 3
5 200 b 2
6 300 b 3
Edit: I thought I was answering the question in the title but have been asked to include the code that uses the "ptile" function:
> dt$ptile <- with(dt, ave(data, list(f), FUN=ptile))
> dt
data f rank ptile
1 1 a 1 0.0
2 2 a 2 0.5
3 3 b 1 0.0
4 100 a 3 1.0
5 200 b 2 0.5
6 300 b 3 1.0
Upvotes: 11
Reputation: 20282
For what you are trying to do, I would first put the stock, sector, value as columns in a data-frame. E.g with some made-up data:
> set.seed(1)
> df <- data.frame(stock = 1:10,
+ sector = sample(letters[1:2], 10, repl = TRUE),
+ val = sample(1:10))
> df
stock sector val
1 1 a 3
2 2 a 2
3 3 b 6
4 4 b 10
5 5 a 5
6 6 b 7
7 7 b 8
8 8 b 4
9 9 b 1
10 10 a 9
Then you can use the ddply
function from the plyr
package to do the "sectorwise" percentile (there are other ways, but I find the plyr
to be very useful, and would recommend you take a look at it):
require(plyr)
df.p <- ddply(df, .(sector), transform, pct = ptile(val))
Now of course in df.p
the rows will be arranged by the factor (i.e. sector
), and it's a simple matter to restore it to the original order, e.g.:
> df.p[ order(df.p$stock),]
stock sector val pct
1 1 a 3 0.3333333
2 2 a 2 0.0000000
5 3 b 6 0.4000000
6 4 b 10 1.0000000
3 5 a 5 0.6666667
7 6 b 7 0.6000000
8 7 b 8 0.8000000
9 8 b 4 0.2000000
10 9 b 1 0.0000000
4 10 a 9 1.0000000
In particular the pct
column is the final vector you are seeking in your original question.
Upvotes: 2