Reputation: 167
I have a simple dataframe with two vectors, 'Speed' and 'ID' which looks like this:
mydata
ID Speed
1 1 6.031847
2 1 7.050654
3 1 7.769475
4 1 8.838968
5 1 9.956571
6 1 11.146864
7 1 11.967616
8 1 13.078422
9 1 14.214301
10 1 14.974159
11 2 16.048627
12 2 17.070484
.. . .........
I want to make a subset of the dataframe of the top 20% of the speed values with:
subset0.20<-subset(mydata, Speed > quantile(Speed, prob = 1 - 20/100, na.rm=T))
But I don't want it for the entire dataset since that would return to me unequal amounts of values for each ID.
Therefore, the top 20% values have to be calculated for each ID, and the results for each ID should then be combined together into a new dataframe. This dataframe would then consist of 8 rows (which is 20% of my original dataset which has 40 rows)
So I did some nailbiting and pulled out some hair and tried 'for loops' like:
for(i in 1:length(ID)){
subset0.80<-subset(mydata[i], GForce > quantile(Speed, prob = 1 - 20/100, na.rm=T))
}
And things like apply:
apply(mydata$Speed, 1 ,function(x) (subset(x > quantile(Speed, prob = 1 - 20/100, na.rm=T))))
But I'm just to inexperienced with R to get it to work.. Anyone that can help me out and explain to me all the things that I'm doing wrong here?
dput(mydata)
structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4), Speed = c(6.03184705225504, 7.05065401832249,
7.76947483668907, 8.83896842017956, 9.95657139135043, 11.1468640558647,
11.9676155772803, 13.0784218506988, 14.2143010441769, 14.9741594881612,
16.0486271520862, 17.0704843261466, 17.9324808839116, 19.1169673939822,
20.0528330256269, 20.9320440815571, 22.0379467007031, 22.962355355126,
24.0764744246649, 25.1182530133201, 26.0456043859692, 26.9528777031822,
27.9414746553538, 29.129640434174, 29.9443040639644, 30.9226103003052,
31.9932286699133, 32.9925644101585, 33.9930708538141, 35.0124438238874,
35.9215486087666, 36.9015465999988, 38.1044534443389, 39.0368063088987,
40.272189714015, 40.8993100278334, 41.9790311160737, 43.1027190745506,
43.8575622361406, 45.0499599122387)), .Names = c("ID", "Speed"
), row.names = c(NA, -40L), class = "data.frame")
Upvotes: 2
Views: 312
Reputation: 8753
Try this
library(plyr)
> ddply(mydata, .(ID), function(x) subset(x, Speed > quantile(Speed, prob = 1 - 20/100, na.rm=T)))
ID Speed
1 1 14.21430
2 1 14.97416
3 2 24.07647
4 2 25.11825
5 3 33.99307
6 3 35.01244
7 4 43.85756
8 4 45.04996
@SimonO101
try to use melt
from reshape2
res <- aggregate( Speed ~ ID , data = mydata , function(x) { y <- rev( seq( length(x) , by = -1 ,length.out =(length(x) / 5) ) ) ; cbind(x[y[1]],x[y[2]]) })
> melt(res, id.vars="ID")
ID variable value
1 1 Speed 14.21430
2 2 Speed 24.07647
3 3 Speed 33.99307
4 4 Speed 43.85756
5 1 Speed 14.97416
6 2 Speed 25.11825
7 3 Speed 35.01244
8 4 Speed 45.04996
Probably, after that, you may want to remove the second column :-).
Upvotes: 2
Reputation: 193517
There are several ways to do this (so many that it might be confusing). Here's one using ave
:
GetMe <- with(mydata,
ave(Speed, ID, FUN = function(x)
x > quantile(x, prob = 1 - 20/100, na.rm = TRUE)))
mydata[GetMe == 1, ]
# ID Speed
# 9 1 14.21430
# 10 1 14.97416
# 19 2 24.07647
# 20 2 25.11825
# 29 3 33.99307
# 30 3 35.01244
# 39 4 43.85756
# 40 4 45.04996
The data.table
package is also nice for this:
library(data.table)
DT <- data.frame(mydata)
DT[, list(Speed = Speed[Speed > quantile(Speed, prob = 1 - 20/100, na.rm = TRUE)]), by = "ID"]
# ID Speed
# 1: 1 14.21430
# 2: 1 14.97416
# 3: 2 24.07647
# 4: 2 25.11825
# 5: 3 33.99307
# 6: 3 35.01244
# 7: 4 43.85756
# 8: 4 45.04996
Upvotes: 4
Reputation: 59970
What the hell. Here is a one line solution using aggregate
from base
R. You get a slightly different data format one row per ID, and each speed value in it's own column:
aggregate( Speed ~ ID , data = mydata , function(x) { y <- rev( seq( length(x) , by = -1 ,length.out =(length(x) / 5) ) ) ; cbind(x[y[1]],x[y[2]]) } )
ID Speed.1 Speed.2
#1 1 14.21430 14.97416
#2 2 24.07647 25.11825
#3 3 33.99307 35.01244
#4 4 43.85756 45.04996
Upvotes: 1
Reputation: 17189
One way is to split
your data by ID
and then use lapply
on resultant list of dataframes to find your top 20% quantiles. Finally, use do.call
and rbind
to bind result together.
result <- do.call(rbind, lapply(split(mydata, mydata$ID), function(X) {
subset(X, Speed > quantile(Speed, prob = 1 - 20/100, na.rm = T))
}))
result
## ID Speed
## 1.9 1 14.21430
## 1.10 1 14.97416
## 2.19 2 24.07647
## 2.20 2 25.11825
## 3.29 3 33.99307
## 3.30 3 35.01244
## 4.39 4 43.85756
## 4.40 4 45.04996
Upvotes: 2
Reputation: 121568
Using by
, you can call the subset
function for each ID. Then you can bind
the result using do.call
which transform the list
to a data.frame
.
You can do something like this :
do.call(rbind,by(mydata,mydata$ID,FUN= function(x)
subset(x, Speed > quantile(Speed, prob = 1 - 20/100, na.rm=T))))
ID Speed
1.9 1 14.21430
1.10 1 14.97416
2.19 2 24.07647
2.20 2 25.11825
3.29 3 33.99307
3.30 3 35.01244
4.39 4 43.85756
4.40 4 45.04996
Upvotes: 4