Reputation: 29
I have a data frame that has the following columns:
> str(wbr)
'data.frame': 214 obs. of 12 variables:
$ countrycode : Factor w/ 214 levels "ABW","ADO","AFG",..: 1 2 3 4 5 6 7 8 9 10 ...
$ countryname : Factor w/ 214 levels "Afghanistan",..: 10 5 1 6 2 202 8 9 4 7 ...
$ gdp_per_capita : num 19913 35628 415 2738 4091 ...
$ literacy_female : num 96.7 NA 17.6 59.1 95.7 ...
$ literacy_male : num 96.9 NA 45.4 82.5 98 ...
$ literacy_all : num 96.8 NA 31.7 70.6 96.8 ...
$ infant_mortality : num NA 2.2 70.2 101.6 13.3 ...
$ illiteracy_female: num 3.28 NA 82.39 40.85 4.31 ...
$ illiteracy_mele : num 3.06 NA 54.58 17.53 1.99 ...
$ illiteracy_male : num 3.06 NA 54.58 17.53 1.99 ...
$ illiteracy_all : num 3.18 NA 68.26 29.42 3.15 ...
I would like to find the mean of illiteracy_all
from the top 50 countries with the highest GDP.
Before you answer me I need to inform you that the data frame has NA
values meaning that if I want to find the mean I would have to write:
mean(wbr$illiteracy_all, na.rm=TRUE)
For a reproducible example, let's take:
data.df <- data.frame(x=101:120, y=rep(c(1,2,3,NA), times=5))
So how could I average the y
values for e.g. the top 5 values of x
?
> data.df
x y
1 101 1
2 102 2
3 103 3
4 104 NA
5 105 1
6 106 2
7 107 3
8 108 NA
9 109 1
10 110 2
11 111 3
12 112 NA
13 113 1
14 114 2
15 115 3
16 116 NA
17 117 1
18 118 2
19 119 3
20 120 NA
Upvotes: 1
Views: 212
Reputation: 1954
Any of the following would work:
mean(data.df[rank(-data.df$x)<=5,"y"], na.rm=TRUE)
mean(data.df$y[rank(-data.df$x)<=5], na.rm=TRUE)
with(data.df, mean(y[rank(-x)<=5], na.rm=TRUE))
To unpack why this works, note first that rank
gives ranks in a different order to what you might expect, 1
being the rank of the smallest number not the largest:
> rank(data.df$x)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
We can get round that by negating the input:
> rank(-data.df$x)
[1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
So now ranks 1
to 5
are the "top 5". If we want a vector of TRUE
and FALSE
to indicate the position of the top 5 we can use:
> rank(-data.df$x)<=5
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14] FALSE FALSE TRUE TRUE TRUE TRUE TRUE
(In reality you might find you have some ties in your data set. This is only going to cause issues if the 50th position is tied. You might want to have a look at the ties.method
argument for rank
to see how you want to handle this.)
So let's grab the values of y
in those positions:
> data.df[rank(-data.df$x)<=5,"y"]
[1] NA 1 2 3 NA
Or you could use:
> data.df$y[rank(-data.df$x)<=5]
[1] NA 1 2 3 NA
So now we know what to input into mean
:
> mean(data.df[rank(-data.df$x)<=5,"y"], na.rm=TRUE)
[1] 2
Or:
> mean(data.df$y[rank(-data.df$x)<=5], na.rm=TRUE)
[1] 2
Or if you don't like repeating the name of the data frame, use with
:
> with(data.df, mean(y[rank(-x)<=5], na.rm=TRUE))
[1] 2
Upvotes: 2