Panos Makris
Panos Makris

Reputation: 29

Find mean from subset of one column based on ranking in the top 50 of another column

I have a data frame that has the following columns:

> str(wbr)
'data.frame':   214 obs. of  12 variables:
 $ countrycode      : Factor w/ 214 levels "ABW","ADO","AFG",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ countryname      : Factor w/ 214 levels "Afghanistan",..: 10 5 1 6 2 202 8 9 4 7 ...
 $ gdp_per_capita   : num  19913 35628 415 2738 4091 ...
     $ literacy_female  : num  96.7 NA 17.6 59.1 95.7 ...
 $ literacy_male    : num  96.9 NA 45.4 82.5 98 ...
     $ literacy_all     : num  96.8 NA 31.7 70.6 96.8 ...
 $ infant_mortality : num  NA 2.2 70.2 101.6 13.3 ...
     $ illiteracy_female: num  3.28 NA 82.39 40.85 4.31 ...
 $ illiteracy_mele  : num  3.06 NA 54.58 17.53 1.99 ...
     $ illiteracy_male  : num  3.06 NA 54.58 17.53 1.99 ...
 $ illiteracy_all   : num  3.18 NA 68.26 29.42 3.15 ...

I would like to find the mean of illiteracy_all from the top 50 countries with the highest GDP.

Before you answer me I need to inform you that the data frame has NA values meaning that if I want to find the mean I would have to write:

mean(wbr$illiteracy_all, na.rm=TRUE)

For a reproducible example, let's take:

data.df <- data.frame(x=101:120, y=rep(c(1,2,3,NA), times=5))

So how could I average the y values for e.g. the top 5 values of x?

> data.df
     x  y
1  101  1
2  102  2
3  103  3
4  104 NA
5  105  1
6  106  2
7  107  3
8  108 NA
9  109  1
10 110  2
11 111  3
12 112 NA
13 113  1
14 114  2
15 115  3
16 116 NA
17 117  1
18 118  2
19 119  3
20 120 NA

Upvotes: 1

Views: 212

Answers (1)

Silverfish
Silverfish

Reputation: 1954

Any of the following would work:

mean(data.df[rank(-data.df$x)<=5,"y"], na.rm=TRUE)

mean(data.df$y[rank(-data.df$x)<=5], na.rm=TRUE)

with(data.df, mean(y[rank(-x)<=5], na.rm=TRUE))

To unpack why this works, note first that rank gives ranks in a different order to what you might expect, 1 being the rank of the smallest number not the largest:

> rank(data.df$x)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

We can get round that by negating the input:

> rank(-data.df$x)
 [1] 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1

So now ranks 1 to 5 are the "top 5". If we want a vector of TRUE and FALSE to indicate the position of the top 5 we can use:

> rank(-data.df$x)<=5
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

(In reality you might find you have some ties in your data set. This is only going to cause issues if the 50th position is tied. You might want to have a look at the ties.method argument for rank to see how you want to handle this.)

So let's grab the values of y in those positions:

> data.df[rank(-data.df$x)<=5,"y"]
[1] NA  1  2  3 NA

Or you could use:

> data.df$y[rank(-data.df$x)<=5]
[1] NA  1  2  3 NA

So now we know what to input into mean:

> mean(data.df[rank(-data.df$x)<=5,"y"], na.rm=TRUE)
[1] 2

Or:

> mean(data.df$y[rank(-data.df$x)<=5], na.rm=TRUE)
[1] 2

Or if you don't like repeating the name of the data frame, use with:

> with(data.df, mean(y[rank(-x)<=5], na.rm=TRUE))
[1] 2

Upvotes: 2

Related Questions