Isabella

Reputation: 27

Using apply function to calculate the mean of a column

After splitting a data frame into multiple data frames by country, I wanted to calculate the mean of the centralization column in each country's data frame. tapply() worked, but when I tried sapply(), every country got the mean value of the first country. I cannot figure out why, and since the exercise asks me to use sapply, I would like to know how I can fix my code. Any pointers would be appreciated (it might be a dumb mistake).

INPUT/my code:

library(magrittr)  # provides the %>% pipe used below
strikes.df = read.csv("http://www.stat.cmu.edu/~pfreeman/strikes.csv")
strikes.by.country=split(strikes.df,strikes.df$country)
my.fun=function(x=strikes.by.country){
  l=length(strikes.by.country)
  for (i in 1:l){
    return(strikes.by.country[[i]]$centralization %>% mean)
  }
}

sapply(strikes.by.country, my.fun)

#using tapply()
tapply(strikes.df[,"centralization",],INDEX=strikes.df[,"country",],FUN=mean)

OUTPUT

#using sapply()
  Australia     Austria     Belgium      Canada     Denmark 
   0.374644    0.374644    0.374644    0.374644    0.374644 
    Finland      France     Germany     Ireland       Italy 
   0.374644    0.374644    0.374644    0.374644    0.374644 
      Japan Netherlands New.Zealand      Norway      Sweden 
   0.374644    0.374644    0.374644    0.374644    0.374644 
Switzerland          UK         USA 
   0.374644    0.374644    0.374644

#using tapply()
  Australia     Austria     Belgium      Canada     Denmark 
0.374644022 0.997670495 0.749485177 0.002244134 0.499958552 
    Finland      France     Germany     Ireland       Italy 
0.750374065 0.002729909 0.249968231 0.499711882 0.250699502 
      Japan Netherlands New.Zealand      Norway      Sweden 
0.124675342 0.749602699 0.375940378 0.875341821 0.875253817 
Switzerland          UK         USA 
0.499990005 0.375946785 0.002390639 

I was instructed to use sapply after using split; that's why the only thing that occurred to me was using a for loop.

Upvotes: 2

Views: 1330

Answers (2)

jay.sf

Reputation: 72593

It is better to use sapply on the unique country names; there is actually no need to split anything.

sapply(unique(strikes.df$country), function(x) 
  mean(strikes.df[strikes.df$country == x, "centralization"]))
#   Australia     Austria     Belgium      Canada     Denmark     Finland      France 
# 0.374644022 0.997670495 0.749485177 0.002244134 0.499958552 0.750374065 0.002729909 
#     Germany     Ireland       Italy       Japan Netherlands New.Zealand      Norway 
# 0.249968231 0.499711882 0.250699502 0.124675342 0.749602699 0.375940378 0.875341821 
#      Sweden Switzerland          UK         USA 
# 0.875253817 0.499990005 0.375946785 0.002390639 

But if you need to use split as well, you can do:

sapply(split(strikes.df$centralization, strikes.df$country), mean)
#   Australia     Austria     Belgium      Canada     Denmark     Finland      France 
# 0.374644022 0.997670495 0.749485177 0.002244134 0.499958552 0.750374065 0.002729909 
#     Germany     Ireland       Italy       Japan Netherlands New.Zealand      Norway 
# 0.249968231 0.499711882 0.250699502 0.124675342 0.749602699 0.375940378 0.875341821 
#      Sweden Switzerland          UK         USA 
# 0.875253817 0.499990005 0.375946785 0.002390639 

Or write it in two lines:

s <- split(strikes.df$centralization, strikes.df$country)
sapply(s, mean)

Edit

If splitting the whole data frame is required, do

s <- split(strikes.df, strikes.df$country)
sapply(s, function(x) mean(x[, "centralization"]))

or

foo <- function(x) mean(x[, "centralization"])
sapply(s, foo)
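
As for why the original function gave every country the same mean: it never uses its argument x, and return() exits the loop on the first iteration, so every call yields the mean of the first element of the list. A minimal sketch of this, assuming a small made-up data frame in place of strikes.df (toy.df, broken.fun and fixed.fun are hypothetical names, not from the question):

```r
# Hypothetical toy data standing in for strikes.df (values are made up)
toy.df <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  centralization = c(0.1, 0.3, 0.5, 0.7, 0.9, 1.0)
)
toy.by.country <- split(toy.df, toy.df$country)

# Pattern from the question: x is ignored, and return() leaves the
# function during the first loop iteration (i = 1)
broken.fun <- function(x = toy.by.country) {
  for (i in seq_along(toy.by.country)) {
    return(mean(toy.by.country[[i]]$centralization))
  }
}
broken <- sapply(toy.by.country, broken.fun)  # same value for A, B and C

# Fix: sapply passes one country's data frame as x, so just use x
fixed.fun <- function(x) mean(x$centralization)
fixed <- sapply(toy.by.country, fixed.fun)    # one mean per country
```

sapply already does the looping over the list, so the function only needs to handle a single data frame; no for loop is required inside it.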

Upvotes: 1

stefan

Reputation: 123783

Using the gapminder::gapminder dataset as example data, this can be achieved like so:

The example code computes mean life expectancy (lifeExp) by continent.

# sapply: simplifies. returns a vector
sapply(split(gapminder::gapminder, gapminder::gapminder$continent), function(x) mean(x$lifeExp, na.rm = TRUE))
#>   Africa Americas     Asia   Europe  Oceania 
#> 48.86533 64.65874 60.06490 71.90369 74.32621
# lapply: returns a list
lapply(split(gapminder::gapminder, gapminder::gapminder$continent), function(x) mean(x$lifeExp, na.rm = TRUE))
#> $Africa
#> [1] 48.86533
#> 
#> $Americas
#> [1] 64.65874
#> 
#> $Asia
#> [1] 60.0649
#> 
#> $Europe
#> [1] 71.90369
#> 
#> $Oceania
#> [1] 74.32621
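
For a version that runs without installing gapminder, the same split-then-sapply pattern can be sketched on R's built-in iris dataset (a substitution for illustration, not part of the example above), computing the mean Sepal.Length per Species and comparing it with tapply:

```r
# iris ships with base R (datasets package), so no extra package is needed
by.species <- split(iris, iris$Species)

# sapply calls the function once per list element; x is one species' data frame
means.sapply <- sapply(by.species, function(x) mean(x$Sepal.Length, na.rm = TRUE))
# setosa 5.006, versicolor 5.936, virginica 6.588

# tapply computes the same group means without an explicit split
means.tapply <- tapply(iris$Sepal.Length, iris$Species, mean)

# Both approaches should agree on the values
same <- isTRUE(all.equal(as.vector(means.tapply), unname(means.sapply)))
```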

Upvotes: 0
