Reputation: 27
After splitting a data frame into multiple data frames by country, I wanted to calculate the mean of the centralization column in each country's data frame. tapply() worked, but when I tried sapply(), every country got the mean value of the first country. I cannot figure out why, and since the exercise asks me to use sapply(), I would like to know how I can improve my code. Any pointers would be appreciated (it might be a dumb mistake).
INPUT/my code:
strikes.df = read.csv("http://www.stat.cmu.edu/~pfreeman/strikes.csv")
strikes.by.country=split(strikes.df,strikes.df$country)
my.fun=function(x=strikes.by.country){
l=length(strikes.by.country)
for (i in 1:l){
return(strikes.by.country[[i]]$centralization %>% mean)
}
}
sapply(strikes.by.country, my.fun)
#using tapply()
tapply(strikes.df[,"centralization",],INDEX=strikes.df[,"country",],FUN=mean)
OUTPUT
Australia Austria Belgium Canada Denmark
0.374644 0.374644 0.374644 0.374644 0.374644
Finland France Germany Ireland Italy
0.374644 0.374644 0.374644 0.374644 0.374644
Japan Netherlands New.Zealand Norway Sweden
0.374644 0.374644 0.374644 0.374644 0.374644
Switzerland UK USA
0.374644 0.374644 0.374644
Australia Austria Belgium Canada Denmark
0.374644022 0.997670495 0.749485177 0.002244134 0.499958552
Finland France Germany Ireland Italy
0.750374065 0.002729909 0.249968231 0.499711882 0.250699502
Japan Netherlands New.Zealand Norway Sweden
0.124675342 0.749602699 0.375940378 0.875341821 0.875253817
Switzerland UK USA
0.499990005 0.375946785 0.002390639
I was instructed to use sapply after using split; that's why the only thing that occurred to me was a for loop.
Upvotes: 2
Views: 1330
Reputation: 72593
Better to use sapply on the unique country names. Actually, there's no need to split anything.
sapply(unique(strikes.df$country), function(x)
mean(strikes.df[strikes.df$country == x, "centralization"]))
# Australia Austria Belgium Canada Denmark Finland France
# 0.374644022 0.997670495 0.749485177 0.002244134 0.499958552 0.750374065 0.002729909
# Germany Ireland Italy Japan Netherlands New.Zealand Norway
# 0.249968231 0.499711882 0.250699502 0.124675342 0.749602699 0.375940378 0.875341821
# Sweden Switzerland UK USA
# 0.875253817 0.499990005 0.375946785 0.002390639
But if you need to use split as well, you may do:
sapply(split(strikes.df$centralization, strikes.df$country), mean)
# Australia Austria Belgium Canada Denmark Finland France
# 0.374644022 0.997670495 0.749485177 0.002244134 0.499958552 0.750374065 0.002729909
# Germany Ireland Italy Japan Netherlands New.Zealand Norway
# 0.249968231 0.499711882 0.250699502 0.124675342 0.749602699 0.375940378 0.875341821
# Sweden Switzerland UK USA
# 0.875253817 0.499990005 0.375946785 0.002390639
Or write it in two lines:
s <- split(strikes.df$centralization, strikes.df$country)
sapply(s, mean)
If splitting the whole data frame is required, do
s <- split(strikes.df, strikes.df$country)
sapply(s, function(x) mean(x[, "centralization"]))
or
foo <- function(x) mean(x[, "centralization"])
sapply(s, foo)
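As for why the original attempt repeats the first country's mean: sapply already hands each piece of the split list to the function as x, but my.fun never uses x. It indexes the full list instead, and return() leaves the function on the first pass of the loop, so every call yields the mean of element 1. A minimal sketch of that behaviour, using a small made-up data frame (toy.df and its values are my own stand-ins, not the strikes data):

```r
# Toy data standing in for strikes.df (values are made up for illustration)
toy.df <- data.frame(
  country = c("A", "A", "B", "B", "C"),
  centralization = c(0.2, 0.4, 1.0, 0.8, 0.5)
)
toy.by.country <- split(toy.df, toy.df$country)

# Same shape as the function in the question: x is never used, and
# return() exits the function on the first iteration (i = 1).
broken.fun <- function(x = toy.by.country) {
  for (i in seq_along(toy.by.country)) {
    return(mean(toy.by.country[[i]]$centralization))
  }
}

# Fixed version: work on the piece that sapply() passes in as x.
fixed.fun <- function(x) mean(x$centralization)

broken <- sapply(toy.by.country, broken.fun)  # every entry is the mean for "A"
fixed  <- sapply(toy.by.country, fixed.fun)   # one mean per country
```

So no loop is needed inside the function; sapply does the looping over the list for you.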
Upvotes: 1
Reputation: 123783
Using the gapminder::gapminder dataset as example data, this can be achieved like so. The example code computes mean life expectancy (lifeExp) by continent.
# sapply: simplifies. returns a vector
sapply(split(gapminder::gapminder, gapminder::gapminder$continent), function(x) mean(x$lifeExp, na.rm = TRUE))
#> Africa Americas Asia Europe Oceania
#> 48.86533 64.65874 60.06490 71.90369 74.32621
# lapply: returns a list
lapply(split(gapminder::gapminder, gapminder::gapminder$continent), function(x) mean(x$lifeExp, na.rm = TRUE))
#> $Africa
#> [1] 48.86533
#>
#> $Americas
#> [1] 64.65874
#>
#> $Asia
#> [1] 60.0649
#>
#> $Europe
#> [1] 71.90369
#>
#> $Oceania
#> [1] 74.32621
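To see the relationship between the two without installing gapminder, here is a small sketch on the built-in mtcars data (my choice of stand-in, not part of the example above). It suggests that, for a function returning a single number, sapply gives the same values as lapply, just flattened into a named vector:

```r
# mtcars ships with R (datasets package); used here only as a stand-in.
by.cyl <- split(mtcars$mpg, mtcars$cyl)

list.res <- lapply(by.cyl, mean)  # list with one element per group
vec.res  <- sapply(by.cyl, mean)  # same values, simplified to a named vector

# Flattening the lapply() result should match the sapply() result
same <- isTRUE(all.equal(unlist(list.res), vec.res))
```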
Upvotes: 0