uncool

Reputation: 2703

aggregate function - NA is still output even though na.action is set to na.omit

I'm quite new to R, so I'm a little confused right now.

I'm using the aggregate function on a list to compute the mean. It generates all values correctly except for those columns containing NAs.

The data in question is below:

   AreaSize constructionYear
6        30             1980
7        30               NA
13       30             1969

The aggregated table looks like this:

  SegGroup listPrice   rent livingArea constructionYear soldPrice
1       20   2383750 1353.0   25.87500           1927.5   2813750
2       30   2161667 1856.0   36.50000               NA   2428333
3       40   3548333 2381.0   44.16667               NA   3858333
4       50   2261667 3601.0   56.66667               NA   2616667
5       60   2395000 3320.0   63.00000           1954.0   2700000
6       70   3837500 3274.0   72.50000           1946.5   3942500
7       80   3335000 4759.5   82.75000           1986.0   3400000
8       90   2720000 4017.5   92.50000           1950.0   3475000

This happens even though na.action = na.omit is set by default in the aggregate function. What is wrong?

Code

listPrice  <- aggregate(lOriginal[-length(lOriginal)], list(lOriginal$AreaSize), FUN = mean)

Upvotes: 1

Views: 2341

Answers (1)

Stibu

Reputation: 15927

According to the help on aggregate, na.action = na.omit is the default in the method for formula objects, but not in the method for data frames. Which method is used is determined by the class of the first argument in your function call.
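To make the dispatch concrete, here is a small sketch that checks the class of each kind of first argument and inspects whether the corresponding method has an na.action argument at all. The variable names (df_arg, f_arg, df_method, f_method) are my own, chosen for illustration; getS3method() comes from the utils package.

```r
# Sketch: the class of the first argument decides which aggregate() method runs.
df_arg <- mtcars[, c("mpg", "disp")]   # a data frame
f_arg  <- cbind(mpg, disp) ~ cyl       # a formula

class_df <- class(df_arg)   # "data.frame" -> aggregate.data.frame()
class_f  <- class(f_arg)    # "formula"    -> aggregate.formula()

# Look up the two S3 methods and check for an na.action argument.
df_method <- getS3method("aggregate", "data.frame")
f_method  <- getS3method("aggregate", "formula")

has_na_action_df <- "na.action" %in% names(formals(df_method))
has_na_action_f  <- "na.action" %in% names(formals(f_method))

print(c(data.frame = has_na_action_df, formula = has_na_action_f))
```

Only the formula method should report an na.action argument, which is consistent with the help page.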

I don't have your data, so I'll show what this means using the mtcars data set, which is included in R. It needs one modification first, because mtcars contains no NA:

mtcars[5, "disp"] <- NA

Now, I aggregate the columns disp and mpg by cyl. First, I use the data frame method:

aggregate(mtcars[, c("mpg", "disp")], list(cyl = mtcars$cyl), mean)
##   cyl      mpg     disp
## 1   4 26.66364 105.1364
## 2   6 19.74286 183.3143
## 3   8 15.10000       NA

Clearly, the NA values are not omitted. However, mean() comes with an argument na.rm, which I can set to TRUE as follows:

aggregate(mtcars[, c("mpg", "disp")], list(cyl = mtcars$cyl), mean, na.rm = TRUE)
##   cyl      mpg     disp
## 1   4 26.66364 105.1364
## 2   6 19.74286 183.3143
## 3   8 15.10000 352.5692

(The reason that this works can also be found in the documentation of aggregate(). The function has an argument ... (as many R functions do), which matches every argument you pass that does not match one of its named arguments. These are then passed on to the function that you use for aggregation. Since aggregate() has no argument called na.rm, this argument is sent on to mean().)
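The forwarding through ... can be illustrated with a tiny wrapper. This is only a sketch of the mechanism, not how aggregate() is implemented internally; apply_fun is a hypothetical helper that does not exist in R.

```r
# Hypothetical helper (not part of R): anything that does not match
# x or FUN ends up in ... and is handed on to FUN unchanged.
apply_fun <- function(x, FUN, ...) {
  FUN(x, ...)
}

x <- c(1, NA, 3)

res_default <- apply_fun(x, mean)                # NA: mean() sees the NA
res_narm    <- apply_fun(x, mean, na.rm = TRUE)  # 2: na.rm is forwarded to mean()

print(res_default)
print(res_narm)
```

Because apply_fun itself has no na.rm argument, na.rm = TRUE falls into ... and reaches mean(), just as it does in the aggregate() call.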

Now back to what caused your confusion: you can also use aggregate by giving a formula as the first argument (which I find more readable and thus preferable). The call then reads as follows:

aggregate(cbind(mpg, disp) ~ cyl, data = mtcars, mean)
##   cyl      mpg     disp
## 1   4 26.66364 105.1364
## 2   6 19.74286 183.3143
## 3   8 14.82308 352.5692

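As you can see, in this form the NA values are indeed omitted by default. Note that na.omit drops the entire row containing the NA before aggregating, which is why the mean of mpg for cyl = 8 changes from 15.10 to 14.82 as well.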
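If you want the formula interface but prefer that NAs are removed per column rather than per row, one possible approach is to set na.action = na.pass and let mean() handle the missing values through na.rm = TRUE. The sketch below works on a copy of mtcars (the name cars is my own) so that it runs on its own.

```r
# Work on a copy so the built-in data set stays untouched.
cars <- mtcars
cars[5, "disp"] <- NA

# na.pass keeps rows with NA in the model frame;
# na.rm = TRUE is forwarded through ... to mean().
res <- aggregate(cbind(mpg, disp) ~ cyl, data = cars, FUN = mean,
                 na.rm = TRUE, na.action = na.pass)
print(res)
```

With this call the row with the missing disp value still contributes its mpg value, so the mpg means should match those of the data frame method with na.rm = TRUE.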

Upvotes: 2
