Reputation: 2703
I'm quite new to R. So I'm a little confused right now.
I'm using the aggregate function on a list. It generates all values correctly except for those columns containing NAs. I'm computing the mean.
The data in question is below
   AreaSize constructionYear
6        30             1980
7        30               NA
13       30             1969
Now the aggregate table gives this.
  SegGroup listPrice   rent livingArea constructionYear soldPrice
1       20   2383750 1353.0   25.87500           1927.5   2813750
2       30   2161667 1856.0   36.50000               NA   2428333
3       40   3548333 2381.0   44.16667               NA   3858333
4       50   2261667 3601.0   56.66667               NA   2616667
5       60   2395000 3320.0   63.00000           1954.0   2700000
6       70   3837500 3274.0   72.50000           1946.5   3942500
7       80   3335000 4759.5   82.75000           1986.0   3400000
8       90   2720000 4017.5   92.50000           1950.0   3475000
This happens even though na.action = na.omit is set by default within the aggregate function. What is wrong?
Code
listPrice <- aggregate(lOriginal[-length(lOriginal)], list(lOriginal$AreaSize), FUN = mean)
Upvotes: 1
Views: 2341
Reputation: 15927
According to the help on aggregate, na.action = na.omit is the default in the method for formula objects, but not in the method for data frames. Which method is used is determined by the class of the first argument in your function call.
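As a quick check of this dispatch rule, the following minimal sketch looks up both methods and tests whether each one has an na.action argument. The variable names (df_method, formula_method, has_na_action) are only illustrative, and the exact argument lists may differ slightly between R versions.

```r
# Look up the two S3 methods that aggregate() can dispatch to.
df_method <- getS3method("aggregate", "data.frame")
formula_method <- getS3method("aggregate", "formula")

# The class of the first argument decides which method runs.
df_class <- class(mtcars[, c("mpg", "disp")])   # a data frame
formula_class <- class(cbind(mpg, disp) ~ cyl)  # a formula (not evaluated here)

# Check which of the two methods has an na.action argument at all.
has_na_action <- c(
  data.frame = "na.action" %in% names(formals(df_method)),
  formula    = "na.action" %in% names(formals(formula_method))
)
print(has_na_action)
```

If only the formula entry is TRUE, then na.action cannot be what handles the NA values when a data frame is passed as the first argument.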
I don't have your data, so I show you what this means using the data set mtcars, which is included in R, with a modification (which is needed, because mtcars contains no NA):
mtcars[5, "disp"] <- NA
Now, I aggregate the columns disp and mpg by cyl. First, I use the data frame method:
aggregate(mtcars[, c("mpg", "disp")], list(cyl = mtcars$cyl), mean)
##   cyl      mpg     disp
## 1   4 26.66364 105.1364
## 2   6 19.74286 183.3143
## 3   8 15.10000       NA
Clearly, the NA values are not omitted. However, mean() comes with an argument na.rm, which I can set to TRUE as follows:
aggregate(mtcars[, c("mpg","disp")], list(cyl = mtcars$cyl), mean, na.rm = TRUE)
##   cyl      mpg     disp
## 1   4 26.66364 105.1364
## 2   6 19.74286 183.3143
## 3   8 15.10000 352.5692
(The reason that this works can also be found in the documentation of aggregate(). The function has an argument ... (as many R functions do), which matches all the expressions that you pass to the function that do not match one of its named arguments. These expressions are then passed on to the function that you use for aggregation. Since aggregate() has no argument called na.rm, this argument is sent on to mean().)
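The forwarding through ... can be illustrated with a small sketch. The helper apply_fun below is hypothetical (it is not part of R or of aggregate()); it only mimics how unmatched arguments are handed on to the aggregation function. The values are the constructionYear entries from the question.

```r
# Hypothetical helper, not part of R: applies FUN to x and forwards
# every argument it does not know itself through `...`.
apply_fun <- function(x, FUN, ...) {
  FUN(x, ...)
}

years <- c(1980, NA, 1969)

# No extra arguments: mean() sees the NA and returns NA.
without_na_rm <- apply_fun(years, mean)

# na.rm is not an argument of apply_fun, so it ends up in mean().
with_na_rm <- apply_fun(years, mean, na.rm = TRUE)

print(without_na_rm)
print(with_na_rm)
```

Applied to the call in the question, this would mean adding na.rm = TRUE after FUN = mean.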
Now back to what caused your confusion: you can also use aggregate by giving a formula as the first argument (which I find more readable and thus preferable). The call then reads as follows:
aggregate(cbind(mpg, disp) ~ cyl, data = mtcars, mean)
##   cyl      mpg     disp
## 1   4 26.66364 105.1364
## 2   6 19.74286 183.3143
## 3   8 14.82308 352.5692
As you can see, in this form the NA values are indeed omitted by default.
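Note that the mean of mpg for cyl = 8 differs between the two methods (15.10000 versus 14.82308). This is because na.omit drops the whole row that contains an NA, so the mpg value of that row is lost as well. If that is not wanted, one possible variant (a sketch, not the only way) is to combine na.action = na.pass with na.rm = TRUE, so that NA values are removed per column inside mean(). The names cars, by_row and by_col are only illustrative.

```r
# Work on a copy so the built-in mtcars stays untouched.
cars <- mtcars
cars[5, "disp"] <- NA

# Default formula method: na.omit drops the entire row with the NA,
# so the mpg value of that row is excluded too.
by_row <- aggregate(cbind(mpg, disp) ~ cyl, data = cars, FUN = mean)

# Keep all rows (na.pass) and let mean() remove NAs column by column.
by_col <- aggregate(cbind(mpg, disp) ~ cyl, data = cars, FUN = mean,
                    na.rm = TRUE, na.action = na.pass)

print(by_row)
print(by_col)
```

With this variant, the mpg column should match the result of the data frame method with na.rm = TRUE.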
Upvotes: 2