Maximilian

Reputation: 4229

Use aggregate and keep NA rows

I have not spent this much time on a single task in years.

There are several hints here on SO, for example here or here, so one is tempted to call this a duplicate (I would even say so myself). But even with those examples and multiple attempts, I was not able to accomplish what I need.

Here is a full example:

x <- data.frame(idx=1:30, group=rep(letters[1:10],3), val=runif(30))

x$val[sample.int(nrow(x), 5)] <- NA; x
spl <- with(x, split(x, group))

lpp <- lapply(spl, 
          function(x) { r <- with(x, 
              data.frame(x, val_g=cut(val, seq(0,1,0.1), labels = FALSE),
                            val_g_lab=cut(val, seq(0,1,0.1)))); r })


rd <- do.call(rbind, lpp); ord <- rd[order(rd$idx, decreasing = FALSE), ]; ord

aggregate(val ~ group + val_g_lab, ord, 
          FUN=function(x) c(mean(x, na.rm = FALSE), 
                            sum(!is.na(x))), na.action=na.pass)

The desired output: I would like the NAs to be included after aggregate() as well. Currently, aggregate() drops the rows with NAs.

      idx group        val val_g val_g_lab  
 a.1    1     a 0.53789249     6 (0.5,0.6]          
 b.2    2     b 0.01729695     1   (0,0.1]          
 c.3    3     c 0.62295270     7 (0.6,0.7]          
 d.4    4     d 0.60291892     7 (0.6,0.7]
 e.5    5     e 0.76422909     8 (0.7,0.8]
 f.6    6     f 0.87433547     9 (0.8,0.9]
 g.7    7     g         NA    NA      <NA>          
 h.8    8     h 0.50590159     6 (0.5,0.6]
 i.9    9     i 0.89084068     9 (0.8,0.9]
 ... (continues; the full data set is the ord object)

Upvotes: 1

Views: 705

Answers (1)

Anders Ellern Bilgrau

Reputation: 10223

A work-around is simply not to use NA for the value groups. First, initialise your data as above:

x <- data.frame(idx=1:30, group=rep(letters[1:10],3), val=runif(30))

x$val[sample.int(nrow(x), 5)] <- NA; x
spl <- with(x, split(x, group))

lpp <- lapply(spl, 
      function(x) { r <- with(x, 
          data.frame(x, val_g=cut(val, seq(0,1,0.1), labels = FALSE),
                        val_g_lab=cut(val, seq(0,1,0.1)))); r })


rd <- do.call(rbind, lpp)
ord <- rd[order(rd$idx, decreasing = FALSE), ]

Simply convert to character and convert the NAs to some arbitrary string literal:

# Convert to character
ord$val_g_lab <- as.character(ord$val_g_lab)
# Convert NAs
ord$val_g_lab[is.na(ord$val_g_lab)] <- "Unknown"

aggregate(val ~ group + val_g_lab, ord, 
          FUN=function(x) c(mean(x, na.rm = FALSE), sum(!is.na(x))), 
          na.action=na.pass)
#   group val_g_lab      val.1      val.2
#1      e   (0,0.1] 0.02292533 1.00000000
#2      g (0.1,0.2] 0.16078353 1.00000000
#3      g (0.2,0.3] 0.20550228 1.00000000
#4      i (0.2,0.3] 0.26986665 1.00000000
#5      j (0.2,0.3] 0.23176149 1.00000000
#6      d (0.3,0.4] 0.39196441 1.00000000
#7      e (0.3,0.4] 0.39303518 1.00000000
#8      g (0.3,0.4] 0.35646994 1.00000000
#9      i (0.3,0.4] 0.35724889 1.00000000
#10     a (0.4,0.5] 0.48809261 1.00000000
#11     b (0.4,0.5] 0.40993166 1.00000000
#12     d (0.4,0.5] 0.42394859 1.00000000
# ...
#20     b   (0.9,1] 0.99562918 1.00000000
#21     c   (0.9,1] 0.92018049 1.00000000
#22     f   (0.9,1] 0.91379088 1.00000000
#23     h   (0.9,1] 0.93445802 1.00000000
#24     j   (0.9,1] 0.93325098 1.00000000
#25     b   Unknown         NA 0.00000000
#26     c   Unknown         NA 0.00000000
#27     d   Unknown         NA 0.00000000
#28     i   Unknown         NA 0.00000000
#29     j   Unknown         NA 0.00000000
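The conversion to character before replacing the NAs is not optional: cut() returns a factor, and assigning a value that is not one of its levels gives NA again (with a warning). A minimal sketch of that behaviour on a three-element vector; the names f, direct and chr are made up for this illustration:

```r
# cut() returns a factor whose levels are the interval labels
f <- cut(c(0.05, NA, 0.95), seq(0, 1, 0.1))

# Assigning a string that is not an existing level does not work:
# R warns about an invalid factor level and stores NA again
direct <- f
suppressWarnings(direct[is.na(direct)] <- "Unknown")
anyNA(direct)  # still contains an NA

# Converting to character first makes the replacement stick
chr <- as.character(f)
chr[is.na(chr)] <- "Unknown"
chr
```

Here chr ends up as "(0,0.1]", "Unknown", "(0.9,1]", while direct still holds an NA in the second position.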

Does this do what you want?

Edit:

To answer your question in the comments: note that NaN and NA are not quite the same (see here). Both are also very different from "NaN" and "NA", which are string literals (i.e. just text). In any case, NAs are special 'atomic' elements that functions nearly always handle as a special case, so you have to check the documentation to see how a particular function treats them. In this case, the na.action argument applies to the values that you aggregate over, not to the 'classes' (the grouping variables) in your formula. The drop=FALSE argument could also be used, but then you get all combinations of the (in this case) two classifications. Redefining the NAs as a string literal works because the new name is treated like any other class.
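The difference between NAs in the aggregated values and NAs in the grouping variables can be seen on a small, deterministic data set. This is only a sketch: the data frame d and the columns lab and lab_chr are made up for illustration and are not part of the original example.

```r
# Four rows; one value (and therefore its cut() label) is NA
d <- data.frame(group = c("a", "a", "b", "b"),
                val   = c(0.15, 0.12, NA, 0.95))
d$lab <- cut(d$val, seq(0, 1, 0.1))

# Count the non-missing values per group/label combination
count_obs <- function(v) sum(!is.na(v))

# na.pass keeps the NA in val, but the row is still lost
# because its grouping label (lab) is NA
res_na <- aggregate(val ~ group + lab, d, FUN = count_obs,
                    na.action = na.pass)

# With the NA label recoded as a string, the row forms its own group
d$lab_chr <- as.character(d$lab)
d$lab_chr[is.na(d$lab_chr)] <- "Unknown"
res_chr <- aggregate(val ~ group + lab_chr, d, FUN = count_obs,
                     na.action = na.pass)

nrow(res_na)   # groups without the NA label
nrow(res_chr)  # one more group: b / "Unknown"
```

res_na has two rows and res_chr has three; the extra row is group b with the label "Unknown" and a count of 0.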

Upvotes: 1
