crazysantaclaus

Reputation: 623

R, aggregate function apparently causes loss of column levels?

I just ran into a strange situation in RGui. I used the same script as always to get my data.frame into the right shape for ggplot2. My data looks like this:

  time days treatment nucleic_acid habitat parallel disturbance       variable cellcounts       value
1    1    2   control          dna   water        1        none Proteobacteria      batch 0.000000000
2    2   22   control          dna   water        1        none Proteobacteria      batch 0.003586543
3    1    2   treated          dna   water        1        none Proteobacteria      batch 0.000000000
4    2   22   treated          dna biofilm        1        none Proteobacteria         NA 0.000000000

'data.frame':   185648 obs. of  10 variables:
 $ time        : int  5 5 5 5 5 5 6 6 6 6 ...
 $ days        : int  62 62 62 62 62 62 69 69 69 69 ...
 $ treatment   : Factor w/ 2 levels "control","treated": 2 2 2 1 1 1 2 2 2 1 ...
 $ parallel    : int  1 2 3 1 2 3 1 2 3 1 ...
 $ nucleic_acid: Factor w/ 2 levels "cdna","dna": 1 1 1 1 1 1 1 1 1 1 ...
 $ habitat     : Factor w/ 2 levels "biofilm","water": 1 1 1 1 1 1 1 1 1 1 ...
 $ cellcounts  : Factor w/ 4 levels "batch","high",..: NA NA NA NA NA NA NA NA NA NA ...
 $ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
 $ variable    : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value       : num  0 0 0 0 0 0 0 0 0 0 ...

I wanted aggregate() to calculate the mean value across my (up to 3) parallels:

df_mean <- aggregate(value ~ time + days + treatment + nucleic_acid + habitat + disturbance + variable + cellcounts, data = df, FUN = mean)

Afterwards, the level "biofilm" in the column "habitat" is lost:

df_mean<-droplevels(df_mean)

str(df_mean)
'data.frame':   44608 obs. of  9 variables:
 $ time        : int  1 2 1 2 1 2 1 2 1 2 ...
 $ days        : int  2 22 2 22 2 22 2 22 2 22 ...
 $ treatment   : Factor w/ 2 levels "control","treated": 1 1 2 2 1 1 2 2 1 1 ...
 $ nucleic_acid: Factor w/ 2 levels "cdna","dna": 2 2 2 2 2 2 2 2 2 2 ...
 $ habitat     : Factor w/ 1 level "water": 1 1 1 1 1 1 1 1 1 1 ...
 $ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
 $ variable    : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ cellcounts  : Factor w/ 4 levels "batch","high",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value       : num  0 0.00359 0 0 0 ...

I spent a lot of time looking into this (I only just realised that many other issues I had all seem to be aggregate-related). When I removed the column "cellcounts", it worked. Interestingly, for "biofilm" the columns "cellcounts" and "habitat" always carry the same, and therefore redundant, information ("biofilm" always goes with NA). Is this the cause? It always worked before, so I can't get my head around it. Was there a change to stats::aggregate or something like that? Do you have an explanation? I'm using R 3.4.0; the other packages in use are reshape, reshape2 and ggplot2.

Thanks a lot, a confused crazysantaclaus

Upvotes: 0

Views: 311

Answers (1)

moodymudskipper

Reputation: 47300

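The issue comes from the NA values: the formula method of aggregate() drops every row that contains an NA in any variable of the formula (its default is na.action = na.omit), and all of your "biofilm" rows have NA in "cellcounts". Maybe your file was loaded differently in the past and these were stored as strings instead of NA values? Here's a way to solve it by recoding them to an "NA" string: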

# Add a literal "NA" level, then recode the missing values to it
levels(df$cellcounts) <- c(levels(df$cellcounts), "NA")
df$cellcounts[is.na(df$cellcounts)] <- "NA"
# na.rm = TRUE is passed on to mean()
df_mean <- aggregate(value ~ time + days + treatment + nucleic_acid + habitat + disturbance + variable + cellcounts,
                     data = df, FUN = mean, na.rm = TRUE)
df_mean <- droplevels(df_mean)
str(df_mean)

'data.frame':   4 obs. of  9 variables:
 $ time        : int  1 2 1 2
 $ days        : int  2 22 2 22
 $ treatment   : Factor w/ 2 levels "control","treated": 1 1 2 2
 $ nucleic_acid: Factor w/ 1 level "dna": 1 1 1 1
 $ habitat     : Factor w/ 2 levels "biofilm","water": 2 2 2 1
 $ disturbance : Factor w/ 1 level "none": 1 1 1 1
 $ variable    : Factor w/ 1 level "Proteobacteria": 1 1 1 1
 $ cellcounts  : Factor w/ 2 levels "batch","NA": 1 1 1 2
 $ value       : num  0 0.00359 0 0
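
To see why the rows vanish in the first place, here is a minimal sketch on a made-up data frame (`toy` and its values are purely illustrative, not your data). It mirrors your situation, where every "biofilm" row has NA in cellcounts, and compares what na.omit() keeps with what aggregate() returns:

```r
# Hypothetical toy data: "biofilm" rows always have NA in cellcounts
toy <- data.frame(
  habitat    = factor(c("water", "water", "biofilm", "biofilm")),
  cellcounts = factor(c("batch", "batch", NA, NA)),
  value      = c(1, 3, 5, 7)
)

# The formula method of aggregate() defaults to na.action = na.omit,
# so incomplete rows are removed before any grouping happens
kept <- na.omit(toy[, c("value", "habitat", "cellcounts")])
nrow(kept)  # only the two "water" rows are left

# Same effect inside aggregate(): no "biofilm" group in the result
agg <- aggregate(value ~ habitat + cellcounts, data = toy, FUN = mean)
agg
```

As far as I know, switching to na.action = na.pass does not bring the group back, because ?aggregate states that rows with missing values in any of the by variables are omitted from the result anyway. That is why the NA has to be recoded rather than merely passed through.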

data

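# stringsAsFactors = TRUE reproduces the factor columns from the question
# (it was the default before R 4.0.0); read.table() reads the bare NA as a missing value
df <- read.table(text = "
  time days treatment nucleic_acid habitat parallel disturbance       variable cellcounts       value
1    1    2   control          dna   water        1        none Proteobacteria      batch 0.000000000
2    2   22   control          dna   water        1        none Proteobacteria      batch 0.003586543
3    1    2   treated          dna   water        1        none Proteobacteria      batch 0.000000000
4    2   22   treated          dna biofilm        1        none Proteobacteria         NA 0.000000000
", header = TRUE, stringsAsFactors = TRUE)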

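If you would rather not introduce a literal "NA" string, base R's addNA() may be an alternative: it turns the missing values into a genuine factor level, so na.omit() no longer treats those rows as incomplete. This is only a sketch on a made-up data frame (`toy` is hypothetical), and ?factor warns that factors with NA as a level have some anomalies, so it is probably best applied just before aggregating:

```r
# Hypothetical toy data: "biofilm" rows always have NA in cellcounts
toy <- data.frame(
  habitat    = factor(c("water", "water", "biofilm", "biofilm")),
  cellcounts = factor(c("batch", "batch", NA, NA)),
  value      = c(1, 3, 5, 7)
)

# addNA() appends NA as an extra level; is.na() is then FALSE for these entries
toy$cellcounts <- addNA(toy$cellcounts)

# Both groups survive: one row for water/batch, one for biofilm/<NA>
agg <- aggregate(value ~ habitat + cellcounts, data = toy, FUN = mean)
agg
```

The "NA" string approach above is arguably easier to read in plots and tables; addNA() just avoids adding a made-up label.
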
Upvotes: 1
