Wrong count when summarizing dataframe with ddply

Question

In my dataset I have a continuous variable mag from which I derived a categorial variable mag.cat which has four categories: 1 for the values between 0 and 1 for mag, 2 for the values between 1 and 2 for mag, 3 for the values between 2 and 3 for mag & 4 for the values above 3 for mag. A subset of the data looks like this:

   location mag depth mag.cat
1     Assen 1.8   1.0       2
2 Hooghalen 2.5   1.5       3
3 Purmerend 0.7   1.2       1
4     Emmen 2.2   3.0       3
5 Geelbroek 3.6   3.0       4
6   Eleveld 2.7   3.0       3

I want to summarize this dataframe into a new one with just one row for each location.

I did this with:

df.new <- ddply(df, .(location), summarise, n.tot = as.numeric(length(location)), 
                gem.mag = round(mean(mag),1), 
                n.1 = as.numeric(length(location[mag == 1])),
                n.2 = as.numeric(length(location[mag == 2])),
                n.3 = as.numeric(length(location[mag == 3])),
                n.4 = as.numeric(length(location[mag == 4]))
                )

The n.1, n.2, n.3 & n.4 variables are supposed to contain the counts for each category for each location. The sum of these variables should logically be equal to n.tot, but they aren't. This can be seen in the head of the new dataframe:

      location  n.tot gem.mag n.1 n.2 n.3 n.4
1   Achterdiep      5     1.1   2   0   0   0
2      Alkmaar      4     3.2   0   0   1   0
3       Altena      1     1.3   0   0   0   0
4 Amelanderwad      2     1.8   0   0   0   0
5         Amen      6     1.1   0   0   0   0
6     Amerbrug      1     0.9   0   0   0   0

I expected something like:

      location  n.tot gem.mag n.1 n.2 n.3 n.4
1   Achterdiep      5     1.1   2   2   0   1
2      Alkmaar      4     3.2   0   3   1   0
3       Altena      1     1.3   0   1   0   0
4 Amelanderwad      2     1.8   0   1   1   0
5         Amen      6     1.1   3   2   0   1
6     Amerbrug      1     0.9   1   0   0   0

What am I doing wrong?

IRTFM · Accepted Answer

Why not:

res <- xtabs( ~ location + mag.cat, data=df)
res

If you want totals as a column, then cbind(tot.n= rowSums(res), res).

Means of mag: with(df, tapply(mag, location, mean))

Everything:

 cbind( gem.mag= with(df, tapply(mag, location, mean)),
        tot.n= rowSums(res), 
        res)

I suppose the answer is not complete without a plyr version:

 require(plyr)
 df.new <- ddply(df, .(location), summarise, n.tot = as.numeric(length(location)), 
                gem.mag = round(mean(mag),1), 
                n.1 = sum(mag.cat==1),
                n.2 = sum(mag.cat==2),
                n.3 = sum(mag.cat==3),
                n.4 = sum(mag.cat==4)
                )
> df.new
   location n.tot gem.mag n.1 n.2 n.3 n.4
1     Assen     1     1.8   0   1   0   0
2   Eleveld     1     2.7   0   0   1   0
3     Emmen     1     2.2   0   0   1   0
4 Geelbroek     1     3.6   0   0   0   1
5 Hooghalen     1     2.5   0   0   1   0
6 Purmerend     1     0.7   1   0   0   0

Wrong count when summarizing dataframe with ddply

Answers (1)

Related Questions