iskandarblue
iskandarblue

Reputation: 7526

unable to group by using dplyr

I have the following table where I would like to take the mean of tf_diff and group by cnrd_marsh and timefact_hour. However, when I use dplyr I get this error and am not sure why. I also tried to remove NA values and converting the first column from factor to character but that did not solve the problem. Any ideas? I have included a sample of the data below.

library(dplyr)
library(tidyr)

output <- foo %>%
  group_by(cnrd_marsh, timefact_hour) %>%
  summarise(tf_diff = mean(tf_diff)) %>%
  spread(cnrd_marsh, tf_diff)

Error: Key column 'cnrd_marsh' does not exist in input.



> foo
   cnrd_marsh timefact_hour tf_diff
1        <NA>            NA      NA
2          БЧ            14      19
3          БЧ            14       5
4        <NA>            NA      NA
5          БЧ            13       1
6          БЧ            13      18
7          БЧ            13      31
8          БЧ            13       2
9        <NA>            NA      NA
10         БЧ            12       5
11         БЧ            12      10
12         БЧ            12       1
13         БЧ            12      17
14       <NA>            NA      NA
15         БЧ            11       2
16         БЧ            11      18
17         БЧ            11       4
18         БЧ            11       7
19         БЧ            11      16
20         БЧ            11       3
21       <NA>            NA      NA
22         БЧ            10      11
23         БЧ            10       6
24         БЧ            10      10
25       <NA>            NA      NA
26         БЧ             9      17
27         БЧ             9       6
28         БЧ             9      15
29         БЧ             9       4
30         БЧ             9       9
31         БЧ             9       1
32       <NA>            NA      NA
33         БЧ             8      16
34         БЧ             8       8
35         БЧ             8      14
36         БЧ             8       3
37         БЧ             8      11
38         БЧ             8       1
39       <NA>            NA      NA
40         БЧ             7      14
41         БЧ             7       6
42         БЧ             7      14
43         БЧ             7       5
44       <NA>            NA      NA
45         БЧ             6       0
46         БЧ             6       9
47         БЧ             6      10
48         БЧ             6      15
49       <NA>            NA      NA
50       <NA>            NA      NA

the output should look like something like this

> head(output)
  cnrd_marsh timefact_hour           tf_diff
1         БЧ            14 12.00000000000000
2         БЧ            13 13.00000000000000
3         БЧ            12  8.25000000000000
4         БЧ            11  8.33333330000000
5         БЧ            10  9.00000000000000
6         БЧ             9  8.66666666666667

Here is a sample of the data.

> dput(foo)
structure(list(cnrd_marsh = c(NA, "БЧ", "БЧ", NA, "БЧ", 
"БЧ", "БЧ", "БЧ", NA, "БЧ", "БЧ", "БЧ", "БЧ", NA, 
"БЧ", "БЧ", "БЧ", "БЧ", "БЧ", "БЧ", NA, "БЧ", "БЧ", 
"БЧ", NA, "БЧ", "БЧ", "БЧ", "БЧ", "БЧ", "БЧ", NA, 
"БЧ", "БЧ", "БЧ", "БЧ", "БЧ", "БЧ", NA, "БЧ", "БЧ", 
"БЧ", "БЧ", NA, "БЧ", "БЧ", "БЧ", "БЧ", NA, NA), 
    timefact_hour = c(NA, 14L, 14L, NA, 13L, 13L, 13L, 13L, NA, 
    12L, 12L, 12L, 12L, NA, 11L, 11L, 11L, 11L, 11L, 11L, NA, 
    10L, 10L, 10L, NA, 9L, 9L, 9L, 9L, 9L, 9L, NA, 8L, 8L, 8L, 
    8L, 8L, 8L, NA, 7L, 7L, 7L, 7L, NA, 6L, 6L, 6L, 6L, NA, NA
    ), tf_diff = c(NA, 19, 5, NA, 1, 18, 31, 2, NA, 5, 10, 1, 
    17, NA, 2, 18, 4, 7, 16, 3, NA, 11, 6, 10, NA, 17, 6, 15, 
    4, 9, 1, NA, 16, 8, 14, 3, 11, 1, NA, 14, 6, 14, 5, NA, 0, 
    9, 10, 15, NA, NA)), .Names = c("cnrd_marsh", "timefact_hour", 
"tf_diff"), vars = list(timefact_hour), row.names = c(NA, 50L
), class = "data.frame")

Upvotes: 1

Views: 274

Answers (4)

leerssej
leerssej

Reputation: 14958

Just flush the dataset clean of NA's and you should be good to go.

output <- foo %>%
  na.omit %>%
  group_by(cnrd_marsh, timefact_hour) %>%
  summarise(tf_diff = mean(tf_diff)) %>%
  spread(cnrd_marsh, tf_diff)

Upvotes: 0

IRTFM
IRTFM

Reputation: 263301

If you change the value of vars in that dput value to a quoted value you can get your code to produce output:

foo <-
structure(list(cnrd_marsh = c(NA, "БЧ", "БЧ", NA, "БЧ", 
"БЧ", "БЧ", "БЧ", NA, "БЧ", "БЧ", "БЧ", "БЧ", NA, 
"БЧ", "БЧ", "БЧ", "БЧ", "БЧ", "БЧ", NA, "БЧ", "БЧ", 
"БЧ", NA, "БЧ", "БЧ", "БЧ", "БЧ", "БЧ", "БЧ", NA, 
"БЧ", "БЧ", "БЧ", "БЧ", "БЧ", "БЧ", NA, "БЧ", "БЧ", 
"БЧ", "БЧ", NA, "БЧ", "БЧ", "БЧ", "БЧ", NA, NA), 
    timefact_hour = c(NA, 14L, 14L, NA, 13L, 13L, 13L, 13L, NA, 
    12L, 12L, 12L, 12L, NA, 11L, 11L, 11L, 11L, 11L, 11L, NA, 
    10L, 10L, 10L, NA, 9L, 9L, 9L, 9L, 9L, 9L, NA, 8L, 8L, 8L, 
    8L, 8L, 8L, NA, 7L, 7L, 7L, 7L, NA, 6L, 6L, 6L, 6L, NA, NA
    ), tf_diff = c(NA, 19, 5, NA, 1, 18, 31, 2, NA, 5, 10, 1, 
    17, NA, 2, 18, 4, 7, 16, 3, NA, 11, 6, 10, NA, 17, 6, 15, 
    4, 9, 1, NA, 16, 8, 14, 3, 11, 1, NA, 14, 6, 14, 5, NA, 0, 
    9, 10, 15, NA, NA)), .Names = c("cnrd_marsh", "timefact_hour", 
"tf_diff"), vars = list('timefact_hour'), row.names = c(NA, 50L
), class = "data.frame")


 output <- foo %>%
   group_by(cnrd_marsh, timefact_hour) %>%
   summarise(tf_diff = mean(tf_diff)) %>%
   spread(cnrd_marsh, tf_diff)
 output
#-----------------------
Source: local data frame [10 x 3]

   timefact_hour         БЧ    NA
*          <int>      <dbl> <dbl>
1              6  8.5000000    NA
2              7  9.7500000    NA
3              8  8.8333333    NA
4              9  8.6666667    NA
5             10  9.0000000    NA
6             11  8.3333333    NA
7             12  8.2500000    NA
8             13 13.0000000    NA
9             14 12.0000000    NA
10            NA         NA    NA

Which does illustrate @joran's concern. Below is what I would consider the "base" solution which makes more sense to me:

> with(foo, tapply(tf_diff, list(timefact_hour, cnrd_marsh), mean, na.rm=TRUE))
          БЧ
6   8.500000
7   9.750000
8   8.833333
9   8.666667
10  9.000000
11  8.333333
12  8.250000
13 13.000000
14 12.000000

Upvotes: 2

user4349490
user4349490

Reputation: 153

Your error "Error: Key column 'cnrd_marsh' does not exist in input" is due to the fact that when you use summarise after group by it will just returns you "tf_diff" column. So while calculating spread, its not able to find 'cnrd_marsh' and 'tf_diff' columns. Instead try this. It should serve your purpose:

 foo <- foo[complete.cases(foo),] ### or spread will fail
 output <- foo %>%
              group_by(cnrd_marsh, timefact_hour) %>%
              mutate(tf_diff = mean(tf_diff,na.rm=T)) %>% unique() %>% 
              spread(cnrd_marsh, tf_diff)

Upvotes: 1

iskandarblue
iskandarblue

Reputation: 7526

This code seems to do the trick:

library(plyr)

cdata <- ddply(foo, c("cnrd_marsh", "timefact_hour"), summarise,
               mean = mean(tf_diff)
)
cdata

Upvotes: 0

Related Questions