user3720887
user3720887

Reputation: 739

group_by date column in dplyr

After extensive searching on this issue, I still cannot find the solution. I have a simple data frame with 43 rows and 2 columns. My first column contains two dates. The first date is printed 19 times and the other 24 times. The second column is temperature. I want to find the max and min temperature by date, but my code keeps printing the entire data set's max and min.

Data:

Date <- c(rep(x = "2017-05-18", each= 19), rep(x = "2017-05-19", each= 24))


Temperature_F <- c(35, 35, 42, 49, 57, 63, 64, 67, 70, 71, 72, 71, 72, 70, 66, 61, 57, 54, 50, 49, 45, 44, 44, 42, 40, 39, 47, 53, 61, 67, 69, 
    72, 75, 76, 77, 76, 77, 75, 71, 66, 62, 58, 54)

NWS_temps1 <- data.frame(Date, Temperature_F)

Here is my dplyr code that keeps giving me the max and min for the entire temperature column when I think it should be giving me the max and min temperature by date.

NWS_temps1 <- tbl_df(NWS_temps1)

 NWS_temps1 %>%
  group_by(Date) %>% 
  summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))

The output I get is:

 Tmax Tmin
  77   35

When I am hoping for:

Date        Tmax Tmin
2017-05-18   72   35
2017-05-19   77   39

I don't understand why Date isn't be grouped as it should. I've attempted changing Date to a factor as it is here, character, date object, and even POSIXct, but my result is always the total data frame max and min.

Any help is much appreciated.

Thanks.

Upvotes: 3

Views: 10913

Answers (5)

Stewart Macdonald
Stewart Macdonald

Reputation: 2132

There are summarise functions in both the dplyr and plyr packages. I'm guessing that the order in which the packages were loaded meant that the plyr version of the function was being loaded, which would give you the results you were seeing. You can manually specify which version of the function you want to use by prepending the package name like this: dplyr::summarise(...).

# Specify the plyr version:
> NWS_temps1 %>%
+   group_by(Date) %>% 
+   plyr::summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))
  Tmax Tmin
1   77   35

# Specify the dplyr version:
> NWS_temps1 %>%
+   group_by(Date) %>% 
+   dplyr::summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 3
  Date        Tmax  Tmin
  <fct>      <dbl> <dbl>
1 2017-05-18    72    35
2 2017-05-19    77    39

Edit: I've just noticed that Kim had already posted this as a comment on the original question.

Upvotes: 1

sbha
sbha

Reputation: 10422

I'm able to replicate the original group_by() issue when converting a date/time field represented as number to a date with as.Date() - this might happen when working with a date/time field imported from an Excel file because Excel stores dates as numbers.

library(dplyr)

dt = c(43167.86, 43167.59, 43167.59, 43167.23, 43182.60, 43168.17, 43182) 
df <- data_frame(date = dt)

df %>% 
  mutate(date = as.Date(date, origin = '1899-12-30')) %>% 
  group_by(date) %>% 
  summarize(obs = n())
# A tibble: 6 x 2
  date         obs
  <date>     <int>
1 2018-03-08     1
2 2018-03-08     2
3 2018-03-08     1
4 2018-03-09     1
5 2018-03-23     1
6 2018-03-23     1

That gives mulitple versions of the same dates for '2018-03-08' and '2018-03-23'. One line of '2018-03-08' has two observations because there are two '43167.59' - the same date and time, while there are two other 43167, but both with different times. This appears that it could be a dplyr related issue as table(as.Date(df$date, origin = '1899-12-30')) works as expected.

One option is using lubridate::ymd():

library(lubridate)

df %>% 
  mutate(date = as.Date(date, origin = '1899-12-30')) %>% 
  mutate(date = ymd(date)) %>% 
  group_by(date) %>% 
  summarize(obs = n())
# A tibble: 3 x 2
  date         obs
  <date>     <int>
1 2018-03-08     4
2 2018-03-09     1
3 2018-03-23     2

Another (crude) solution is to convert the date to a character, and then back if you want to keep it as a date:

df %>% 
  mutate(date = as.Date(date, origin = '1899-12-30')) %>% 
  mutate(date = as.Date(as.character(date))) %>% 
  group_by(date) %>% 
  summarize(obs = n())

The best solution might be stepping back a step and setting the column type as a date when importing with readxl::read_excel(). That will import the field as a date/time, but then as.Date() and group_by() will work as expected. Example from the vignette:

library(readxl)

df <- read_excel(readxl_example("type-me.xlsx"), sheet = "date_coercion",
                 col_types = c("date", "text")) 

Upvotes: 0

www
www

Reputation: 39154

The answers provided by others using dplyr should work. However, if for some reasons dplyr is not working. Here is a solution using tapply from base R.

dt <- data.frame(Date = unique(NWS_temps1$Date),
                 Tmax = tapply(NWS_temps1$Temperature_F, NWS_temps1$Date, FUN = max),
                 Tmin = tapply(NWS_temps1$Temperature_F, NWS_temps1$Date, FUN = min)) 

Upvotes: 1

Julia Silge
Julia Silge

Reputation: 11603

Looks like you are using the standard evaluation version group_by_() instead of the NSE version group_by(). Try it without the underscore:

NWS_temps1 %>%
    group_by(Date) %>% 
    summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))

#> # A tibble: 2 x 3
#>         Date  Tmax  Tmin
#>        <chr> <dbl> <dbl>
#> 1 2017-05-18    72    35
#> 2 2017-05-19    77    39

Upvotes: 1

akash87
akash87

Reputation: 3994

NWS_temps1 %>%
group_by(as.character(Date)) %>% 
summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))

Upvotes: 1

Related Questions