Reputation: 739
After extensive searching on this issue, I still cannot find the solution. I have a simple data frame with 43 rows and 2 columns. My first column contains two dates. The first date is printed 19 times and the other 24 times. The second column is temperature. I want to find the max and min temperature by date, but my code keeps printing the entire data set's max and min.
Data:
Date <- c(rep(x = "2017-05-18", each= 19), rep(x = "2017-05-19", each= 24))
Temperature_F <- c(35, 35, 42, 49, 57, 63, 64, 67, 70, 71, 72, 71, 72, 70, 66, 61, 57, 54, 50, 49, 45, 44, 44, 42, 40, 39, 47, 53, 61, 67, 69,
72, 75, 76, 77, 76, 77, 75, 71, 66, 62, 58, 54)
NWS_temps1 <- data.frame(Date, Temperature_F)
Here is my dplyr code that keeps giving me the max and min for the entire temperature column when I think it should be giving me the max and min temperature by date.
NWS_temps1 <- tbl_df(NWS_temps1)
NWS_temps1 %>%
group_by(Date) %>%
summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))
The output I get is:
Tmax Tmin
77 35
When I am hoping for:
Date Tmax Tmin
2017-05-18 72 35
2017-05-19 77 39
I don't understand why Date isn't be grouped as it should. I've attempted changing Date to a factor as it is here, character, date object, and even POSIXct, but my result is always the total data frame max and min.
Any help is much appreciated.
Thanks.
Upvotes: 3
Views: 10913
Reputation: 2132
There are summarise
functions in both the dplyr
and plyr
packages. I'm guessing that the order in which the packages were loaded meant that the plyr
version of the function was being loaded, which would give you the results you were seeing. You can manually specify which version of the function you want to use by prepending the package name like this: dplyr::summarise(...)
.
# Specify the plyr version:
> NWS_temps1 %>%
+ group_by(Date) %>%
+ plyr::summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))
Tmax Tmin
1 77 35
# Specify the dplyr version:
> NWS_temps1 %>%
+ group_by(Date) %>%
+ dplyr::summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 3
Date Tmax Tmin
<fct> <dbl> <dbl>
1 2017-05-18 72 35
2 2017-05-19 77 39
Edit: I've just noticed that Kim had already posted this as a comment on the original question.
Upvotes: 1
Reputation: 10422
I'm able to replicate the original group_by()
issue when converting a date/time field represented as number to a date with as.Date()
- this might happen when working with a date/time field imported from an Excel file because Excel stores dates as numbers.
library(dplyr)
dt = c(43167.86, 43167.59, 43167.59, 43167.23, 43182.60, 43168.17, 43182)
df <- data_frame(date = dt)
df %>%
mutate(date = as.Date(date, origin = '1899-12-30')) %>%
group_by(date) %>%
summarize(obs = n())
# A tibble: 6 x 2
date obs
<date> <int>
1 2018-03-08 1
2 2018-03-08 2
3 2018-03-08 1
4 2018-03-09 1
5 2018-03-23 1
6 2018-03-23 1
That gives mulitple versions of the same dates for '2018-03-08' and '2018-03-23'. One line of '2018-03-08' has two observations because there are two '43167.59' - the same date and time, while there are two other 43167, but both with different times. This appears that it could be a dplyr
related issue as table(as.Date(df$date, origin = '1899-12-30'))
works as expected.
One option is using lubridate::ymd()
:
library(lubridate)
df %>%
mutate(date = as.Date(date, origin = '1899-12-30')) %>%
mutate(date = ymd(date)) %>%
group_by(date) %>%
summarize(obs = n())
# A tibble: 3 x 2
date obs
<date> <int>
1 2018-03-08 4
2 2018-03-09 1
3 2018-03-23 2
Another (crude) solution is to convert the date to a character, and then back if you want to keep it as a date:
df %>%
mutate(date = as.Date(date, origin = '1899-12-30')) %>%
mutate(date = as.Date(as.character(date))) %>%
group_by(date) %>%
summarize(obs = n())
The best solution might be stepping back a step and setting the column type as a date when importing with readxl::read_excel()
. That will import the field as a date/time, but then as.Date()
and group_by()
will work as expected. Example from the vignette:
library(readxl)
df <- read_excel(readxl_example("type-me.xlsx"), sheet = "date_coercion",
col_types = c("date", "text"))
Upvotes: 0
Reputation: 39154
The answers provided by others using dplyr
should work. However, if for some reasons dplyr
is not working. Here is a solution using tapply
from base R.
dt <- data.frame(Date = unique(NWS_temps1$Date),
Tmax = tapply(NWS_temps1$Temperature_F, NWS_temps1$Date, FUN = max),
Tmin = tapply(NWS_temps1$Temperature_F, NWS_temps1$Date, FUN = min))
Upvotes: 1
Reputation: 11603
Looks like you are using the standard evaluation version group_by_()
instead of the NSE version group_by()
. Try it without the underscore:
NWS_temps1 %>%
group_by(Date) %>%
summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))
#> # A tibble: 2 x 3
#> Date Tmax Tmin
#> <chr> <dbl> <dbl>
#> 1 2017-05-18 72 35
#> 2 2017-05-19 77 39
Upvotes: 1
Reputation: 3994
NWS_temps1 %>%
group_by(as.character(Date)) %>%
summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))
Upvotes: 1