wesleysc352
wesleysc352

Reputation: 617

how to average a set of columns and exclude other specific columns in R using the summarise command?

I'm breaking my head here with academic work. I have a data.frame with several numeric columns. I am using the command summarize and group_by in R to perform the average calculations of my data frame.

I tried with the code summarize (across (where (is.numeric), mean), -c(Mes, year_date), but it calculates the average of the entire data.frame and in addition, it creates a new column -c (Mes, year_date)), I would like some numeric columns to be excluded from the media calculation, but continue on the data.frame.

Note that I tried -c(Mes, year_date) to exclude these two columns from the average calculation, but it didn't work.

I tried

library(tidyr)
library(dplyr)
library(lubridate)

sample_station <-c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B','C','C','C','C','C','C','C','C','C','C','A','B','C','A','B','C')
Date_dmy <-c('01/01/2000','08/08/2000','16/03/2001','22/09/2001','01/06/2002','05/01/2002','26/01/2002','16/02/2002','09/03/2002','30/03/2002','20/04/2002','04/01/2000','11/08/2000','19/03/2001','25/09/2001','04/06/2002','08/01/2002','29/01/2002','19/02/2002','12/03/2002','13/09/2001','08/01/2000','15/08/2000','23/03/2001','29/09/2001','08/06/2002','12/01/2002','02/02/2002','23/02/2002','16/03/2002','06/04/2002','01/02/2000','01/02/2000','01/02/2000','02/11/2001','02/11/2001','02/11/2001')
temperature <-c(17,20,24,19,17,19,23,26,19,19,21,15,23,18,22,22,23,18,19,26,21,22,23,27,19,19,21,23,24,25,26,29,30,21,25,24,23)
wind_speed<-c(3.001,6.332,9.321,10.9091,6.38,10.5882,10.5,10.4348,10.3846,10.3448,10.3125,8.35,10.2632,10.2439,10.2273,10.2128,10.2,10.1887,10.1786,12,10.1613,10.1538,10.1471,10.1408,10.1351,10.1299,10.125,2.36,10.1163,10.1124,10.1087,11.2,10.102,10.099,10.0962,10.0935,10.0909)
esp<-c(11.6,11.3,11,10.7,10.4,10.1,9.8,9.5,9.2,8.9,8.6,8.3,8,11.2,10.9,10.6,10.3,10,12.8,12.5,12.2,11.9,11.6,11.3,11,4.36,4.06,3.76,3.46,3.16,2.86,2.56,2.26,1.96,1.66,1.36,23)
volum<-c(300,300,300,300,300,300,300,300,250,250,250,250,250,250,400,400,400,400,400,105,105,105,105,105,105,105,105,105,105,81,81,81,81,81,81,81,81)

df<-data.frame(sample_station, Date_dmy, temperature, wind_speed, esp, volum)%>%
  mutate(Date_dmy = dmy(Date_dmy)) %>%
  mutate(year_date = floor_date(Date_dmy,'year'))%>%
  mutate(Ano=year(Date_dmy))%>%
  mutate(Mes=month(Date_dmy))%>%
  mutate(Epoca = ifelse(Mes %in% 4:9,'dry','rainy'))%>%
  group_by(sample_station, Epoca, Ano)%>%
  summarise(across(where(is.numeric), mean), -c(Mes, year_date))

I have several columns that I don't want to be averaged (even if they are numeric). For exemple, columns esp and volum.

update

Exit expectation

enter image description here

Upvotes: 0

Views: 960

Answers (2)

hugh-allan
hugh-allan

Reputation: 1370

Because you are summarising only part of the data, you need to specify what data (rows) of the un-summarised data you want to maintain. In your example, you don't want to summarise Mes and year_date, however you have multiple values within each group (sample_station, Epoca, Ano), of these Mes and year_date columns.

Which values of these unsummarised columns do you want to keep?

If you want to keep all values of the unsummarised columns, you may want to include Mes and year_date inside group_by(sample_station, Epoca, Ano) before summarising.

Alternatively, you may use mutate() rather than summarise() to get summary values in a new column for each row of the original dataframe, then choose your rows from there.

Update: Again, with your edited post including desired output, what values do you expect for Mes. For example, when sample_station == 'A', Epoca == 'rainy' and Ano == 2000, you have values for Mes of 1 & 2, and the same year_date. summarise() wants to calculate one single summary value for this group.

Upvotes: 2

Ronak Shah
Ronak Shah

Reputation: 388982

You can use across(c(where(is.numeric), -Mes). Note that year_date is not included in the calculation as it is not of class numeric and also because it is included in group_by.

You can also combine multiple mutate statements into one.

If you want to exclude certain columns from the average calculation but want to keep it in the dataframe you need to decide which value do you want to keep. For example, to keep the 1st value you can use first.

library(dplyr)
library(lubridate)

data.frame(sample_station, Date_dmy, temperature, wind_speed)%>%
  mutate(Date_dmy = dmy(Date_dmy),
         year_date = floor_date(Date_dmy,'year'),
         Ano=year(Date_dmy),
         Mes=month(Date_dmy),
         Epoca = ifelse(Mes %in% 4:9,'dry','rainy')) %>%
  group_by(sample_station, year_date, Epoca) %>%
  summarise(across(c(where(is.numeric), -Mes), mean),
            across(Mes, first))

Upvotes: 0

Related Questions