ktPeshek
ktPeshek

Reputation: 5

How do I group 90th percentiles by two columns

I'm working with a large dataframe 7191 obs. of 19 variables. The columns are Month, Day, Year, and Site1 through Site16. Where Month is either June, July, August, September, or October.

Here is the beginning of my data, which I believe has only numerals in the columns site1-site16. Currently, I'm double checking to make sure.

dput(head(No_PS_for_Calculations ))
structure(list(Month = c("June", "June", "June", "June", "June", 
"June"), Day = c(1, 2, 3, 4, 5, 6), Year = c(1970, 1970, 1970, 
1970, 1970, 1970), Site1 = c("11.531", "12.298", "12.732", "12.619", 
"12.5", "13.201"), Site2 = c("11.185", "11.439", "12.17", "12.432", 
"12.337", "12.492"), Site3 = c("11.147", "11.496", "11.645", 
"12.208", "12.644", "12.971"), Site4 = c("11.393", "11.707", 
"11.961", "12.135", "12.809", "13.041"), Site5 = c("11.797", 
"11.925", "12.34", "12.525", "13.01", "13.548"), Site6 = c("11.853", 
"11.974", "12.16", "12.481", "12.459", "12.838"), Site7 = c("12.319", 
"12.46", "12.476", "12.729", "13.026", "13.032"), Site8 = c("12.557", 
"12.643", "12.789", "12.975", "13.202", "13.339"), Site9 = c("12.774", 
"13.337", "13.896", "13.897", "13.819", "14.054"), Site10 = c("12.819", 
"13.202", "13.783", "14.298", "14.284", "14.309"), Site11 = c("13.151", 
"13.556", "13.833", "14.08", "14.244", "14.841"), Site12 = c("13.61", 
"13.57", "14.111", "14.073", "14.331", "14.849"), Site13 = c("13.802", 
"13.872", "14.244", "14.249", "14.255", "14.818"), Site14 = c("14.138", 
"14.275", "14.332", "14.522", "14.244", "14.927"), Site15 = c("14.138", 
"14.616", "14.766", "14.697", "14.61", "14.694"), Site16 = c("14.208", 
"14.627", "14.928", "14.829", "14.69", "14.762")), row.names = 151:156, class = "data.frame")

For my analysis I am interested in finding the 90th percentile for each month in each year. For example for 1970, I need the 90th percentile for June, July, August, September, and October. I've tried a few different ways but keep getting stuck in the same spot so I thought I'd ask for help.

result <- No_PS_for_Calculations %>% 
  group_by(Year, Month) %>%
  summarise(across(Site1:`Site16`, quantile, probs = .9, .names = 'percent90_{col}'))
data.frame(result)

Which results in the following error:

Error: Problem with `summarise()` input `..1`.
i `..1 = across(Site1:Site16, quantile, probs = 0.9, .names = "percent90_{col}")`.
x non-numeric argument to binary operator
i The error occurred in group 1: Year = 1970, Month = "August".

I've been able to find the percentile grouped by month but now need to include year for further analysis.

What is the best way to get the 90th Percentiles presented by year and then month?

Thanks for the help!

Upvotes: 0

Views: 110

Answers (1)

r2evans
r2evans

Reputation: 160437

It seems likely that you have something non-numeric in a column between Site1 and Site16. Some fake data:

set.seed(42)
No_PS_for_Calculations <- data.frame(Year = rep(2020:2021, each = 3), Month = rep(c("Aug","Sep","Oct"), times = 2), Site1 = runif(6), Quux = sprintf("%0.03f", runif(6)), Site16 = runif(6))
No_PS_for_Calculations
#   Year Month     Site1  Quux    Site16
# 1 2020   Aug 0.9148060 0.737 0.9346722
# 2 2020   Sep 0.9370754 0.135 0.2554288
# 3 2020   Oct 0.2861395 0.657 0.4622928
# 4 2021   Aug 0.8304476 0.705 0.9400145
# 5 2021   Sep 0.6417455 0.458 0.9782264
# 6 2021   Oct 0.5190959 0.719 0.1174874

No_PS_for_Calculations %>% 
  group_by(Year, Month) %>%
  summarise(across(Site1:`Site16`, quantile, probs = .9, .names = 'percent90_{col}'))
+ > Error: Problem with `summarise()` input `..1`.
# x non-numeric argument to binary operator
# i Input `..1` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
# i The error occurred in group 1: Year = 2020, Month = "Aug".

If the non-numeric data ("Quux" column here) is not meant to be summarized, then you can select the columns you need to avoid any confusion:

No_PS_for_Calculations %>% 
  select(Year, Month, starts_with("Site")) %>%
  group_by(Year, Month) %>%
  summarise(across(Site1:`Site16`, quantile, probs = .9, .names = 'percent90_{col}'))
# # A tibble: 6 x 4
# # Groups:   Year [2]
#    Year Month percent90_Site1 percent90_Site16
#   <int> <chr>           <dbl>            <dbl>
# 1  2020 Aug             0.915            0.737
# 2  2020 Oct             0.286            0.657
# 3  2020 Sep             0.937            0.135
# 4  2021 Aug             0.830            0.705
# 5  2021 Oct             0.519            0.719
# 6  2021 Sep             0.642            0.458

Another cause might be if a legitimate Site column is non-numeric, in which case you need to determine if you can easily convert to numeric. For instance, if "Quux" here is instead named "Site2"

names(No_PS_for_Calculations)[4] <- "Site2"

then we can try to convert it inline:

No_PS_for_Calculations %>%
  mutate(Site2 = as.numeric(Site2)) %>%
  group_by(Year, Month) %>%
  summarise(across(Site1:`Site16`, quantile, probs = .9, .names = 'percent90_{col}'))
# # A tibble: 6 x 5
# # Groups:   Year [2]
#    Year Month percent90_Site1 percent90_Site2 percent90_Site16
#   <int> <chr>           <dbl>           <dbl>            <dbl>
# 1  2020 Aug             0.915           0.737            0.935
# 2  2020 Oct             0.286           0.657            0.462
# 3  2020 Sep             0.937           0.135            0.255
# 4  2021 Aug             0.830           0.705            0.940
# 5  2021 Oct             0.519           0.719            0.117
# 6  2021 Sep             0.642           0.458            0.978

Of course, if there are non-number characters in there, you will get NAs, which is easily fixed given filters, cleaners, or similar.

Upvotes: 1

Related Questions