Kevin Purves
Kevin Purves

Reputation: 15

Cumulative percentage per column of a dataframe

I have a data.frame made up of daily temperatures per julian day for a number of stations.

Minimal reproducible example data.frame:

TemperatureData <- data.frame(
    Julian_Day = 1:365,
    Station_1 = c(rnorm(1:365, mean=10, sd=2)),
    Station_2 = c(rnorm(1:365, mean=10, sd=2)),
    Station_3 = c(rnorm(1:365, mean=10, sd=2))
)

I would like to determine the julian day on which each station exceeds a stated percentage of the total cumulative value, and for an output stating the julian day when this cumulative value threshold is reached for each station.

For example, say Station 1 has a total value of 4000, and after 180 julian days the cumulative value exceeds a set 50% threshold of the total value, and is repeated for each column of the data.frame (example of preferred output below).

Station_1   Station_2   Station_3
180         183         179

I assume this would utilise the cumsum function in some capacity, but not sure how to implement it. Can anyone help with this?

Let me know if this doesn't make sense.

Upvotes: 1

Views: 855

Answers (2)

Vincent
Vincent

Reputation: 17868

Base R solution:

TemperatureData <- data.frame(
    Julian_Day = 1:365,
    Station_1 = c(rnorm(1:365, mean=10, sd=2)),
    Station_2 = c(rnorm(1:365, mean=10, sd=2)),
    Station_3 = c(rnorm(1:365, mean=10, sd=2))
)

TemperatureData$Station_1 <- cumsum(TemperatureData$Station_1) / sum(TemperatureData$Station_1)
TemperatureData$Station_2 <- cumsum(TemperatureData$Station_2) / sum(TemperatureData$Station_2)
TemperatureData$Station_3 <- cumsum(TemperatureData$Station_3) / sum(TemperatureData$Station_3)


results <- c(
  "Station 1" = TemperatureData$Julian_Day[TemperatureData$Station_1 >= .5][1],
  "Station 2" = TemperatureData$Julian_Day[TemperatureData$Station_2 >= .5][1],
  "Station 3" = TemperatureData$Julian_Day[TemperatureData$Station_3 >= .5][1]
)
results
#> Station 1 Station 2 Station 3 
#>       180       185       183

tidyverse solution:

library(dplyr)
TemperatureData %>%
  summarize(across(matches("Station"), 
                   function(x) Julian_Day[cumsum(x) / sum(x) > .5][1]))

data.table solution:

library(data.table)

setDT(TemperatureData)

TemperatureData[, lapply(.SD, function(x) Julian_Day[cumsum(x) / sum(x) > .5][1]), 
                .SDcols=patterns("Station")]

Upvotes: 3

Ben Norris
Ben Norris

Reputation: 5747

Here is a tidyverse way to do it. I figure there is a simpler method, and if I figure it out, I will post it.

library(dplyr)
library(tidyr)
TemperatureData %>% 
  pivot_longer(cols = -Julian_Day, names_to = "Station") %>%
  group_by(Station) %>%
  arrange(Station, Julian_Day) %>%
  mutate(cumpct = cumsum(value) / sum(value)) %>%
  filter(cumpct >= 0.5) %>%
  slice(1) %>%
  pivot_wider(id_cols = 1, names_from = Station, values_from = Julian_Day)

# A tibble: 1 x 3
  Station_1 Station_2 Station_3
      <int>     <int>     <int>
1       184       181       181

Upvotes: 1

Related Questions