Reputation: 4106

dplyr- cumulative sum of a column, based on comparisons with other columns

I'm reading in some XY data from a file, and need to do a cumulative sum on the Y ordinate with steps coarser than that in the original data.

How can I do a cumsum on a the Y column based on X values in other columns?

library(tidyverse)
x_data <- seq(2, 5, 0.2) # this is read in from a file
y_data <- runif(length(x_data)) # this is read in from a file
input_data <- tibble(x_data, y_data)

x_steps <- seq(2, 5, 1)
final_data = tibble(x_steps)

#how to cumsum input_data$y_data based on final_data$x_steps?
#x_data and x_steps are guarenteed to be strictly increasing

*edit: Some example data:

# input_data
2.0   0.55005588
2.2   0.27577714
2.4   0.24215415
2.6   0.70794255
2.8   0.30632029
3.0   0.71693474
3.2   0.49712663
3.4   0.21227011
3.6   0.74823731
3.8   0.90176957
4.0   0.52331898
4.2   0.79409880
4.4   0.04228391
4.6   0.06220941
4.8   0.50969235
5.0   0.11734668

should turn into:

#final_data
2.0   0.55005588
3.0   2.79918475
4.0   5.68190735
5.0   7.2075385

Upvotes: 2

Answers (5)

TarJae

Reputation: 79174

Another tidyverse approach: Last 2 lines same as @deschen:

Main challenge is to group for summing. Here we use a combination of coalesce, str_extract and parse_number to group:

library(tidyverse)

input_data %>% 
  group_by(id = coalesce(parse_number(str_extract(x_data, '\\d+\\.'))+1, x_data)) %>% 
  summarise(sum = sum(y_data)) %>% 
  mutate(cumsum_ydata = cumsum(sum))

      id    sum cumsum_ydata
   <dbl>  <dbl>        <dbl>
 1     2  0.874        0.874
 2     3 49.2         50.1  
 3     4 48.7         98.8  
 4     5 45.0        144.   
 5     6 48.2        192.   
 6     7 49.1        241.   
 7     8 47.4        289.   
 8     9 46.1        335.   
 9    10 50.8        385.   
10    11 51.1        437.   
# ... with 154 more rows

Upvotes: 1

masher

Reputation: 4106

This is what I came up with:

x_steps <- seq(2, 5, 1)
final_data <-  tibble(x_steps)
final_data <- final_data  %>% mutate(id=cut(x_steps, breaks=c(-Inf, x_steps), labels=seq(2, 5, 1)))

x_data <- seq(2, 5, 0.2) # this is really read in from a file
y_data <- runif(length(x_data)) # this is really read in from a file
input_data <- tibble(x_data, y_data)

final_data <- input_data %>% 
  mutate(id=cut(x_data, breaks=c(-Inf, x_steps), labels=seq(2, 5, 1))) %>% 
  group_by(id) %>% 
  summarise(totals=sum(y_data)) %>% 
  mutate(csum = cumsum(totals)) %>% 
  select(c("id", "csum")) %>% 
  right_join(final_data, by="id") %>% 
  relocate(x_steps)

I stuck with the explicit ID column, as I wanted to join it back to the final_data tibble, along with other calculations.

Part my own deductions, part taking from @deschen

Upvotes: 0

jpsmith

Reputation: 17495

Though not the most elegant solution, if you wanted/neded a brute-force base R approach:

want <- matrix(NA, ncol = 2, nrow = length(x_steps))
for(xx in seq_along(x_steps)){
  res <- seq(x_steps[xx], x_steps[xx] + 1, 0.0001)
  want[xx,] <- c(x_steps[xx], 
                 max(cumsum(input_data[input_data$x_data %in% res[-length(res)], "y_data"])))
}

Output:

#>want
#[,1]     [,2]
#[1,]    2 3.308344
#[2,]    3 1.581712
#[3,]    4 2.660372
#[4,]    5 0.899825

If the steps are truly lower resolution, an easier dplyr() approach may be to use floor() then perform some basic functions:


input_data %>% 
  group_by(floor(x_data)) %>% 
  mutate(cum_sum = cumsum(y_data)) %>%
  filter(row_number() == n()) %>%
  select(cum_sum)

Output:

  `floor(x_data)` cum_sum
            <dbl>   <dbl>
1               2   3.31 
2               3   2.47 
3               4   2.76 
4               5   0.900

Upvotes: 1

deschen

Reputation: 11006

Alternative solution:

input_data %>%
  mutate(id = as.integer(cut(x_data, breaks = 0:max(x_data)))) %>%
  group_by(id) %>%
  summarize(sum = sum(y_data)) %>%
  mutate(cumsum_ydata = cumsum(sum))

Upvotes: 2

Mohamed Desouky

Reputation: 4425

We can try

library(tidyverse)

input_data |>
           mutate(x_steps = rev(trunc(x_data))) |>
           group_by(x_steps) |> 
           summarise(cum_y_data = sum(y_data)) |>
           mutate(cum_y_data = rev(cum_y_data)) |> 
           mutate(cum_y_data = cumsum(cum_y_data))

output

# A tibble: 4 × 2
  x_steps cum_y_data
    <dbl>      <dbl>
1       2      0.550
2       3      2.80 
3       4      5.68 
4       5      7.21

Upvotes: 2

dplyr- cumulative sum of a column, based on comparisons with other columns

Answers (5)

Related Questions