Reputation: 433
Suppose I have the following data and data frame:
sample_data <- c(1:14)
sample_data2 <- c(NA,NA,NA, "break", NA, NA, "break", NA,NA,NA,NA,NA,NA,"break")
sample_df <- as.data.frame(sample_data)
sample_df$sample_data2 <- sample_data2
When I print this data frame, the results are as follows:
sample_data sample_data2
1 1 <NA>
2 2 <NA>
3 3 <NA>
4 4 break
5 5 <NA>
6 6 <NA>
7 7 break
8 8 <NA>
9 9 <NA>
10 10 <NA>
11 11 <NA>
12 12 <NA>
13 13 <NA>
14 14 break
How would I program it so that at every "break", it outputs the max from that row up? For instance, I would want the code to output the set of (4,7,14). Additionally, I would want it so that it only finds the max value between up to the next "break" interval. I apologize in advance if I used any incorrect nomenclature.
Upvotes: 2
Views: 282
Reputation: 40171
Depending whether you want to assess the maximum "sample_data" number between all "sample_data2" == break including (e.g. row 1 to row 4) or excluding (e.g. row 1 to row 3) the given "sample_data2" == break row, you can do something like this with tidyverse
:
Excluding the break rows:
sample_df %>%
group_by(sample_data2) %>%
mutate(temp = ifelse(is.na(sample_data2), NA_character_, paste0(gl(length(sample_data2), 1)))) %>%
ungroup() %>%
fill(temp, .direction = "up") %>%
filter(is.na(sample_data2)) %>%
group_by(temp) %>%
summarise(res = max(sample_data))
temp res
<chr> <dbl>
1 1 3.
2 2 6.
3 3 13.
Including the break rows:
sample_df %>%
group_by(sample_data2) %>%
mutate(temp = ifelse(is.na(sample_data2), NA_character_, paste0(gl(length(sample_data2), 1)))) %>%
ungroup() %>%
fill(temp, .direction = "up") %>%
group_by(temp) %>%
summarise(res = max(sample_data))
temp res
<chr> <dbl>
1 1 4.
2 2 7.
3 3 14.
Both of the codes create an ID variable called "temp" using gl()
for "sample_data2" == break and then fill up the NA rows with that ID. Then, the first code filters out the "sample_data2" == break rows and assess the maximum "sample_data" values per group, while the second assess the maximum "sample_data" values per group including the "sample_data2" == break rows.
Upvotes: 1
Reputation: 78
Looks like there are lots of different ways of doing this. This is how I went about it:
rows <- which(sample_data2 == "break") #Get the row indices for where "break" appears
findmax <- function(maxrow) {
max(sample_data[1:maxrow])
} #Create a function that returns the max "up to" a given row
sapply(rows, findmax) #apply it for each of your rows
### [1] 4 7 14
Note that this works "up to" the given row. To get the maximum value between the two breaks would probably be easier with one of the other solutions, but you could also do it by looking at the j-1 row to jth row from the rows
object.
Upvotes: 1
Reputation: 76651
Here are 2 ways with base R. The trick is to define a grouping variable, grp
.
grp <- !is.na(sample_df$sample_data2) & sample_df$sample_data2 == "break"
grp <- rev(cumsum(rev(grp)))
grp <- -1*grp + max(grp)
tapply(sample_df$sample_data, grp, max, na.rm = TRUE)
aggregate(sample_data ~ grp, sample_df, max, na.rm = TRUE)
Data.
This is simplified data creation code.
sample_data <- 1:14
sample_data2 <- c(NA,NA,NA, "break", NA, NA, "break", NA,NA,NA,NA,NA,NA,"break")
sample_df <- data.frame(sample_data, sample_data2)
Upvotes: 1
Reputation: 12470
I construct the groups looking for the word "break" and then move the results one row up. Then some dplyr
commands to get max of every group.
library(dplyr)
sample_df_new <- sample_df %>%
mutate(group = c(1, cumsum(grepl("break", sample_data2)) + 1)[1:length(sample_data2)]) %>%
group_by(group) %>%
summarise(group_max = max(sample_data))
> sample_df_new
# A tibble: 3 x 2
group group_max
<dbl> <dbl>
1 1 4
2 2 7
3 3 14
Upvotes: 2
Reputation: 5673
I have an answer using data.table
:
library(data.table)
sample_df <- setDT(sample_df)
sample_df[,group := (rleid(sample_data2)-0.5)%/%2]
sample_df[,.(maxvalues = max(sample_data)),by = group]
group maxvalues
1: 0 4
2: 1 7
3: 2 14
The tricky part is (rleid(sample_data2)-0.5)%/%2
: rleid
create an increasing index to each change :
sample_data sample_data2 rleid
1: 1 NA 1
2: 2 NA 1
3: 3 NA 1
4: 4 break 2
5: 5 NA 3
6: 6 NA 3
7: 7 break 4
8: 8 NA 5
9: 9 NA 5
10: 10 NA 5
11: 11 NA 5
12: 12 NA 5
13: 13 NA 5
14: 14 break 6
If you keep the entire part of that index - 0.5, you have a constant index for the rows you want, that you can use for grouping operation:
sample_data sample_data2 group
1: 1 NA 0
2: 2 NA 0
3: 3 NA 0
4: 4 break 0
5: 5 NA 1
6: 6 NA 1
7: 7 break 1
8: 8 NA 2
9: 9 NA 2
10: 10 NA 2
11: 11 NA 2
12: 12 NA 2
13: 13 NA 2
14: 14 break 2
Then it is just taking the maximum for each group. You can easily translate it into dplyr
if it is easier for you
Upvotes: 2