Cina
Cina

Reputation: 10199

grouping and averaging fixed range of numbers in R

I have range of numbers with scores. I want to calculate average of fixed length ranges between rows.(find fixed ranges and averaging)
let's say: df1:

start   end    score
100     105     2.2
105     110     2.4
110     115     2.6
130     135     1.4
135     140     1.2
200     205     5.0
205     210     5.8
210     215     5.4

I want to make such table: df2

start   end   avg
100     115   2.4
130     140   1.3
200     215   5.4

The fixed range refers to difference between rows;ex: In the first 3 rows of df1 the differenece is 5 (100-105-110) and then in row 4 it jumps into 130. Thus, the detected continuous range is from 100 to 115)

I would thankful if you tell me how can I do it in R.

Upvotes: 0

Views: 179

Answers (1)

akrun
akrun

Reputation: 887571

One way would be to try dplyr. Here, we create a grouping variable grp based on whether the difference between current and succeeding row of start is greater than 5 or not (diff(start)>5) using cumsum after concatenating with TRUE as the diff output would be shorter in length by 1 than the nrow(df). After we did the grouping group_by(...), use summarise and select the first value of start (start[1]) per grp, last value of end (end[n()]) and the mean of score. n() gives the nrow(.) for the grouped subsets.

library(dplyr)
df %>% 
   group_by(grp=cumsum(c(TRUE, diff(start)>5))) %>%
   summarise(start=start[1], end=end[n()], score=mean(score))%>%
   select(-grp) #as suggested by @MrFlick
#  start end score
#1   100 115   2.4
#2   130 140   1.3
#3   200 215   5.4

Or using data.table. Here, also, we are summarising the results in a list. Instead of n(), the syntax is .N.

library(data.table)
setDT(df)[,list(start=start[1], end=end[.N], score=mean(score)) ,
                        by=list(grp=cumsum(c(TRUE, diff(start)>5)))]

Or using base R

 do.call(rbind,by(df, list(grp=cumsum(c(TRUE, diff(df$start)>5))), 
      FUN= function(x) with(x,c(start=start[1], end=end[nrow(x)],
                                          score=mean(score)))))
 #   start end score
 #1   100 115   2.4
 #2   130 140   1.3
 #3   200 215   5.4

data

df <- structure(list(start = c(100L, 105L, 110L, 130L, 135L, 200L, 
 205L, 210L), end = c(105L, 110L, 115L, 135L, 140L, 205L, 210L, 
 215L), score = c(2.2, 2.4, 2.6, 1.4, 1.2, 5, 5.8, 5.4)), .Names = c("start", 
 "end", "score"), class = "data.frame", row.names = c(NA, -8L))

Upvotes: 5

Related Questions