Reputation: 10199
I have range of numbers with scores. I want to calculate average of fixed length ranges between rows.(find fixed ranges and averaging)
let's say: df1:
start end score
100 105 2.2
105 110 2.4
110 115 2.6
130 135 1.4
135 140 1.2
200 205 5.0
205 210 5.8
210 215 5.4
I want to make such table: df2
start end avg
100 115 2.4
130 140 1.3
200 215 5.4
The fixed range refers to difference between rows;ex: In the first 3 rows of df1
the differenece is 5 (100-105-110)
and then in row 4
it jumps into 130
. Thus, the detected continuous range is from 100 to 115
)
I would thankful if you tell me how can I do it in R.
Upvotes: 0
Views: 179
Reputation: 887571
One way would be to try dplyr
. Here, we create a grouping variable grp
based on whether the difference between current and succeeding row of start
is greater than 5
or not (diff(start)>5
) using cumsum
after concatenating with TRUE
as the diff
output would be shorter in length by 1
than the nrow(df)
. After we did the grouping group_by(...)
, use summarise
and select the first value of start
(start[1]
) per grp
, last value of end
(end[n()]
) and the mean
of score. n()
gives the nrow(.)
for the grouped subsets.
library(dplyr)
df %>%
group_by(grp=cumsum(c(TRUE, diff(start)>5))) %>%
summarise(start=start[1], end=end[n()], score=mean(score))%>%
select(-grp) #as suggested by @MrFlick
# start end score
#1 100 115 2.4
#2 130 140 1.3
#3 200 215 5.4
Or using data.table
. Here, also, we are summarising the results in a list
. Instead of n()
, the syntax is .N
.
library(data.table)
setDT(df)[,list(start=start[1], end=end[.N], score=mean(score)) ,
by=list(grp=cumsum(c(TRUE, diff(start)>5)))]
Or using base R
do.call(rbind,by(df, list(grp=cumsum(c(TRUE, diff(df$start)>5))),
FUN= function(x) with(x,c(start=start[1], end=end[nrow(x)],
score=mean(score)))))
# start end score
#1 100 115 2.4
#2 130 140 1.3
#3 200 215 5.4
df <- structure(list(start = c(100L, 105L, 110L, 130L, 135L, 200L,
205L, 210L), end = c(105L, 110L, 115L, 135L, 140L, 205L, 210L,
215L), score = c(2.2, 2.4, 2.6, 1.4, 1.2, 5, 5.8, 5.4)), .Names = c("start",
"end", "score"), class = "data.frame", row.names = c(NA, -8L))
Upvotes: 5