Reputation: 1577
I have a simple data set with the following form
df<- data.frame(c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),
c(80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90, 80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90),
c(1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60, 25, 75, 20, 40, 5, 5, 2, 4, 6, 5, 2, 1, 2, 3, 4, 6, 2, 7, 2, 4))
colnames(df)<-c("car_number", "year", "marker", "val")
What I am trying to do is quite simple, actually: Per car_number
, I want to find the average of the values associated with a marker
-value and the preceeding 3 values. So for the example data above the output I want is
car=10, year=80 1: 50
car=10, year=80 2: 40
car=10, year=80 3: 45
car=10, year=80 4: 37.5
car=10, year=90 1: 31.25
car=10, year=90 2: 36.25
car=10, year=90 3: 35
car=10, year=90 4: 38.75
car=20, year=80 1: 5
car=20, year=80 2: 4
car=20, year=80 3: 4.5
car=20, year=80 4: 3.75
car=20, year=90 1: 3.125
car=20, year=90 2: 3.625
car=20, year=90 3: 3.375
car=20, year=90 4: 3.750
Note that for simplicity of the example the markers
above come in pairs of two. That is not the case with the real data, so I am thinking a general solution will contain some sort of group_by
(?)
Any efficient solution is welcome!
Here is a second example data set and output:
df<- data.frame(c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),
c(80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90, 80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90),
c(1, 2, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 3, 4, 1, 1, 1, 2, 3, 3, 4, 4, 4, 1, 2, 2, 3, 3, 3, 4),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60, 25, 75, 20, 40, 5, 5, 2, 4, 6, 5, 2, 1, 2, 3, 4, 6, 2, 7, 2, 4))
colnames(df)<-c("car_number", "year", "marker", "val")
And the output is (based on the rules above)
car=10, year=80 1: Mean{{25}] = 25
car=10, year=80 2: Mean[{40, 20, 75, 25}] = 40
car=10, year=80 3: Mean[{50, 60, 40, 20, 75, 25}] = 45
car=10, year=80 4: Mean[{10, 20, 50, 60, 40, 20, 75, 25}] = 37.5
car=10, year=90 1: Mean[{30, 20, 10, 20, 50, 60, 40, 20, 75}] = 36.11
car=10, year=90 2: Mean[{60, 40, 30, 20, 10, 20, 50, 60}] = 36.25
car=10, year=90 3: Mean[{20, 75, 25, 60, 40, 30, 20, 10, 20}] = 33.33
car=10, year=90 4: Mean[{40, 20, 75, 25, 60, 40, 30, 20}] = 38.75
car=20, year=80 1: Mean[{2, 5, 5}] = 4
car=20, year=80 2: Mean[{4, 2, 5, 5}] = 4
car=20, year=80 3: Mean[{5, 6, 4, 2, 5, 5}] = 4.5
car=20, year=80 4: Mean[{2, 1, 2, 5, 6, 4, 2, 5, 5}] = 3.55
car=20, year=90 1: Mean[{3, 2, 1, 2, 5, 6, 4}] = 3.29
car=20, year=90 2: Mean[{6, 4, 3, 2, 1, 2, 5, 6}] = 3.625
car=20, year=90 3: Mean[{2, 7, 2, 6, 4, 3, 2, 1, 2}] = 3.22
car=20, year=90 4: Mean[{4, 2, 7, 2, 6, 4, 3}] = 4
Upvotes: 3
Views: 165
Reputation: 2496
considering df
as your input, you can use dplyr
and zoo
and try:
grouping only over car_number
, you can try:
df %>%
group_by(car_number, year, marker) %>%
summarise(mm = mean(val)) %>%
group_by(car_number) %>%
mutate(rM=rollapply(mm, if_else(row_number() < 4, marker, 4), FUN=mean, align="right"))%>%
select(year, rM)
which gives:
Source: local data frame [16 x 3]
Groups: car_number [2]
car_number year rM
<dbl> <dbl> <dbl>
1 10 80 50.000
2 10 80 40.000
3 10 80 45.000
4 10 80 37.500
5 10 90 31.250
6 10 90 36.250
7 10 90 35.000
8 10 90 38.750
9 20 80 5.000
10 20 80 4.000
11 20 80 4.500
12 20 80 3.750
13 20 90 3.125
14 20 90 3.625
15 20 90 3.375
16 20 90 3.750
Upvotes: 0
Reputation: 5704
You can do it this way:
df %>%
group_by(car_number, year, marker) %>%
summarise(s = sum(val), w = n()) %>% # sum and number of values
group_by(car_number) %>%
mutate(S = cumsum(s) - cumsum(lag(s, 4, default=0))) %>% # sum of last four s
mutate(W = cumsum(w) - cumsum(lag(w, 4, default=0))) %>% # same for the weights
mutate(result = S/W)
Output of your second example:
# Source: local data frame [16 x 8]
# Groups: car_number [2]
#
# car_number year marker s w S W result
# <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl>
# 1 10 80 1 25 1 25 1 25.000000
# 2 10 80 2 135 3 160 4 40.000000
# 3 10 80 3 110 2 270 6 45.000000
# 4 10 80 4 30 2 300 8 37.500000
# 5 10 90 1 50 2 325 9 36.111111
# 6 10 90 2 100 2 290 8 36.250000
# 7 10 90 3 120 3 300 9 33.333333
# 8 10 90 4 40 1 310 8 38.750000
# 9 20 80 1 12 3 12 3 4.000000
# 10 20 80 2 4 1 16 4 4.000000
# 11 20 80 3 11 2 27 6 4.500000
# 12 20 80 4 5 3 32 9 3.555556
# 13 20 90 1 3 1 23 7 3.285714
# 14 20 90 2 10 2 29 8 3.625000
# 15 20 90 3 11 3 29 9 3.222222
# 16 20 90 4 4 1 28 7 4.000000
Edit:
It's probably more efficient with package RcppRoll
, you can try that: S = roll_sum(c(0, 0, 0, s), 4)
(and same for W
).
Upvotes: 0
Reputation: 38500
Here is a method with data.table
that modifies Frank's suggestion in David Arenburg's answer here.
# aggregate data by car_number, year, and marker
dfNew <- setDT(df)[, .(val=mean(val)), by=.(car_number, year, marker)]
# calculate average of current a previous three values
dfNew[, val := rowMeans(dfNew[,shift(val, 0:3), by=car_number][, -1], na.rm=TRUE)]
The first line is a standard aggregation call. The second line makes some changes to the rowMeans
method in the linked answer. rowMeans
is fed a data.table of the shifted values, where the shift occurs by car_number (thanks to sotos for catching this), which is chained to a statement that drops the first column (using -1), which is the car_number column returned in the first part of the chain.
this returns
car_number year marker val
1: 10 80 1 50.000
2: 10 80 2 40.000
3: 10 80 3 45.000
4: 10 80 4 37.500
5: 10 90 1 31.250
6: 10 90 2 36.250
7: 10 90 3 35.000
8: 10 90 4 38.750
9: 20 80 1 5.000
10: 20 80 2 4.000
11: 20 80 3 4.500
12: 20 80 4 3.750
13: 20 90 1 3.125
14: 20 90 2 3.625
15: 20 90 3 3.375
16: 20 90 4 3.750
Upvotes: 2
Reputation: 51582
Just throwing a base R solution in the mix. We can make a custom function using Reduce
with accumulate = TRUE
and tail(x, 4)
to ensure that only last 3 observations will be included. All these after we average the data set by car_type, year, marker
, i.e.
f1 <- function(x){
sapply(Reduce(c, x, accumulate = TRUE), function(i) mean(tail(i,4)))
}
dd <- aggregate(val ~ car_number+year+marker, df, mean)
dd <- dd[order(dd$car_number, dd$year, dd$marker),]
dd$new_avg <- with(dd, ave(val, car_number, FUN = f1))
dd
# car_number year marker val new_avg
#1 10 80 1 50.0 50.000
#5 10 80 2 30.0 40.000
#9 10 80 3 55.0 45.000
#13 10 80 4 15.0 37.500
#3 10 90 1 25.0 31.250
#7 10 90 2 50.0 36.250
#11 10 90 3 50.0 35.000
#15 10 90 4 30.0 38.750
#2 20 80 1 5.0 5.000
#6 20 80 2 3.0 4.000
#10 20 80 3 5.5 4.500
#14 20 80 4 1.5 3.750
#4 20 90 1 2.5 3.125
#8 20 90 2 5.0 3.625
#12 20 90 3 4.5 3.375
#16 20 90 4 3.0 3.750
Upvotes: 2
Reputation: 12819
A first group_by
computes the mean by car_number
, year
, marker
, and retains its weight (number of rows).
A second group_by
by car_number
allows us to retrieve lag
ging means and weights to compute the desired mean:
library(purrr)
library(dplyr)
df %>%
arrange(car_number, year, marker) %>%
group_by(car_number, year, marker) %>%
summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>%
group_by(car_number) %>%
mutate(mean_2 = pmap_dbl(
list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
weight, lag(weight), lag(weight, 2), lag(weight, 3)),
~ weighted.mean(c(..1, ..2, ..3, ..4),
c(..5, ..6, ..7, ..8),
na.rm = TRUE)
)) %>%
ungroup()
Result:
# # A tibble: 16 × 6
# car_number year marker mean_1 weight mean_2
# <dbl> <dbl> <dbl> <dbl> <int> <dbl>
# 1 10 80 1 50.0 2 50.000
# 2 10 80 2 30.0 2 40.000
# 3 10 80 3 55.0 2 45.000
# 4 10 80 4 15.0 2 37.500
# 5 10 90 1 25.0 2 31.250
# 6 10 90 2 50.0 2 36.250
# 7 10 90 3 50.0 2 35.000
# 8 10 90 4 30.0 2 38.750
# 9 20 80 1 5.0 2 5.000
# 10 20 80 2 3.0 2 4.000
# 11 20 80 3 5.5 2 4.500
# 12 20 80 4 1.5 2 3.750
# 13 20 90 1 2.5 2 3.125
# 14 20 90 2 5.0 2 3.625
# 15 20 90 3 4.5 2 3.375
# 16 20 90 4 3.0 2 3.750
Edit: Alternative syntax for purrr
versions prior to 0.2.2.9000
:
df %>%
arrange(car_number, year, marker) %>%
group_by(car_number, year, marker) %>%
summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>%
group_by(car_number) %>%
mutate(mean_2 = pmap_dbl(
list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
weight, lag(weight), lag(weight, 2), lag(weight, 3)),
function(a, b, c, d, e, f, g, h)
weighted.mean(c(a, b, c, d),
c(e, f, g, h),
na.rm = TRUE)
)) %>%
ungroup()
Upvotes: 2