BillyJean
BillyJean

Reputation: 1577

Averaging dataframe based on current row-value and preceeding rows

I have a simple data set with the following form

df<- data.frame(c(10, 10, 10,  10,  10,  10,  10,  10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),   
                c(80,  80,  80,  80,  80,  80,  80,  80, 90, 90, 90, 90, 90, 90, 90, 90, 80,  80,  80,  80,  80,  80,  80,  80, 90, 90, 90, 90, 90, 90, 90, 90),
                c(1,    1,   2,   2,   3,   3,   4,   4,     1,   1,    2, 2,   3,    3,   4,   4,  1,    1,   2,   2,   3,   3,   4,   4,     1,   1,    2, 2,   3,    3,   4,   4),
                c(25,   75,  20,  40,  60,  50,  20,  10,  20,  30,  40,  60, 25,   75,  20,  40, 5,   5,  2,  4,  6,  5,  2,  1,  2,  3,  4,  6, 2,   7,  2,  4))

colnames(df)<-c("car_number", "year", "marker", "val")

What I am trying to do is quite simple, actually: Per car_number, I want to find the average of the values associated with a marker -value and the preceeding 3 values. So for the example data above the output I want is

car=10, year=80 1: 50
car=10, year=80 2: 40
car=10, year=80 3: 45
car=10, year=80 4: 37.5

car=10, year=90 1: 31.25
car=10, year=90 2: 36.25
car=10, year=90 3: 35
car=10, year=90 4: 38.75


car=20, year=80 1: 5
car=20, year=80 2: 4
car=20, year=80 3: 4.5
car=20, year=80 4: 3.75

car=20, year=90 1: 3.125
car=20, year=90 2: 3.625
car=20, year=90 3: 3.375
car=20, year=90 4: 3.750

Note that for simplicity of the example the markers above come in pairs of two. That is not the case with the real data, so I am thinking a general solution will contain some sort of group_by (?)

Any efficient solution is welcome!


Here is a second example data set and output:

df<- data.frame(c(10, 10, 10,  10,  10,  10,  10,  10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),   
                c(80,  80,  80,  80,  80,  80,  80,  80, 90, 90, 90, 90, 90, 90, 90, 90, 80,  80,  80,  80,  80,  80,  80,  80, 90, 90, 90, 90, 90, 90, 90, 90),
                c(1,    2,   2,   2,   3,   3,   4,   4,     1,   1,    2, 2,   3,    3,   3,   4,  1,    1,   1,   2,   3,   3,   4,   4,     4,   1,    2, 2,   3,    3,   3,   4),
                c(25,   75,  20,  40,  60,  50,  20,  10,  20,  30,  40,  60, 25,   75,  20,  40, 5,   5,  2,  4,  6,  5,  2,  1,  2,  3,  4,  6, 2,   7,  2,  4))

colnames(df)<-c("car_number", "year", "marker", "val")

And the output is (based on the rules above)

car=10, year=80 1: Mean{{25}]                                  = 25
car=10, year=80 2: Mean[{40, 20, 75, 25}]                      = 40
car=10, year=80 3: Mean[{50, 60, 40, 20, 75, 25}]              = 45
car=10, year=80 4: Mean[{10, 20, 50, 60, 40, 20, 75, 25}]      = 37.5

car=10, year=90 1: Mean[{30, 20, 10, 20, 50, 60, 40, 20, 75}] = 36.11
car=10, year=90 2: Mean[{60, 40, 30, 20, 10, 20, 50, 60}]     = 36.25
car=10, year=90 3: Mean[{20, 75, 25, 60, 40, 30, 20, 10, 20}] = 33.33
car=10, year=90 4: Mean[{40, 20, 75, 25, 60, 40, 30, 20}]     = 38.75

car=20, year=80 1: Mean[{2, 5, 5}]                   = 4
car=20, year=80 2: Mean[{4, 2, 5, 5}]                = 4
car=20, year=80 3: Mean[{5, 6, 4, 2, 5, 5}]          = 4.5
car=20, year=80 4: Mean[{2, 1, 2, 5, 6, 4, 2, 5, 5}] = 3.55

car=20, year=90 1: Mean[{3, 2, 1, 2, 5, 6, 4}]       = 3.29
car=20, year=90 2: Mean[{6, 4, 3, 2, 1, 2, 5, 6}]    = 3.625
car=20, year=90 3: Mean[{2, 7, 2, 6, 4, 3, 2, 1, 2}] = 3.22
car=20, year=90 4: Mean[{4, 2, 7, 2, 6, 4, 3}]       = 4

Upvotes: 3

Views: 165

Answers (5)

Aramis7d
Aramis7d

Reputation: 2496

considering df as your input, you can use dplyr and zoo and try:

grouping only over car_number, you can try:

df %>%
  group_by(car_number, year, marker) %>%
  summarise(mm = mean(val)) %>%
  group_by(car_number) %>%
  mutate(rM=rollapply(mm, if_else(row_number() < 4, marker, 4), FUN=mean, align="right"))%>%
  select(year, rM)

which gives:

Source: local data frame [16 x 3]
Groups: car_number [2]

   car_number  year     rM
        <dbl> <dbl>  <dbl>
1          10    80 50.000
2          10    80 40.000
3          10    80 45.000
4          10    80 37.500
5          10    90 31.250
6          10    90 36.250
7          10    90 35.000
8          10    90 38.750
9          20    80  5.000
10         20    80  4.000
11         20    80  4.500
12         20    80  3.750
13         20    90  3.125
14         20    90  3.625
15         20    90  3.375
16         20    90  3.750

Upvotes: 0

Scarabee
Scarabee

Reputation: 5704

You can do it this way:

df %>%
  group_by(car_number, year, marker) %>%
  summarise(s = sum(val), w = n()) %>% # sum and number of values
  group_by(car_number) %>%
  mutate(S = cumsum(s) - cumsum(lag(s, 4, default=0))) %>% # sum of last four s
  mutate(W = cumsum(w) - cumsum(lag(w, 4, default=0))) %>% # same for the weights
  mutate(result = S/W)

Output of your second example:

# Source: local data frame [16 x 8]
# Groups: car_number [2]
# 
#    car_number  year marker     s     w     S     W    result
#         <dbl> <dbl>  <dbl> <dbl> <int> <dbl> <int>     <dbl>
# 1          10    80      1    25     1    25     1 25.000000
# 2          10    80      2   135     3   160     4 40.000000
# 3          10    80      3   110     2   270     6 45.000000
# 4          10    80      4    30     2   300     8 37.500000
# 5          10    90      1    50     2   325     9 36.111111
# 6          10    90      2   100     2   290     8 36.250000
# 7          10    90      3   120     3   300     9 33.333333
# 8          10    90      4    40     1   310     8 38.750000
# 9          20    80      1    12     3    12     3  4.000000
# 10         20    80      2     4     1    16     4  4.000000
# 11         20    80      3    11     2    27     6  4.500000
# 12         20    80      4     5     3    32     9  3.555556
# 13         20    90      1     3     1    23     7  3.285714
# 14         20    90      2    10     2    29     8  3.625000
# 15         20    90      3    11     3    29     9  3.222222
# 16         20    90      4     4     1    28     7  4.000000

Edit: It's probably more efficient with package RcppRoll, you can try that: S = roll_sum(c(0, 0, 0, s), 4) (and same for W).

Upvotes: 0

lmo
lmo

Reputation: 38500

Here is a method with data.table that modifies Frank's suggestion in David Arenburg's answer here.

# aggregate data by car_number, year, and marker
dfNew <- setDT(df)[, .(val=mean(val)), by=.(car_number, year, marker)]
# calculate average of current a previous three values
dfNew[, val := rowMeans(dfNew[,shift(val, 0:3), by=car_number][, -1], na.rm=TRUE)]

The first line is a standard aggregation call. The second line makes some changes to the rowMeans method in the linked answer. rowMeans is fed a data.table of the shifted values, where the shift occurs by car_number (thanks to sotos for catching this), which is chained to a statement that drops the first column (using -1), which is the car_number column returned in the first part of the chain.

this returns

   car_number year marker    val
 1:         10   80      1 50.000
 2:         10   80      2 40.000
 3:         10   80      3 45.000
 4:         10   80      4 37.500
 5:         10   90      1 31.250
 6:         10   90      2 36.250
 7:         10   90      3 35.000
 8:         10   90      4 38.750
 9:         20   80      1  5.000
10:         20   80      2  4.000
11:         20   80      3  4.500
12:         20   80      4  3.750
13:         20   90      1  3.125
14:         20   90      2  3.625
15:         20   90      3  3.375
16:         20   90      4  3.750

Upvotes: 2

Sotos
Sotos

Reputation: 51582

Just throwing a base R solution in the mix. We can make a custom function using Reduce with accumulate = TRUE and tail(x, 4) to ensure that only last 3 observations will be included. All these after we average the data set by car_type, year, marker, i.e.

f1 <- function(x){
    sapply(Reduce(c, x, accumulate = TRUE), function(i) mean(tail(i,4)))
  }

dd <- aggregate(val ~ car_number+year+marker, df, mean)
dd <- dd[order(dd$car_number, dd$year, dd$marker),]
dd$new_avg <- with(dd, ave(val, car_number, FUN = f1))

dd
#   car_number year marker  val new_avg
#1          10   80      1 50.0  50.000
#5          10   80      2 30.0  40.000
#9          10   80      3 55.0  45.000
#13         10   80      4 15.0  37.500
#3          10   90      1 25.0  31.250
#7          10   90      2 50.0  36.250
#11         10   90      3 50.0  35.000
#15         10   90      4 30.0  38.750
#2          20   80      1  5.0   5.000
#6          20   80      2  3.0   4.000
#10         20   80      3  5.5   4.500
#14         20   80      4  1.5   3.750
#4          20   90      1  2.5   3.125
#8          20   90      2  5.0   3.625
#12         20   90      3  4.5   3.375
#16         20   90      4  3.0   3.750

Upvotes: 2

Aur&#232;le
Aur&#232;le

Reputation: 12819

A first group_by computes the mean by car_number, year, marker, and retains its weight (number of rows).
A second group_by by car_number allows us to retrieve lagging means and weights to compute the desired mean:

library(purrr)
library(dplyr)
df %>% 
  arrange(car_number, year, marker) %>% 
  group_by(car_number, year, marker) %>% 
  summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>% 
  group_by(car_number) %>% 
  mutate(mean_2 = pmap_dbl(
    list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
         weight, lag(weight), lag(weight, 2), lag(weight, 3)),
    ~ weighted.mean(c(..1, ..2, ..3, ..4),
                    c(..5, ..6, ..7, ..8),
                    na.rm = TRUE)
  )) %>%
  ungroup()

Result:

# # A tibble: 16 × 6
#    car_number  year marker mean_1 weight mean_2
#         <dbl> <dbl>  <dbl>  <dbl>  <int>  <dbl>
# 1          10    80      1   50.0      2 50.000
# 2          10    80      2   30.0      2 40.000
# 3          10    80      3   55.0      2 45.000
# 4          10    80      4   15.0      2 37.500
# 5          10    90      1   25.0      2 31.250
# 6          10    90      2   50.0      2 36.250
# 7          10    90      3   50.0      2 35.000
# 8          10    90      4   30.0      2 38.750
# 9          20    80      1    5.0      2  5.000
# 10         20    80      2    3.0      2  4.000
# 11         20    80      3    5.5      2  4.500
# 12         20    80      4    1.5      2  3.750
# 13         20    90      1    2.5      2  3.125
# 14         20    90      2    5.0      2  3.625
# 15         20    90      3    4.5      2  3.375
# 16         20    90      4    3.0      2  3.750

Edit: Alternative syntax for purrr versions prior to 0.2.2.9000:

df %>% 
  arrange(car_number, year, marker) %>% 
  group_by(car_number, year, marker) %>% 
  summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>% 
  group_by(car_number) %>% 
  mutate(mean_2 = pmap_dbl(
    list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
         weight, lag(weight), lag(weight, 2), lag(weight, 3)),
    function(a, b, c, d, e, f, g, h)
      weighted.mean(c(a, b, c, d),
                    c(e, f, g, h),
                    na.rm = TRUE)
  )) %>%
  ungroup()

Upvotes: 2

Related Questions