Zi-Xin
Zi-Xin

Reputation: 35

R: Sum values based on 2 condition from 2 data frame of different length

I'm trying to get the total & average of some variables based on date. I have fuel data & driving data of several vehicles. Fuel data consist of several dates whereas driving data consist of more dates (Logically you'll only refuel after x trips). My end result would be to get the sum/average driving data based on fuel dates.

Fuel data:

plate = c("AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345") 
date = c("2017-09-08", "2017-09-11", "2017-09-13", "2017-09-20", "2017-09-06", "2017-09-08", "2017-09-15", "2017-09-23", "2017-09-10", "2017-09-18")
liter = c(33, 15, 28, 40, 43, 20, 25, 50, 26, 48) 
df1 = data.frame(plate, date, liter)

Driving data:

plate = c("AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345") 
date = c("2017-09-01", "2017-01-05", "2017-09-08", "2017-09-10", "2017-09-11", "2017-09-12", "2017-09-13", "2017-09-16", "2017-09-17", "2017-09-20", "2017-09-22", "2017-09-25", "2017-09-02", "2017-09-03", "2017-09-06", "2017-09-07", "2017-09-08", "2017-09-09", "2017-09-13", "2017-09-15", "2017-09-17", "2017-09-20", "2017-09-23", "2017-09-25", "2017-09-01", "2017-09-04", "2017-09-09", "2017-09-12", "2017-09-15", "2017-09-18", "2017-09-19", "2017-09-20", "2017-09-23", "2017-09-27", "2017-09-30")
mileage = c(50, 64, 45, 70, 58, 41, 22, 15, 90, 48, 52, 48, 29, 65, 70, 46, 88, 71, 40, 51, 38, 91, 74, 61, 41, 33, 59, 81, 72, 65, 43, 81, 20, 49, 39)
accx = c(0, 3, 4, 0, 8, 11, 2, 5, 9, 10, 2, 22, 9, 6, 7, 6, 8, 1, 0, 1, 8, 1, 7, 6, 4, 3, 9, 11, 22, 15, 13, 1, 2, 4, 9)
df2 = data.frame(plate, date, mileage,accx)

Merging both data

df.all = left_join(df2, df1, by.x =c("plate", "date"))

I'd like to get the total mileage(sum) and average accx based on fuel dates. The end result should look like this:

Expected data frame

Is there a way to use dplyr to mutate the desired results? FYI I only need the mutated parameters with their plates (result line 1,2,4,6,8,etc isn't needed) Thanks in advance!

Upvotes: 1

Views: 148

Answers (2)

LAP
LAP

Reputation: 6685

Although @kath provided a more convenient solution, here's a base R one (if only because I spend some time working on it):

# generate factor to split on
temp <- which(!is.na(df.all$liter))
vec <- temp - c(0, temp[-length(temp)])
df.all$split <- rep(seq(1, length(temp)+1), c(vec, nrow(df.all)-temp[length(temp)]))

# split df.all and calculate t.mileage and a.accx for each subsample
df.temp <- split(df.all, df.all$split)
t.mileage <- sapply(df.temp, function(x) sum(x[, "mileage"]))
a.accx <- sapply(df.temp, function(x) mean(x[, "accx"]))

# generate new variables and insert calculated values
df.all$t.mileage <- NA
df.all$t.mileage[temp] <- t.mileage[-length(t.mileage)]
df.all$a.accx <- NA
df.all$a.accx[temp] <- a.accx[-length(a.accx)]

# display df.all without splitting factor
df.all <- subset(df.all, select = -split)

> df.all
   plate       date mileage accx liter t.mileage     a.accx
1  AB123 2017-09-01      50    0    NA        NA         NA
2  AB123 2017-01-05      64    3    NA        NA         NA
3  AB123 2017-09-08      45    4    33       159  2.3333333
4  AB123 2017-09-10      70    0    NA        NA         NA
5  AB123 2017-09-11      58    8    15       128  4.0000000
6  AB123 2017-09-12      41   11    NA        NA         NA
7  AB123 2017-09-13      22    2    28        63  6.5000000
8  AB123 2017-09-16      15    5    NA        NA         NA
9  AB123 2017-09-17      90    9    NA        NA         NA
10 AB123 2017-09-20      48   10    40       153  8.0000000
11 AB123 2017-09-22      52    2    NA        NA         NA
12 AB123 2017-09-25      48   22    NA        NA         NA
13 AC234 2017-09-02      29    9    NA        NA         NA
14 AC234 2017-09-03      65    6    NA        NA         NA
15 AC234 2017-09-06      70    7    43       264  9.2000000
16 AC234 2017-09-07      46    6    NA        NA         NA
17 AC234 2017-09-08      88    8    20       134  7.0000000
18 AC234 2017-09-09      71    1    NA        NA         NA
19 AC234 2017-09-13      40    0    NA        NA         NA
20 AC234 2017-09-15      51    1    25       162  0.6666667
21 AC234 2017-09-17      38    8    NA        NA         NA
22 AC234 2017-09-20      91    1    NA        NA         NA
23 AC234 2017-09-23      74    7    50       203  5.3333333
24 AC234 2017-09-25      61    6    NA        NA         NA
25 AD345 2017-09-01      41    4    NA        NA         NA
26 AD345 2017-09-04      33    3    NA        NA         NA
27 AD345 2017-09-09      59    9    NA        NA         NA
28 AD345 2017-09-12      81   11    NA        NA         NA
29 AD345 2017-09-15      72   22    NA        NA         NA
30 AD345 2017-09-18      65   15    48       412 10.0000000
31 AD345 2017-09-19      43   13    NA        NA         NA
32 AD345 2017-09-20      81    1    NA        NA         NA
33 AD345 2017-09-23      20    2    NA        NA         NA
34 AD345 2017-09-27      49    4    NA        NA         NA
35 AD345 2017-09-30      39    9    NA        NA         NA

By the way, there has to be a simpler way to generate the factor of step 1 above, does anyone know how?

Upvotes: 1

kath
kath

Reputation: 7724

There might be a more elegant way, but this works:

library(dplyr)

df.all %>% 
  mutate(date = as.Date(date)) %>% 
  group_by(plate) %>% 
  arrange(plate, date) %>% 
  mutate(t.mileage = cumsum(mileage) * !is.na(liter), 
         t.accx = cumsum(accx) * !is.na(liter), 
         n = seq_len(n())) %>% 
  filter(!is.na(liter)) %>% 
  mutate(t.mileage.lag = lag(t.mileage), 
         t.accx.lag = lag(t.accx),  
         n.lag = lag(n),
         t.mileage = ifelse(!is.na(t.mileage.lag), t.mileage - t.mileage.lag, t.mileage), 
         a.accx = ifelse(!is.na(t.accx.lag), (t.accx - t.accx.lag)/(n - n.lag), t.accx/n)) %>% 
  select(-t.mileage.lag, -t.accx.lag, -n.lag, -n, -t.accx)

# A tibble: 9 x 7
# Groups:   plate [3]
#   plate       date mileage  accx liter t.mileage     a.accx
#   <chr>     <date>   <dbl> <dbl> <dbl>     <dbl>      <dbl>
# 1 AB123 2017-09-08      45     4    33       159  2.3333333
# 2 AB123 2017-09-11      58     8    15       128  4.0000000
# 3 AB123 2017-09-13      22     2    28        63  6.5000000
# 4 AB123 2017-09-20      48    10    40       153  8.0000000
# 5 AC234 2017-09-06      70     7    43       164  7.3333333
# 6 AC234 2017-09-08      88     8    20       134  7.0000000
# 7 AC234 2017-09-15      51     1    25       162  0.6666667
# 8 AC234 2017-09-23      74     7    50       203  5.3333333
# 9 AD345 2017-09-18      65    15    48       351 10.6666667

Group by plate to only calculate the following per different plate. Then calculate the totale mileage and totale accx with cumsum, but only keep the values where we have non-missing liter. Also count how many drives we have with n. Then as we're only interested in the information where we fuel the car we filter by non-missing liter. Using lag subtract the previous total milage and accx from each value (unless there is no previous value, i.e. the lag is NA) and then calculate average accx.

Data

df1 <- 
  data.frame(plate = c("AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345"), 
             date = c("2017-09-08", "2017-09-11", "2017-09-13", "2017-09-20", "2017-09-06", "2017-09-08", 
                      "2017-09-15", "2017-09-23", "2017-09-10", "2017-09-18"), 
             liter = c(33, 15, 28, 40, 43, 20, 25, 50, 26, 48), 
             stringsAsFactors = F)

df2 <-
  data.frame(plate = c("AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345"), 
             date = c("2017-09-01", "2017-01-05", "2017-09-08", "2017-09-10", "2017-09-11", "2017-09-12", "2017-09-13", "2017-09-16", "2017-09-17", "2017-09-20", "2017-09-22", "2017-09-25", "2017-09-02", "2017-09-03", "2017-09-06", "2017-09-07", "2017-09-08", "2017-09-09", "2017-09-13", "2017-09-15", "2017-09-17", "2017-09-20", "2017-09-23", "2017-09-25", "2017-09-01", "2017-09-04", "2017-09-09", "2017-09-12", "2017-09-15", "2017-09-18", "2017-09-19", "2017-09-20", "2017-09-23", "2017-09-27", "2017-09-30"),
             mileage = c(50, 64, 45, 70, 58, 41, 22, 15, 90, 48, 52, 48, 29, 65, 70, 46, 88, 71, 40, 51, 38, 91, 74, 61, 41, 33, 59, 81, 72, 65, 43, 81, 20, 49, 39), 
             accx = c(0, 3, 4, 0, 8, 11, 2, 5, 9, 10, 2, 22, 9, 6, 7, 6, 8, 1, 0, 1, 8, 1, 7, 6, 4, 3, 9, 11, 22, 15, 13, 1, 2, 4, 9), 
             stringsAsFactors = F)

df.all <- left_join(df2, df1, by = c("plate", "date"))

Upvotes: 2

Related Questions