R: Sum values based on 2 condition from 2 data frame of different length

Question

I'm trying to get the total & average of some variables based on date. I have fuel data & driving data of several vehicles. Fuel data consist of several dates whereas driving data consist of more dates (Logically you'll only refuel after x trips). My end result would be to get the sum/average driving data based on fuel dates.

Fuel data:

plate = c("AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345") 
date = c("2017-09-08", "2017-09-11", "2017-09-13", "2017-09-20", "2017-09-06", "2017-09-08", "2017-09-15", "2017-09-23", "2017-09-10", "2017-09-18")
liter = c(33, 15, 28, 40, 43, 20, 25, 50, 26, 48) 
df1 = data.frame(plate, date, liter)

Driving data:

plate = c("AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345") 
date = c("2017-09-01", "2017-01-05", "2017-09-08", "2017-09-10", "2017-09-11", "2017-09-12", "2017-09-13", "2017-09-16", "2017-09-17", "2017-09-20", "2017-09-22", "2017-09-25", "2017-09-02", "2017-09-03", "2017-09-06", "2017-09-07", "2017-09-08", "2017-09-09", "2017-09-13", "2017-09-15", "2017-09-17", "2017-09-20", "2017-09-23", "2017-09-25", "2017-09-01", "2017-09-04", "2017-09-09", "2017-09-12", "2017-09-15", "2017-09-18", "2017-09-19", "2017-09-20", "2017-09-23", "2017-09-27", "2017-09-30")
mileage = c(50, 64, 45, 70, 58, 41, 22, 15, 90, 48, 52, 48, 29, 65, 70, 46, 88, 71, 40, 51, 38, 91, 74, 61, 41, 33, 59, 81, 72, 65, 43, 81, 20, 49, 39)
accx = c(0, 3, 4, 0, 8, 11, 2, 5, 9, 10, 2, 22, 9, 6, 7, 6, 8, 1, 0, 1, 8, 1, 7, 6, 4, 3, 9, 11, 22, 15, 13, 1, 2, 4, 9)
df2 = data.frame(plate, date, mileage,accx)

Merging both data

df.all = left_join(df2, df1, by.x =c("plate", "date"))

I'd like to get the total mileage(sum) and average accx based on fuel dates. The end result should look like this:

Expected data frame

Is there a way to use dplyr to mutate the desired results? FYI I only need the mutated parameters with their plates (result line 1,2,4,6,8,etc isn't needed) Thanks in advance!

kath · Accepted Answer

There might be a more elegant way, but this works:

library(dplyr)

df.all %>% 
  mutate(date = as.Date(date)) %>% 
  group_by(plate) %>% 
  arrange(plate, date) %>% 
  mutate(t.mileage = cumsum(mileage) * !is.na(liter), 
         t.accx = cumsum(accx) * !is.na(liter), 
         n = seq_len(n())) %>% 
  filter(!is.na(liter)) %>% 
  mutate(t.mileage.lag = lag(t.mileage), 
         t.accx.lag = lag(t.accx),  
         n.lag = lag(n),
         t.mileage = ifelse(!is.na(t.mileage.lag), t.mileage - t.mileage.lag, t.mileage), 
         a.accx = ifelse(!is.na(t.accx.lag), (t.accx - t.accx.lag)/(n - n.lag), t.accx/n)) %>% 
  select(-t.mileage.lag, -t.accx.lag, -n.lag, -n, -t.accx)

# A tibble: 9 x 7
# Groups:   plate [3]
#   plate       date mileage  accx liter t.mileage     a.accx
#                        
# 1 AB123 2017-09-08      45     4    33       159  2.3333333
# 2 AB123 2017-09-11      58     8    15       128  4.0000000
# 3 AB123 2017-09-13      22     2    28        63  6.5000000
# 4 AB123 2017-09-20      48    10    40       153  8.0000000
# 5 AC234 2017-09-06      70     7    43       164  7.3333333
# 6 AC234 2017-09-08      88     8    20       134  7.0000000
# 7 AC234 2017-09-15      51     1    25       162  0.6666667
# 8 AC234 2017-09-23      74     7    50       203  5.3333333
# 9 AD345 2017-09-18      65    15    48       351 10.6666667

Group by plate to only calculate the following per different plate. Then calculate the totale mileage and totale accx with cumsum, but only keep the values where we have non-missing liter. Also count how many drives we have with n. Then as we're only interested in the information where we fuel the car we filter by non-missing liter. Using lag subtract the previous total milage and accx from each value (unless there is no previous value, i.e. the lag is NA) and then calculate average accx.

Data

df1 <- 
  data.frame(plate = c("AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345"), 
             date = c("2017-09-08", "2017-09-11", "2017-09-13", "2017-09-20", "2017-09-06", "2017-09-08", 
                      "2017-09-15", "2017-09-23", "2017-09-10", "2017-09-18"), 
             liter = c(33, 15, 28, 40, 43, 20, 25, 50, 26, 48), 
             stringsAsFactors = F)

df2 <-
  data.frame(plate = c("AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345"), 
             date = c("2017-09-01", "2017-01-05", "2017-09-08", "2017-09-10", "2017-09-11", "2017-09-12", "2017-09-13", "2017-09-16", "2017-09-17", "2017-09-20", "2017-09-22", "2017-09-25", "2017-09-02", "2017-09-03", "2017-09-06", "2017-09-07", "2017-09-08", "2017-09-09", "2017-09-13", "2017-09-15", "2017-09-17", "2017-09-20", "2017-09-23", "2017-09-25", "2017-09-01", "2017-09-04", "2017-09-09", "2017-09-12", "2017-09-15", "2017-09-18", "2017-09-19", "2017-09-20", "2017-09-23", "2017-09-27", "2017-09-30"),
             mileage = c(50, 64, 45, 70, 58, 41, 22, 15, 90, 48, 52, 48, 29, 65, 70, 46, 88, 71, 40, 51, 38, 91, 74, 61, 41, 33, 59, 81, 72, 65, 43, 81, 20, 49, 39), 
             accx = c(0, 3, 4, 0, 8, 11, 2, 5, 9, 10, 2, 22, 9, 6, 7, 6, 8, 1, 0, 1, 8, 1, 7, 6, 4, 3, 9, 11, 22, 15, 13, 1, 2, 4, 9), 
             stringsAsFactors = F)

df.all <- left_join(df2, df1, by = c("plate", "date"))

R: Sum values based on 2 condition from 2 data frame of different length

Answers (2)

Data

Related Questions