Demo
Demo

Reputation: 301

Interpolate missing values, and conducting cumulative sum based on other columns in R

I have a data frame like this:

wpt    ID   Fuel  Dist  Express   Local
 1     S36   12    1     1         0
 2     S36   14    2     1         0
 inter S36   NA    NA    1         0
 inter S36   NA    NA    1         0
 3     S36   16    4     1         0
 inter S36   NA    NA    0         1
 4     S36   18    6     0         1
 5     S36   22    7     0         1
 6     W09   45    9     1         0
 inter W09   NA    NA    1         0
 inter W09   NA    NA    1         0
 inter W09   NA    NA    1         0
 7     W09   48    14    0         1
 8     W09   50    15    0         1

(1)I'd like to interpolate and insert values to the places with NAs of Fuel and Dist column. I treat the rows with "inter" together with it start and end rows of regular numbered "wpt" as a unit. Then conduct interpolation.

The expected output is like this:

wpt    ID   Fuel     Dist  Express   Local
 1     S36   12       1     1         0
 2     S36   14       2     1         0
 inter S36   14.6667  2.67  1         0
 inter S36   15.3333  3.33  1         0
 3     S36   16       4     1         0
 inter S36   17       5     0         1
 4     S36   18       6     0         1
 5     S36   22       7     0         1
 6     W09   45       9     1         0
 inter W09   45.75    10.25 1         0
 inter W09   46.50    11.50 1         0
 inter W09   47.25    12.75 1         0
 7     W09   48       14    0         1
 8     W09   50       15    0         1

To be clear, the first segment interpolation is calculated like this:

> seq(14,16,length.out = 4)
[1] 14.00000 14.66667 15.33333 16.00000

(2) Then I want to get cumulative sum of each class of Express and Local by ID. The expected output is like this:

ID  Cumsum.Fuel  Cumsum.Dist Express  Local
S36    4             3          1       0
S36    5             2          0       1
W09    2.25          3.75       1       0
W09    2             1          0       1

To be clear, Cum.sum.Fuel for "S36" for Express is 16-12=4. The same applies to others.

Thanks in advance!!!

Upvotes: 1

Views: 468

Answers (2)

akrun
akrun

Reputation: 887951

To fill both the columns, we can use mutate_at after grouping by "ID"

library(dplyr)
library(zoo)
df2 <- df1 %>% 
         group_by(ID) %>% 
         mutate_at(vars(Fuel, Dist), na.approx) 
df2
#     wpt    ID     Fuel      Dist Express Local
#   <chr> <chr>    <dbl>     <dbl>   <int> <int>
#1      1   S36 12.00000  1.000000       1     0
#2      2   S36 14.00000  2.000000       1     0
#3  inter   S36 14.66667  2.666667       1     0
#4  inter   S36 15.33333  3.333333       1     0
#5      3   S36 16.00000  4.000000       1     0
#6  inter   S36 17.00000  5.000000       0     1
#7      4   S36 18.00000  6.000000       0     1
#8      5   S36 22.00000  7.000000       0     1
#9      6   W09 45.00000  9.000000       1     0
#10 inter   W09 45.75000 10.250000       1     0
#11 inter   W09 46.50000 11.500000       1     0
#12 inter   W09 47.25000 12.750000       1     0
#13     7   W09 48.00000 14.000000       0     1
#14     8   W09 50.00000 15.000000       0     1

For the second part,

library(data.table)
df2 %>%
   group_by(ID, Express1 = rleid(Express), Local1 = rleid(Local)) %>%
   summarise(Express = first(Express),
             Local = first(Local), 
             Cumsum.Fuel = last(Fuel) - first(Fuel),
             Cumsum.Dist = last(Dist) - first(Dist))  %>%
    ungroup() %>% 
    select(-Express1, - Local1)
#Source: local data frame [4 x 5]
#    ID Express Local Cumsum.Fuel Cumsum.Dist
#  <chr>   <int> <int>       <dbl>       <dbl>
#1   S36       1     0        4.00        3.00
#2   S36       0     1        5.00        2.00
#3   W09       1     0        2.25        3.75
#4   W09       0     1        2.00        1.00

Or we can do this without the rleid

df2 %>%
    group_by(ID, Express, Local) %>% 
    summarise(Cumsum.Fuel = last(Fuel) - first(Fuel), 
              Cumsum.Dist = last(Dist) - first(Dist))  

Upvotes: 1

DatamineR
DatamineR

Reputation: 9628

For the first task you can use:

library(zoo)
na.approx(df$Fuel)
 [1] 12.00000 14.00000 14.66667 15.33333 16.00000 17.00000 18.00000 22.00000 45.00000 45.75000
[11] 46.50000 47.25000 48.00000 50.00000

Upvotes: 3

Related Questions