Reputation: 5069
Given a data frame df like the one below:
text <- "
parameter,car,qtr,val
a,a3,FY18Q1,23
b,a3,FY18Q1,10000
a,a3,FY18Q2,14
b,a3,FY18Q2,12000
a,cla,FY18Q1,15
b,cla,FY18Q1,12000
c,cla,FY18Q1,5.5
a,cla,FY18Q2,26
b,cla,FY18Q2,10000
c,cla,FY18Q2,6.2
"
df <- read.table(textConnection(text), sep = ",", header = TRUE)
I want to add a row with parameter b_diff for each car, qtr combination, with val equal to the difference in parameter b between two consecutive qtr values. The qtr values in ascending order are FY18Q1, FY18Q2. For the first qtr, FY18Q1, the val for b_diff should be NA, as there is no previous qtr.
The expected output is as below.
parameter car qtr val
a a3 FY18Q1 23
b a3 FY18Q1 10000
b_diff a3 FY18Q1 NA
a a3 FY18Q2 14
b a3 FY18Q2 12000
b_diff a3 FY18Q2 2000
a cla FY18Q1 15
b cla FY18Q1 12000
c cla FY18Q1 5.5
b_diff cla FY18Q1 NA
a cla FY18Q2 26
b cla FY18Q2 10000
c cla FY18Q2 6.2
b_diff cla FY18Q2 -2000
How do I go about doing this with dplyr?
Upvotes: 0
Views: 189
Reputation: 13913
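Here is one algorithm:
1. Reshape the data to wide format, so that qtr and car form a unique row index, with the parameter column "spread" into columns.
2. Within each car, compute the lagged difference of the column for parameter b.
3. Reshape the result back to long format.
Equivalent code, using reshape2 and dplyr: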
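library(dplyr)
library(reshape2)

# optional. you could just use `c(NA, diff(x))` below, but this is more general
padded_diff <- function(x, lag = 1L) {
  c(rep.int(NA, lag), diff(x, lag = lag))
}

df %>%
  dcast(car + qtr ~ parameter, value.var = "val") %>%
  group_by(car) %>%   # diff within each car, so the first qtr of every car is NA
  mutate(b_diff = padded_diff(b)) %>%
  ungroup() %>%
  melt(id.vars = c("car", "qtr"), variable.name = "parameter", value.name = "val") %>%
  # drop parameters a car never had (c for a3), but keep the NA b_diff rows
  filter(!is.na(val) | parameter == "b_diff") %>%
  arrange(car, qtr, parameter)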
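The same wide-then-long round trip can also be sketched with tidyr in place of reshape2. This is only an illustration under a few assumptions: tidyr 1.0 or later is installed (for pivot_wider and pivot_longer), every car has a b value in each quarter, and the name result is mine rather than anything from the question.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt here so the sketch runs on its own
df <- read.csv(text = "parameter,car,qtr,val
a,a3,FY18Q1,23
b,a3,FY18Q1,10000
a,a3,FY18Q2,14
b,a3,FY18Q2,12000
a,cla,FY18Q1,15
b,cla,FY18Q1,12000
c,cla,FY18Q1,5.5
a,cla,FY18Q2,26
b,cla,FY18Q2,10000
c,cla,FY18Q2,6.2", stringsAsFactors = FALSE)

result <- df %>%
  # wide: one row per car/qtr, one column per parameter
  pivot_wider(names_from = parameter, values_from = val) %>%
  arrange(car, qtr) %>%
  group_by(car) %>%
  # change from the previous quarter of the same car; the first quarter is NA
  mutate(b_diff = b - dplyr::lag(b)) %>%
  ungroup() %>%
  # long again: one row per car/qtr/parameter
  pivot_longer(cols = -c(car, qtr), names_to = "parameter", values_to = "val") %>%
  # drop combinations that never existed (c for a3), keep the NA b_diff rows
  filter(!is.na(val) | parameter == "b_diff") %>%
  select(parameter, car, qtr, val)

print(result)
```

Grouping by car before taking the difference is what keeps the first quarter of each car at NA instead of comparing it with the last quarter of the previous car.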
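Here is another algorithm:
1. Group the data by car.
2. Within each group, filter so that only the rows where parameter == "b" are present.
3. Compute the lagged difference of the val column for those rows and label them b_diff.
4. Append the new rows to the unfiltered rows of the group.
Equivalent code, using only dplyr, with a temporary table to simulate a "removable" filter: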
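library(dplyr)

# relies on padded_diff() as defined above
make_b_diff_within_group <- function(df) {
  tmp <- df %>%
    filter(parameter == "b") %>%
    arrange(qtr) %>%   # the diff assumes quarters are in ascending order
    transmute(
      car = car,       # keep car, otherwise the new rows end up with car = NA
      qtr = qtr,
      val = padded_diff(val),
      parameter = "b_diff")
  bind_rows(df, tmp)
}

df %>%
  group_by(car) %>%
  do(make_b_diff_within_group(.)) %>%
  ungroup() %>%
  # parameter is sorted as text here, so b_diff comes before c
  arrange(car, qtr, parameter)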
This second algorithm could be implemented using several other "split-apply-combine" paradigms, including the tapply or by functions in base R, the ddply function in the plyr package (an ancestor of dplyr by the same author), and the base R split function used together with dplyr, as shown in this answer.
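As a rough illustration of the base R route, the sketch below uses split, lapply and do.call(rbind, ...) rather than tapply or by. The helper name add_b_diff and the variables pieces and result are hypothetical, and the sketch assumes that qtr labels sort correctly as plain text (true for FY18Q1, FY18Q2).

```r
# Sample data from the question, rebuilt here so the sketch runs on its own
df <- read.csv(text = "parameter,car,qtr,val
a,a3,FY18Q1,23
b,a3,FY18Q1,10000
a,a3,FY18Q2,14
b,a3,FY18Q2,12000
a,cla,FY18Q1,15
b,cla,FY18Q1,12000
c,cla,FY18Q1,5.5
a,cla,FY18Q2,26
b,cla,FY18Q2,10000
c,cla,FY18Q2,6.2", stringsAsFactors = FALSE)

# Hypothetical helper: append b_diff rows to the rows of a single car
add_b_diff <- function(d) {
  b <- d[d$parameter == "b", ]
  if (nrow(b) == 0) return(d)      # nothing to difference for this car
  b <- b[order(b$qtr), ]           # quarters in ascending order
  b$val <- c(NA, diff(b$val))      # the first quarter has no predecessor
  b$parameter <- "b_diff"
  rbind(d, b)
}

pieces <- lapply(split(df, df$car), add_b_diff)   # split, then apply
result <- do.call(rbind, pieces)                  # combine

# order() leaves ties in their original order, so within each car/qtr
# the b_diff row stays after the original rows
result <- result[order(result$car, result$qtr), ]
rownames(result) <- NULL

print(result)
```

No packages are needed here, at the cost of handling the row order and row names by hand.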
Upvotes: 1
Reputation: 39174
A solution using dplyr and purrr. We can create a group ID with group_indices, split the data frame on that ID, compute the difference of b within each car, and then combine the pieces. df5 is the final output.
library(dplyr)
library(purrr)
df2 <- df %>% mutate(GroupID = group_indices(., car, qtr))
df3 <- df2 %>%
  filter(parameter %in% "b") %>%
  group_by(car) %>%
  mutate(val = val - lag(val), parameter = "b_diff") %>%
  ungroup() %>%
  split(f = .$GroupID)
df4 <- df2 %>% split(f = .$GroupID)
df5 <- map2_dfr(df4, df3, bind_rows) %>% select(-GroupID)
df5
# parameter car qtr val
# 1 a a3 FY18Q1 23.0
# 2 b a3 FY18Q1 10000.0
# 3 b_diff a3 FY18Q1 NA
# 4 a a3 FY18Q2 14.0
# 5 b a3 FY18Q2 12000.0
# 6 b_diff a3 FY18Q2 2000.0
# 7 a cla FY18Q1 15.0
# 8 b cla FY18Q1 12000.0
# 9 c cla FY18Q1 5.5
# 10 b_diff cla FY18Q1 NA
# 11 a cla FY18Q2 26.0
# 12 b cla FY18Q2 10000.0
# 13 c cla FY18Q2 6.2
# 14 b_diff cla FY18Q2 -2000.0
DATA
Note that it is better to read the data with stringsAsFactors = FALSE.
text <- "
parameter,car,qtr,val
a,a3,FY18Q1,23
b,a3,FY18Q1,10000
a,a3,FY18Q2,14
b,a3,FY18Q2,12000
a,cla,FY18Q1,15
b,cla,FY18Q1,12000
c,cla,FY18Q1,5.5
a,cla,FY18Q2,26
b,cla,FY18Q2,10000
c,cla,FY18Q2,6.2
"
df <- read.table(textConnection(text), sep = ",", header = TRUE, stringsAsFactors = FALSE)
Upvotes: 2