Reputation: 1829
I have a dataframe called (df) with the following structure:
day colA colB
1 4 8
2 9 2
3 8 1
4 9 4
1 5 8
2 2 4
Problem:
I want to create a new column called 'colC' with the following logic:
if day is equal to 1, then colC = colA+colB
if day not equal to 1, then colC = (colA+colB) - previous ColC
Expected solution:
day colA colB colC
1 4 8 12
2 9 2 -1
3 8 1 10
4 9 4 3
1 5 8 13
2 2 4 -7
Current Solution:
for(i in 1:NROW(df)
{
if(day[i] == 1) {colC[i] <- colA[i] + colB[i]}
else {colC[i] <- colA[i] + colB[i] - colC[i-1]}
}
Issue:
Upvotes: 0
Views: 2712
Reputation: 161085
Your data:
df <- structure(list(day = c(1L, 2L, 3L, 4L, 1L, 2L), colA = c(4L,
9L, 8L, 9L, 5L, 2L), colB = c(8L, 2L, 1L, 4L, 8L, 4L)), .Names = c("day",
"colA", "colB"), class = "data.frame", row.names = c(NA, -6L))
The starting case:
df$colC <- df$colA + df$colB
Instead of looping with for
, I suggest iterating over each sequence of days, declared with cumsum(df$day == 1)
. From there, we can use a trick with Reduce
to do a rolling-apply:
df <- do.call(rbind, by(df, cumsum(df$day == 1), function(d) {
d$colC <- Reduce(function(a,b) b-a, d$colC[-1], d$colC[1], accumulate=T)
d
}))
I think your expected output is incorrect on row 5, where it should be simply 5+8
because day==1
.
df
# day colA colB colC
# 1.1 1 4 8 12
# 1.2 2 9 2 -1
# 1.3 3 8 1 10
# 1.4 4 9 4 3
# 2.5 1 5 8 13
# 2.6 2 2 4 -7
Edit: I argue you should update your code to be more group-centric instead of row-centric, but since you feel you need to use for
, start with the original df
:
df$colC <- df$colA + df$colB
for (i in seq_len(nrow(df))) {
df$colC[i] <- df$colC[i] - ifelse(i < 2 | df$day[i] == 1, 0, df$colC[i-1])
}
Or perhaps (slightly faster):
df$colC <- df$colA + df$colB
for (i in seq_len(nrow(df))) {
if (i > 1) df$colC[i] <- df$colC[i] - (df$day[i] != 1) * df$colC[i-1]
}
Upvotes: 1
Reputation: 6695
This should be pretty fast. You need the package dplyr
.
df$colC <- df$colA+df$colB
df$colD <- dplyr::lag(df$colC,1)
df$colC <- ifelse(df$day != 1, df$colC-df$colD, df$colC)
> df[, 1:4]
day colA colB colC
1 1 4 8 12
2 2 9 2 -1
3 3 8 1 -2
4 4 9 4 4
5 1 5 8 13
6 2 2 4 -7
By the way, your expected output seems to be false.
Upvotes: 0
Reputation: 5673
I have a solution using shifts, and looping over the maximum distance between two row of day == 1
df$colC = df$colA + df$colB
for (i in 1:max(diff(which(df$day == 1))))
{
df$colCshift = c(NA,df$colC[1:(length(df$colC)-1)]) # creating a shifted version of colC
df$colC[df$day != 1] = df$colA[df$day != 1] + df$colB[df$day != 1] - df$colCshift[df$day != 1]
# here colC[i-1] is the shifted version of ColC.
}
day colA colB colC colCshift
1 1 4 8 12 NA
2 2 9 2 -1 12
3 3 8 1 10 -1
4 4 9 4 3 10
5 1 5 8 13 3
6 2 2 4 -7 13
The first iteration calculate properly all the rows just after day = 1 everywhere in your data frame. I then update the shifted version of colC, and calculate all rows two lines after day = 1, etc.
Here max(diff(which(df$day == 1)))
should be 7 from what you say
Upvotes: 0