Reputation: 1829

Apply a for loop by group in a dataframe

I have a dataframe called (df) with the following structure:

day colA colB
1   4    8
2   9    2
3   8    1
4   9    4 
1   5    8
2   2    4

Problem:

I want to create a new column called 'colC' with the following logic:

if day is equal to 1, then colC = colA+colB
if day not equal to 1, then colC = (colA+colB) - previous ColC

Expected solution:

day colA colB colC
1   4    8    12
2   9    2    -1
3   8    1    10
4   9    4    3 
1   5    8    13
2   2    4    -7

Current Solution:

for(i in 1:NROW(df)
{
if(day[i] == 1) {colC[i] <- colA[i] + colB[i]} 
else {colC[i] <- colA[i] + colB[i] - colC[i-1]}  
}

Issue:

The size of the dataset is too large hence, for loop is tremendously slow
In my dataset the day column as values till day 7
Basically I am looking for a solution where dataset can be split and for loop can be applied independently for each subset (assume each continous day 1-4 are one group)

Upvotes: 0

Answers (3)

r2evans

Reputation: 161085

Your data:

df <- structure(list(day = c(1L, 2L, 3L, 4L, 1L, 2L), colA = c(4L, 
9L, 8L, 9L, 5L, 2L), colB = c(8L, 2L, 1L, 4L, 8L, 4L)), .Names = c("day", 
"colA", "colB"), class = "data.frame", row.names = c(NA, -6L))

The starting case:

df$colC <- df$colA + df$colB

Instead of looping with for, I suggest iterating over each sequence of days, declared with cumsum(df$day == 1). From there, we can use a trick with Reduce to do a rolling-apply:

df <- do.call(rbind, by(df, cumsum(df$day == 1), function(d) {
  d$colC <- Reduce(function(a,b) b-a, d$colC[-1], d$colC[1], accumulate=T)
  d
}))

I think your expected output is incorrect on row 5, where it should be simply 5+8 because day==1.

df
#     day colA colB colC
# 1.1   1    4    8   12
# 1.2   2    9    2   -1
# 1.3   3    8    1   10
# 1.4   4    9    4    3
# 2.5   1    5    8   13
# 2.6   2    2    4   -7

Edit: I argue you should update your code to be more group-centric instead of row-centric, but since you feel you need to use for, start with the original df:

df$colC <- df$colA + df$colB
for (i in seq_len(nrow(df))) {
  df$colC[i] <- df$colC[i] - ifelse(i < 2 | df$day[i] == 1, 0, df$colC[i-1])
}

Or perhaps (slightly faster):

df$colC <- df$colA + df$colB
for (i in seq_len(nrow(df))) {
  if (i > 1) df$colC[i] <- df$colC[i] - (df$day[i] != 1) * df$colC[i-1]
}

Upvotes: 1

LAP

Reputation: 6695

This should be pretty fast. You need the package dplyr.

df$colC <- df$colA+df$colB

df$colD <- dplyr::lag(df$colC,1)

df$colC <- ifelse(df$day != 1, df$colC-df$colD, df$colC)

> df[, 1:4]
  day colA colB colC
1   1    4    8   12
2   2    9    2   -1
3   3    8    1   -2
4   4    9    4    4
5   1    5    8   13
6   2    2    4   -7

By the way, your expected output seems to be false.

Upvotes: 0

denis

Reputation: 5673

I have a solution using shifts, and looping over the maximum distance between two row of day == 1

df$colC = df$colA + df$colB

for (i in 1:max(diff(which(df$day == 1))))
{
df$colCshift = c(NA,df$colC[1:(length(df$colC)-1)]) # creating a shifted version of colC
df$colC[df$day != 1] = df$colA[df$day != 1] + df$colB[df$day != 1] - df$colCshift[df$day != 1]
# here colC[i-1] is the shifted version of ColC.
}

  day colA colB colC colCshift
1   1    4    8   12        NA
2   2    9    2   -1        12
3   3    8    1   10        -1
4   4    9    4    3        10
5   1    5    8   13         3
6   2    2    4   -7        13

The first iteration calculate properly all the rows just after day = 1 everywhere in your data frame. I then update the shifted version of colC, and calculate all rows two lines after day = 1, etc.

Here max(diff(which(df$day == 1))) should be 7 from what you say

Upvotes: 0

Apply a for loop by group in a dataframe

Answers (3)

Related Questions