Sergio Chavez Villa
Sergio Chavez Villa

Reputation: 89

For loop for dataframes in R

I am trying to do a function of decumulation with a for loop in R because the financial information provided by the company is accumulated for different concepts (this means that the info of January is only of January, the info of February is the sum of January and February, the one of March is the sum of January, February and March, etc.).

For example, let's say that I have the next dataframe:

Concepts <- c("Concept1", "Concept2", "Concept3")
January <- c(5,10,16)
February <- c(9,14,20)
March <- c(16,20,23)

df <- data.frame(Concepts, January, February, March)

This will give me the next dataframe:

Concepts  January  February  March
Concept1    5         9        16 
Concept2    10        14       20
Concept3    16        20       23 

What I need to achieve is the next dataframe (Notice that February is the difference between February and January, and March is the difference between February and March):

Concepts  January  February  March
Concept1    5         4        7 
Concept2    10        4        6
Concept3    16        4        3

To achieve the second dataframe, I first created an empty dataframe with the same amount of rows of df, then with a for loop cbind the first two rows of the dataframe (because they do not need any manipulation) and with the index add the next ones after calculated the difference. The above in code is as follows:

df <- data.frame(Concepts, January, February, March)
df2 <- data.frame(matrix(nrow=nrow(df),ncol=ncol(df))) #Empty Dataframe with the same number  of rows

for(i in 1:ncol(df)) {
  if(i == 1){
    df2 <- cbind(df2, df[ , i])
  } else if (i == 2){
    df2 <- cbind(df2, df[, i])
  } else {
    diference <- df[,i] - df[,i-1]
    df2 <- cbind(df2,diference)
  }

I get the following error:

error in [.data.table(df, , i) : j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[, ..i]. This difference to data.frame is deliberate and explained in FAQ 1.1.

I would love to receive a correction to my code or some alternative that allows me to calculate the above for a dataframe of many years.

Upvotes: 7

Views: 8985

Answers (4)

Avraham
Avraham

Reputation: 1719

This may not be the most elegant, but it should work. The trick is to extract the numeric portion of the data frame and apply diff—by row—to the results, transpose it, and paste it back to the initial values.

df <- data.frame(Concepts = c("Concept1", "Concept2", "Concept3"),
                 January = c(5,10,16),
                 February = c(9,14,20),
                 March = c(16,20,23),
                 April = c(20, 27, 33))

dfdiff <- apply(df[, -1L], 1L, diff)
df2 <- data.frame(Concepts = c("Concept1", "Concept2", "Concept3"),
                  January = c(5,10,16))
df2 <- cbind(df2, t(dfdiff))                  
df2
  Concepts January February March April
1 Concept1       5        4     7     4
2 Concept2      10        4     6     7
3 Concept3      16        4     3    10

Now that you know how it works, for a more efficient call you could just do:

df2 <- cbind(df[, 1:2], t(apply(df[, -1L], 1L, diff)))

Which should work for any size dataframe of the structure you had above: One heading column and the remainder being cumulative data columns.

Speed comparison with tidyverse method

microbenchmark(TV = df2 <- df %>% pivot_longer(!Concepts) %>% group_by(Concepts) %>%
                 dplyr::mutate(value2 = value - lag(value, default = first(value))) %>%
                 rowwise %>%mutate(value2 = ifelse(value2 == 0, value, value2)) %>%
                 select(-value) %>%pivot_wider(names_from = "name", values_from = "value2"),
               BASE = df2 <- cbind(df[, 1:2], t(apply(df[, -1L], 1L, diff))),
               times = 1000L, control = list(order = 'block'))

Unit: microseconds
 expr     min       lq       mean   median       uq     max neval cld
   TV 11141.7 11554.05 12245.1253 11803.75 12300.25 22903.5  1000   b
 BASE   160.4   164.35   176.8356   165.70   168.70  3833.2  1000  a 

Upvotes: 0

Sergio Chavez Villa
Sergio Chavez Villa

Reputation: 89

The solution to my problem was more simple that I expected:

for(i in 1:ncol(df)) {
  if(i == 1){
   df2 <- cbind(df2, df[ ,..i])
} else if (i == 2){
   df2 <- cbind(df2, df[,..i])
} else {
   diference <- df[,i] - df[,..i-1]
   df2 <- cbind(df2,diference)
}

Thanks for all the alternative solutions!

Upvotes: 1

AndrewGB
AndrewGB

Reputation: 16876

A tidyverse solution:

library(tidyverse)

df %>%
  pivot_longer(!Concepts) %>%
  group_by(Concepts) %>%
  mutate(value2 = value - lag(value, default = first(value))) %>%
  rowwise %>%
  mutate(value2 = ifelse(value2 == 0, value, value2)) %>%
  select(-value) %>%
  pivot_wider(names_from = "name", values_from = "value2")

Output

# A tibble: 3 × 4
  Concepts January February March
  <chr>      <dbl>    <dbl> <dbl>
1 Concept1       5        4     7
2 Concept2      10        4     6
3 Concept3      16        4     3

Upvotes: 0

Rui Barradas
Rui Barradas

Reputation: 76651

First note that if you apply base function diff to the months columns, you will get one column less but transposed.

apply(df[-1], 1, diff)
#         [,1] [,2] [,3]
#February    4    4    4
#March       7    6    4

So transpose it to get the right orientation.

t(apply(df[-1], 1, diff))
#     February March
#[1,]        4     7
#[2,]        4     6
#[3,]        4     4

And cbind it with the first 2 columns. Since the first argument is a subset of a data.frame, the method called is cbind.data.frame and the result is also a df.

cbind(df[1:2], t(apply(df[-1], 1, diff)))
#  Concepts January February March
#1 Concept1       5        4     7
#2 Concept2      10        4     6
#3 Concept3      15        4     4

Upvotes: 3

Related Questions