Optimise and improve code using for-loop and/or functions in r

Question

I have a df of 416 x 8 and I want to draw several waterfall charts based on a column condition. I managed to do it, but the code is very long so far (~800 lines). I want to optimise it using functions and loops, but I am stuck!

My data structure looks like this. I copied only one element of the list; the remaining 8 have the same structure and differ only in name:

List of 9
 $ R1 : grouped_df [38 x 8] (S3: grouped_df/tbl_df/tbl/data.frame)
  ..$ chemical_name: Factor w/ 38 levels "1,3-Diphenylguanidine",..: 24 28 11 27 1 13 3 33 26 21 ...
  ..$ cas_number   : chr [1:38] "2164-08-1" "112-18-5" "613-62-7" "7311-30-0" ...
  ..$ Type         : chr [1:38] "Pesticide" "Industrial" "Industrial" "Industrial" ...
  ..$ sites        : chr [1:38] "R1" "R1" "R1" "R1" ...
  ..$ conc_uM      : num [1:38] 0.0000393 0.0000131 0.0003774 0.00001 0.0002736 ...
  ..$ ECOSAR_uM    : num [1:38] 0.0891 0.0701 3.1158 0.0989 4.4114 ...
  ..$ TU_ecosar    : num [1:38] 0.000441 0.000187 0.000121 0.000101 0.000062 ...
  ..$ percent      : num [1:38] 0.3932 0.167 0.1081 0.0906 0.0554 ...
  ..- attr(*, "groups")= tibble [1 x 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ sites: chr "R1"
  .. ..$ .rows: list [1:1] 
  .. .. ..$ : int [1:38] 1 2 3 4 5 6 7 8 9 10 ...
  .. .. ..@ ptype: int(0) 
  .. ..- attr(*, ".drop")= logi TRUE

Example of the old code:

#1 add a sequence of numbers according to percentage under a new column "id"
df1$id    <- seq_along(df1$percent)
# df2$id ..... and so on until df9

#2 repeat a "name" per df under a new column "group"
df1$group <- rep("RS1", nrow(df1))
df2$group <- rep("RS2", nrow(df2)) # ..... and so on until df9

#3 cumulative sum of the percentages
df1$end <- cumsum(df1$percent)
df2$end <- cumsum(df2$percent) # ..... and so on until df9

#4 starting value: the previous row's end, with 0 for the first row
df1$start <- c(0, head(df1$end, -1))
df2$start <- c(0, head(df2$end, -1)) # ..... and so on until df9

New part!

I managed to improve the first section and I am happy with the function. I also included a loop at the end to split the list into individual dfs.

# split my df by one column, which creates a list of 9
df_split <- split(df, df$sites)

Then I defined a function to convert a specific column to a factor in each element of the list:

# My function (mutate() comes from dplyr)
library(dplyr)

col.asfactor <- function(x) {
  # Return the data frame with chemical_name converted to a factor
  mutate(x, chemical_name = as.factor(chemical_name))
}

I applied the function and it worked fine:

# apply the as.factor function to every data frame in the list
df_split <- lapply(df_split, col.asfactor)

How to continue from here?

As you can see, my old code is long and tedious (#1, #2, #3, #4). Basically, I do not know how to put seq_along, rep, cumsum, and the last step (the start value) into a function or loop in order to optimise my code.

I want to finish the code by splitting my list into 9 independent dfs using the following loop, which works with a dummy df:

new_dfs <- paste0("new_df", 1:9)  # "new_df1", "new_df2", ..... until "new_df9"

# for loop: create one object per list element
for (i in seq_along(df_split)) {
   assign(new_dfs[i], df_split[[i]])
}

Any suggestion is welcome, and sorry for such a long post.

David J. Bosak · Accepted Answer

I think there are two things that can help you here:

The double bracket syntax
The within function

The double bracket syntax allows you to access elements of a list dynamically. The within function allows you to access variables on a data frame with just the variable name. Here is an example:

# Create sample data
df1 <- data.frame(percent = runif(10))
df2 <- data.frame(percent = runif(10))
df3 <- data.frame(percent = runif(10))

# Put data frames in list
lst <- list(df1, df2, df3)

# View original data frames
lst
# [[1]]
# percent
# 1  0.2138321
# 2  0.6917669
# 3  0.9728134
# 4  0.5561451
# 5  0.5280783
# 6  0.2165940
# 7  0.6805653
# 8  0.7460168
# 9  0.4642024
# 10 0.1374181
# 
# [[2]]
# percent
# 1  0.81437961
# 2  0.83216126
# 3  0.35431008
# 4  0.97873284
# 5  0.98803502
# 6  0.63465900
# 7  0.94215238
# 8  0.01400746
# 9  0.02533159
# 10 0.48631865
# 
# [[3]]
# percent
# 1  0.26785198
# 2  0.76407906
# 3  0.48805437
# 4  0.74689735
# 5  0.04256571
# 6  0.44064283
# 7  0.14606584
# 8  0.44194330
# 9  0.28411423
# 10 0.07362424

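# Loop through list and perform operations on all data frames
for (i in seq_along(lst)) {

  lst[[i]] <- within(lst[[i]], {
    id <- seq_along(percent)
    # Repeat the label once per row; length(percent) is the row count
    group <- rep(paste0("RS", i), length(percent))
    end <- cumsum(percent)
    # Each bar starts where the previous one ended; the first starts at 0
    start <- c(0, head(end, -1))
  })

}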

# View results
lst
[[1]]
     percent     start       end group id
1  0.2138321 0.0000000 0.2138321   RS1  1
2  0.6917669 0.2138321 0.9055990   RS1  2
3  0.9728134 0.9055990 1.8784124   RS1  3
4  0.5561451 1.8784124 2.4345575   RS1  4
5  0.5280783 2.4345575 2.9626358   RS1  5
6  0.2165940 2.9626358 3.1792298   RS1  6
7  0.6805653 3.1792298 3.8597950   RS1  7
8  0.7460168 3.8597950 4.6058119   RS1  8
9  0.4642024 4.6058119 5.0700142   RS1  9
10 0.1374181 5.0700142 5.2074323   RS1 10

[[2]]
      percent     start       end group id
1  0.81437961 0.0000000 0.8143796   RS2  1
2  0.83216126 0.8143796 1.6465409   RS2  2
3  0.35431008 1.6465409 2.0008510   RS2  3
4  0.97873284 2.0008510 2.9795838   RS2  4
5  0.98803502 2.9795838 3.9676188   RS2  5
6  0.63465900 3.9676188 4.6022778   RS2  6
7  0.94215238 4.6022778 5.5444302   RS2  7
8  0.01400746 5.5444302 5.5584377   RS2  8
9  0.02533159 5.5584377 5.5837692   RS2  9
10 0.48631865 5.5837692 6.0700879   RS2 10

[[3]]
      percent    start      end group id
1  0.26785198 0.000000 0.267852   RS3  1
2  0.76407906 0.267852 1.031931   RS3  2
3  0.48805437 1.031931 1.519985   RS3  3
4  0.74689735 1.519985 2.266883   RS3  4
5  0.04256571 2.266883 2.309448   RS3  5
6  0.44064283 2.309448 2.750091   RS3  6
7  0.14606584 2.750091 2.896157   RS3  7
8  0.44194330 2.896157 3.338100   RS3  8
9  0.28411423 3.338100 3.622215   RS3  9
10 0.07362424 3.622215 3.695839   RS3 10
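
Since the question also asks about functions, the same four steps could be wrapped in a helper and applied to the whole list without an explicit loop. This is only a sketch: add_waterfall_cols is a made-up name, the small df_split below is invented sample data standing in for split(df, df$sites), and it uses the list names (R1, R2, ...) as the group label rather than "RS1", "RS2", which is an assumption about what you want.

```r
# Hypothetical helper (not part of the answer above): adds the waterfall
# columns to one data frame. `label` is the group name put on every row.
add_waterfall_cols <- function(d, label) {
  d$id    <- seq_along(d$percent)     # running row index
  d$group <- rep(label, nrow(d))      # same label on every row
  d$end   <- cumsum(d$percent)        # cumulative total including this row
  d$start <- c(0, head(d$end, -1))    # previous row's end; first bar starts at 0
  d
}

# Made-up input standing in for split(df, df$sites)
df_split <- list(
  R1 = data.frame(percent = c(0.5, 0.3, 0.2)),
  R2 = data.frame(percent = c(0.6, 0.4))
)

# Map() walks the data frames and their names in parallel and keeps the names
result <- Map(add_waterfall_cols, df_split, names(df_split))

result$R1
```

Keeping the data frames in the list is usually easier for later steps such as plotting in a loop. If separate objects are really needed, list2env(result, envir = .GlobalEnv) should create one object per list name, which would replace the assign() loop.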
