Pedr Nton

Reputation: 79

Optimise and improve code using for-loops and/or functions in R

I have a df of 416x8 and I want to draw several waterfall charts, one per value of a column. I managed to do it, but the code is very long so far (~800 lines). I want to optimise what I have done using functions and loops, but I am stuck!

My data structure looks like this. I copied only one element of the list; the remaining 8 have the same structure and differ only in name:

List of 9
 $ R1 : grouped_df [38 x 8] (S3: grouped_df/tbl_df/tbl/data.frame)
  ..$ chemical_name: Factor w/ 38 levels "1,3-Diphenylguanidine",..: 24 28 11 27 1 13 3 33 26 21 ...
  ..$ cas_number   : chr [1:38] "2164-08-1" "112-18-5" "613-62-7" "7311-30-0" ...
  ..$ Type         : chr [1:38] "Pesticide" "Industrial" "Industrial" "Industrial" ...
  ..$ sites        : chr [1:38] "R1" "R1" "R1" "R1" ...
  ..$ conc_uM      : num [1:38] 0.0000393 0.0000131 0.0003774 0.00001 0.0002736 ...
  ..$ ECOSAR_uM    : num [1:38] 0.0891 0.0701 3.1158 0.0989 4.4114 ...
  ..$ TU_ecosar    : num [1:38] 0.000441 0.000187 0.000121 0.000101 0.000062 ...
  ..$ percent      : num [1:38] 0.3932 0.167 0.1081 0.0906 0.0554 ...
  ..- attr(*, "groups")= tibble [1 x 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ sites: chr "R1"
  .. ..$ .rows: list<int> [1:1] 
  .. .. ..$ : int [1:38] 1 2 3 4 5 6 7 8 9 10 ...
  .. .. ..@ ptype: int(0) 
  .. ..- attr(*, ".drop")= logi TRUE 

example of old code:

#1 add sequence of number according to percentage under new column "id"
df1$id    <- seq_along(df1$percent)
df2$id ..... and so on until df9

#2 repeat a "name" per df under new column "group"
df1$group <- rep("RS1", nrow(df1))
df2$group <- rep("RS2", nrow(df2)) ..... and so on until df9

#3 I did a cumulative sum of the percentages
df1$end <- cumsum(df1$percent)
df2$end <- cumsum(df2$percent) ..... and so on until df9

#4 and I inserted a starting value
df1$start <- c(0, head(df1$end, -0.000001))
df2$start <- c(0, head(df2$end, -0.000001))..... and so on until df9

New part!

I managed to improve the first section and I am happy with the function. I also included a loop at the end to split the list into individual dfs.

# split my df by the values of one column, which creates a list of 9
df_split <- split(df, df$sites)

Then I defined a function to convert a specific column to a factor in each element of the list:

# My function (mutate() comes from dplyr)
library(dplyr)
col.asfactor <- function(x) {
  mutate(x, chemical_name = as.factor(chemical_name))
}

I applied the function and it worked out fine

#apply as.factor function
df_split<- lapply(df_split, col.asfactor)

How to continue from here?

As you can see, my old code is long and tedious (#1, #2, #3, #4). Basically, I do not know how to integrate seq_along, rep, cumsum and the last step into a function or loop in order to optimise my code.

I want to finish the code by splitting my list into 9 independent dfs using the following loop, which works with a dummy df:

new_dfs <- paste0("new_df", seq_along(df_split))  # "new_df1", "new_df2", ..., "new_df9"

# for loop: create one variable per list element
for (i in seq_along(df_split)) {
   assign(new_dfs[i], df_split[[i]])
}

Any suggestion is welcome, and sorry for such a long post.

Upvotes: 1

Views: 78

Answers (2)

David J. Bosak

Reputation: 1624

I think there are two things that can help you here:

  1. The double bracket syntax
  2. The within function

The double bracket syntax allows you to access elements of a list dynamically. The within function allows you to access variables on a data frame with just the variable name. Here is an example:

# Create sample data
df1 <- data.frame(percent = runif(10))
df2 <- data.frame(percent = runif(10))
df3 <- data.frame(percent = runif(10))

# Put data frames in list
lst <- list(df1, df2, df3)

# View original data frames
lst
# [[1]]
# percent
# 1  0.2138321
# 2  0.6917669
# 3  0.9728134
# 4  0.5561451
# 5  0.5280783
# 6  0.2165940
# 7  0.6805653
# 8  0.7460168
# 9  0.4642024
# 10 0.1374181
# 
# [[2]]
# percent
# 1  0.81437961
# 2  0.83216126
# 3  0.35431008
# 4  0.97873284
# 5  0.98803502
# 6  0.63465900
# 7  0.94215238
# 8  0.01400746
# 9  0.02533159
# 10 0.48631865
# 
# [[3]]
# percent
# 1  0.26785198
# 2  0.76407906
# 3  0.48805437
# 4  0.74689735
# 5  0.04256571
# 6  0.44064283
# 7  0.14606584
# 8  0.44194330
# 9  0.28411423
# 10 0.07362424

# Loop through list and perform operations on all data frames
for (i in seq_along(lst)) {
  
  lst[[i]] <- within(lst[[i]], {
         
    id <- seq_along(percent)
    group <- rep(paste0("RS", i), length(percent))
    end <- cumsum(percent)
    start <- c(0, head(end, -1))
  }     
  )
  
}

# View results
lst
[[1]]
     percent     start       end group id
1  0.2138321 0.0000000 0.2138321   RS1  1
2  0.6917669 0.2138321 0.9055990   RS1  2
3  0.9728134 0.9055990 1.8784124   RS1  3
4  0.5561451 1.8784124 2.4345575   RS1  4
5  0.5280783 2.4345575 2.9626358   RS1  5
6  0.2165940 2.9626358 3.1792298   RS1  6
7  0.6805653 3.1792298 3.8597950   RS1  7
8  0.7460168 3.8597950 4.6058119   RS1  8
9  0.4642024 4.6058119 5.0700142   RS1  9
10 0.1374181 5.0700142 5.2074323   RS1 10

[[2]]
      percent     start       end group id
1  0.81437961 0.0000000 0.8143796   RS2  1
2  0.83216126 0.8143796 1.6465409   RS2  2
3  0.35431008 1.6465409 2.0008510   RS2  3
4  0.97873284 2.0008510 2.9795838   RS2  4
5  0.98803502 2.9795838 3.9676188   RS2  5
6  0.63465900 3.9676188 4.6022778   RS2  6
7  0.94215238 4.6022778 5.5444302   RS2  7
8  0.01400746 5.5444302 5.5584377   RS2  8
9  0.02533159 5.5584377 5.5837692   RS2  9
10 0.48631865 5.5837692 6.0700879   RS2 10

[[3]]
      percent    start      end group id
1  0.26785198 0.000000 0.267852   RS3  1
2  0.76407906 0.267852 1.031931   RS3  2
3  0.48805437 1.031931 1.519985   RS3  3
4  0.74689735 1.519985 2.266883   RS3  4
5  0.04256571 2.266883 2.309448   RS3  5
6  0.44064283 2.309448 2.750091   RS3  6
7  0.14606584 2.750091 2.896157   RS3  7
8  0.44194330 2.896157 3.338100   RS3  8
9  0.28411423 3.338100 3.622215   RS3  9
10 0.07362424 3.622215 3.695839   RS3 10
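
Since the question asks about functions, the loop body could also be moved into a helper and applied to every list element. This is only a sketch: `add_waterfall_cols` is a hypothetical name, the sample data is made up, and the "RS" labels simply follow the question's old code.

```r
# Hypothetical helper (not from the original post): adds the four
# waterfall columns to one data frame that has a `percent` column.
add_waterfall_cols <- function(d, label) {
  d$id    <- seq_along(d$percent)
  d$group <- rep(label, nrow(d))
  d$end   <- cumsum(d$percent)
  d$start <- c(0, head(d$end, -1))  # previous row's end; first bar starts at 0
  d
}

# Small stand-in for split(df, df$sites); the values are made up.
df <- data.frame(
  sites   = c("R1", "R1", "R1", "R2", "R2"),
  percent = c(0.5, 0.3, 0.2, 0.6, 0.4)
)
df_split <- split(df, df$sites)

# Map() pairs each data frame with its label and keeps the list names.
labels <- paste0("RS", seq_along(df_split))
df_split <- Map(add_waterfall_cols, df_split, labels)
```

Keeping the results in the named list means each data frame stays reachable as `df_split[["R1"]]`, so separate `new_df1`, `new_df2`, ... variables may not be needed.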

Upvotes: 1

rdodhia

Reputation: 350

So you have 1 df that's 416x8, and you want to create 9 dfs that are Nx4? As the comment says, it would be easier to help if we could see the original df and examples of your desired end result.

One idea is to use data.table or dplyr. You can compute a new data.table with the 4 columns you want, then split them.

Roughly...

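library(data.table)
dt <- setDT(df)  # converts df to a data.table by reference
# group by site: id numbers the rows, start is the cumulative sum shifted by one row
dt.new <- dt[, .(id = seq_len(.N), end = cumsum(percent),
                 start = c(0, head(cumsum(percent), -1))), by = .(sites)]
# left out the "name" column for now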
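
The compute-first, split-later idea can also be sketched in base R without data.table. This is an illustration under assumptions rather than a drop-in solution: the sample data is made up, and `ave()` is swapped in for the grouped data.table call.

```r
# Made-up example data; only `sites` and `percent` mirror the question.
df <- data.frame(
  sites   = c("R1", "R1", "R1", "R2", "R2"),
  percent = c(0.5, 0.3, 0.2, 0.6, 0.4)
)

# ave() applies FUN within each level of `sites` and returns the
# results in the original row order.
df$id    <- ave(df$percent, df$sites, FUN = seq_along)
df$end   <- ave(df$percent, df$sites, FUN = cumsum)
df$start <- df$end - df$percent  # cumulative sum before the current row

# Split only after all columns exist.
df_split <- split(df, df$sites)
```

Because every column is computed on the full data frame, the split at the end is a single call and no per-site code is repeated.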

Upvotes: 0
