Reputation: 79
I have a data frame (416 x 8) and I want to draw several waterfall charts, one per value of a grouping column. I managed to do it, but the code is very long so far (~800 lines). I want to shorten it using functions and loops, but I am stuck!
My data are structured like this. I copied only one element of the list; the remaining 8 have the same structure and differ only in the list name:
List of 9
$ R1 : grouped_df [38 x 8] (S3: grouped_df/tbl_df/tbl/data.frame)
..$ chemical_name: Factor w/ 38 levels "1,3-Diphenylguanidine",..: 24 28 11 27 1 13 3 33 26 21 ...
..$ cas_number : chr [1:38] "2164-08-1" "112-18-5" "613-62-7" "7311-30-0" ...
..$ Type : chr [1:38] "Pesticide" "Industrial" "Industrial" "Industrial" ...
..$ sites : chr [1:38] "R1" "R1" "R1" "R1" ...
..$ conc_uM : num [1:38] 0.0000393 0.0000131 0.0003774 0.00001 0.0002736 ...
..$ ECOSAR_uM : num [1:38] 0.0891 0.0701 3.1158 0.0989 4.4114 ...
..$ TU_ecosar : num [1:38] 0.000441 0.000187 0.000121 0.000101 0.000062 ...
..$ percent : num [1:38] 0.3932 0.167 0.1081 0.0906 0.0554 ...
..- attr(*, "groups")= tibble [1 x 2] (S3: tbl_df/tbl/data.frame)
.. ..$ sites: chr "R1"
.. ..$ .rows: list<int> [1:1]
.. .. ..$ : int [1:38] 1 2 3 4 5 6 7 8 9 10 ...
.. .. ..@ ptype: int(0)
.. ..- attr(*, ".drop")= logi TRUE
Example of the old code:
#1 add a sequence of numbers according to percentage under a new column "id"
df1$id <- seq_along(df1$percent)
df2$id ..... and so on until df9
#2 repeat a "name" per df under new column "group"
df1$group <- rep("RS1", nrow(df1))
df2$group <- rep("RS2", nrow(df2)) ..... and so on until df9
#3 I did a cumulative sum of the percentages
df1$end <- cumsum(df1$percent)
df2$end <- cumsum(df2$percent) ..... and so on until df9
#4 and I inserted a starting value
df1$start <- c(0, head(df1$end, -0.000001))
df2$start <- c(0, head(df2$end, -0.000001))..... and so on until df9
New part!
I managed to improve the first section with a function I am happy with, and I also added a loop at the end to split the list into individual data frames.
# split my df by one column, which creates a list of 9
df_split <- split(df, df$sites)
Then I defined a function to convert a specific column to a factor in each element of the list:
#My function
col.asfactor = function(x) {
x = mutate(x, chemical_name = as.factor(chemical_name))
}
I applied the function and it worked out fine
#apply as.factor function
df_split<- lapply(df_split, col.asfactor)
How to continue from here?
As you can see, my old code is long and tedious (#1, #2, #3, #4). Basically, I do not know how to put seq_along, rep, cumsum and the last step (the start column) into a function or loop in order to shorten my code.
I want to finish by splitting the list into 9 independent data frames using the following loop, which works on a dummy_df:
new_dfs <- c("new_df1", "new_df2", ..... until "new_df9")
# for in Loop
for (i in 1:length(df_split)) {
assign(new_dfs[i], df_split[[i]])
}
Any suggestions are welcome, and sorry for such a long post.
Upvotes: 1
Views: 78
Reputation: 1624
I think there are two things that can help you here: the double bracket syntax ([[ ]]) and the within function.
The double bracket syntax allows you to access elements of a list dynamically. The within function allows you to refer to the columns of a data frame by just the variable name and returns a modified copy of the data frame. Here is an example:
# Create sample data
df1 <- data.frame(percent = runif(10))
df2 <- data.frame(percent = runif(10))
df3 <- data.frame(percent = runif(10))
# Put data frames in list
lst <- list(df1, df2, df3)
# View original data frames
lst
# [[1]]
# percent
# 1 0.2138321
# 2 0.6917669
# 3 0.9728134
# 4 0.5561451
# 5 0.5280783
# 6 0.2165940
# 7 0.6805653
# 8 0.7460168
# 9 0.4642024
# 10 0.1374181
#
# [[2]]
# percent
# 1 0.81437961
# 2 0.83216126
# 3 0.35431008
# 4 0.97873284
# 5 0.98803502
# 6 0.63465900
# 7 0.94215238
# 8 0.01400746
# 9 0.02533159
# 10 0.48631865
#
# [[3]]
# percent
# 1 0.26785198
# 2 0.76407906
# 3 0.48805437
# 4 0.74689735
# 5 0.04256571
# 6 0.44064283
# 7 0.14606584
# 8 0.44194330
# 9 0.28411423
# 10 0.07362424
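# Loop through list and perform operations on all data frames
for (i in seq_along(lst)) {
  lst[[i]] <- within(lst[[i]], {
    id <- seq_along(percent)
    # One label per row; length(percent) is the row count of the current data frame
    group <- rep(paste0("RS", i), length(percent))
    end <- cumsum(percent)
    # head(end, -1) drops the last element, so each start is the previous end
    start <- c(0, head(end, -1))
  })
}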
# View results
lst
# [[1]]
# percent start end group id
# 1 0.2138321 0.0000000 0.2138321 RS1 1
# 2 0.6917669 0.2138321 0.9055990 RS1 2
# 3 0.9728134 0.9055990 1.8784124 RS1 3
# 4 0.5561451 1.8784124 2.4345575 RS1 4
# 5 0.5280783 2.4345575 2.9626358 RS1 5
# 6 0.2165940 2.9626358 3.1792298 RS1 6
# 7 0.6805653 3.1792298 3.8597950 RS1 7
# 8 0.7460168 3.8597950 4.6058119 RS1 8
# 9 0.4642024 4.6058119 5.0700142 RS1 9
# 10 0.1374181 5.0700142 5.2074323 RS1 10
#
# [[2]]
# percent start end group id
# 1 0.81437961 0.0000000 0.8143796 RS2 1
# 2 0.83216126 0.8143796 1.6465409 RS2 2
# 3 0.35431008 1.6465409 2.0008510 RS2 3
# 4 0.97873284 2.0008510 2.9795838 RS2 4
# 5 0.98803502 2.9795838 3.9676188 RS2 5
# 6 0.63465900 3.9676188 4.6022778 RS2 6
# 7 0.94215238 4.6022778 5.5444302 RS2 7
# 8 0.01400746 5.5444302 5.5584377 RS2 8
# 9 0.02533159 5.5584377 5.5837692 RS2 9
# 10 0.48631865 5.5837692 6.0700879 RS2 10
#
# [[3]]
# percent start end group id
# 1 0.26785198 0.000000 0.267852 RS3 1
# 2 0.76407906 0.267852 1.031931 RS3 2
# 3 0.48805437 1.031931 1.519985 RS3 3
# 4 0.74689735 1.519985 2.266883 RS3 4
# 5 0.04256571 2.266883 2.309448 RS3 5
# 6 0.44064283 2.309448 2.750091 RS3 6
# 7 0.14606584 2.750091 2.896157 RS3 7
# 8 0.44194330 2.896157 3.338100 RS3 8
# 9 0.28411423 3.338100 3.622215 RS3 9
# 10 0.07362424 3.622215 3.695839 RS3 10
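If you prefer a function over a loop, the same four steps could be wrapped up and applied to every element of the split list. This is only a sketch: add_waterfall_cols is a hypothetical helper name, the toy df stands in for your real data, and the "RS" labels are assumed to follow the order of the list returned by split(). The last two lines are one possible replacement for the assign() loop.

```r
# Hypothetical helper (name is illustrative): adds the four waterfall columns
add_waterfall_cols <- function(d, label) {
  d$id    <- seq_along(d$percent)
  d$group <- rep(label, nrow(d))
  d$end   <- cumsum(d$percent)
  d$start <- c(0, head(d$end, -1))  # previous end, starting from 0
  d
}

# Toy data standing in for the real 416 x 8 data frame
set.seed(1)
df <- data.frame(sites = rep(c("R1", "R2", "R3"), each = 4),
                 percent = runif(12))

df_split <- split(df, df$sites)

# Assumption: labels "RS1", "RS2", ... follow the order of the list elements
labels <- paste0("RS", seq_along(df_split))
df_split <- Map(add_waterfall_cols, df_split, labels)

# Optional: create new_df1, new_df2, ... as separate objects
names(df_split) <- paste0("new_df", seq_along(df_split))
list2env(df_split, envir = globalenv())
```

Keeping everything in the list is usually easier to work with afterwards (for example, plotting with lapply), so the list2env step is only worth it if you really need separate objects.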
Upvotes: 1
Reputation: 350
So you have 1 df that's 416x8, and you want to create 9 dfs that are Nx4? As the comment says, it would be easier to help if we could see the original df and examples of your desired end result.
One idea is to use data.table or dplyr. You can compute a new data.table with the 4 columns you want, then split them.
Roughly...
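library(data.table)
dt <- setDT(df)  # converts df to a data.table by reference
# one id, cumulative end and start per row, computed within each site
dt.new <- dt[, .(id = seq_len(.N),
                 end = cumsum(percent),
                 start = c(0, head(cumsum(percent), -1))),
             by = sites]
# left out the "name" column for now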
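For the dplyr route, something along these lines might work. It is a rough sketch on toy data rather than your real df, and it assumes dplyr >= 1.0.0 (for cur_group_id()) and that the "RS" number should follow the sorted order of sites.

```r
library(dplyr)

# Toy data standing in for the real 416 x 8 data frame
set.seed(1)
df <- data.frame(sites = rep(c("R1", "R2", "R3"), each = 4),
                 percent = runif(12))

df_new <- df %>%
  group_by(sites) %>%
  mutate(
    id    = row_number(),                     # 1, 2, ... within each site
    group = paste0("RS", cur_group_id()),     # assumption: label by group index
    end   = cumsum(percent),
    start = lag(end, default = 0)             # previous end, 0 for the first row
  ) %>%
  ungroup()

# Split into one data frame per site only at the very end
df_split <- split(df_new, df_new$sites)
```

Doing all the column work on the single grouped data frame first means the split happens once, at the end, and there is nothing to repeat per site.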
Upvotes: 0