Reputation: 1070
I would like to perform some computation on a large dataframe. To do so I need to
How can I do that?
Many thanks in advance for your help!
dataframe <- MyDataFrame
nb_obs <- nrow(dataframe) # in my dataframe I have 153 036 rows
nb_chunk <- ceiling(dataframe / 250) # I thus need 613 chunks if I want 250 obs per sub-dataframe
for(i in 1:nb_chunk) {
# my computation here, I want to add a new columns to the chunk to store my results..
}
# then I want to recombine the final dataset (equals to original dataset with a new column added)
EDITED PART
Thank you for your comments, please find below my proposal in a reproducible example using iris dataset.
I have 2 additional questions at this stage:
Is there a better way to proceed with dplyr?
df <- iris # using iris as an example (my real dataframe is 153036 rows and 17 columns)
nb_obs <- nrow(df) # nb of observations in the dataframe (thus nb of operations to be performed)
nb_obs_in_chunk <- 13 # nb of rows per chunk
nb_chunk <- ceiling(nb_obs / nb_obs_in_chunk) # total nb of chunks to be created
nb_chunk_full <- floor(nb_obs / nb_obs_in_chunk) # nb of chunks to be created with nb_obs_in_chunk rows
nb_obs_last_chunk <- nb_obs - nb_obs_in_chunk*nb_chunk_full # nb of rows in final chunks
df$split.factor <- as.factor(c(rep(1:nb_chunk_full, each = nb_obs_in_chunk), rep(nb_chunk_full + 1, nb_obs_last_chunk))) # create factor to split dataframe into equal parts
final.df <- data.frame(Sepal.Length = numeric(), Sepal.Width = numeric(), Petal.Length = numeric(), Petal.Width = numeric(), Species = factor(), split.factor = factor()) # initiate final dataframe (desired output)
for(i in 1:nb_chunk) {
temp_i <- df[df$split.factor == i, ]
temp_i$NEW <- temp_i$Sepal.Length + temp_i$Sepal.Width
final.df <- rbind(final.df, temp_i)
}
Upvotes: 1
Views: 1646
Reputation: 226557
I think this does it:
library("dplyr")
ddf <- iris
nb_obs_in_chunk <- 13
ddf %>% mutate(id=seq(nrow(ddf)),
chunk=cut(id,id %/% nb_obs_in_chunk)) %>%
group_by(chunk) %>%
mutate(NEW=Sepal.Length+Sepal.Width)
but haven't tested yet ... @mjkallen's rep(seq(13), length.out = 150, each = 13))
might be better than my %/%
-based solution to getting the chunk numbers.
Upvotes: 2
Reputation: 478
To answer your initial questions:
df <- iris
do.call(rbind,
lapply(split(df, rep(seq(13), length.out = 150, each = 13)),
function(chunk) {
chunk$NEW = chunk$Sepal.Length + chunk$Sepal.Width
chunk
})
)
rep(seq(13), length.out = 150, each = 13)
gives you the same as the split.factor
column in your example and the split()
function turns this into a factor.
This also answers the first of the questions in your edit: yes, you don't need to initialize the final data frame.
Note that the row names are no longer a sequence from 1 to 150 as they now contain the chunk number.
Upvotes: 5