Speeding up dplyr pipe including checks with mutate_if and if_else on larger tables

Question

I wrote some code to performed oversampling, meaning that I replicate my observations in a data.frame and add noise to the replicates, so they are not exactly the same anymore. I'm quite happy that it works now as intended, but...it is too slow. I'm just learning dplyr and have no clue about data.table, but I hope there is a way to improve my function. I'm running this code in a function for 100s of data.frames which may contain about 10,000 columns and 400 rows.

This is some toy data:

library(tidyverse)

train_set1 <- rep(0, 300)
train_set2 <- rep("Factor1", 300)
train_set3 <- data.frame(replicate(1000, sample(0:1, 300, rep = TRUE)))
train_set <- cbind(train_set1, train_set2, train_set3)
row.names(train_set) <- c(paste("Sample", c(1:nrow(train_set)), sep = "_"))

This is the code to replicate each row a given number of times and a function to determine whether the added noise later will be positive or negative:

# replicate each row twice, added row.names contain a "."
train_oversampled <- train_set[rep(seq_len(nrow(train_set)), each = 3), ]

# create a flip function
flip <- function() {
  sample(c(-1,1), 1)
}

In the relevant "too slow" piece of code, I'm subsetting the row.names for the added "." to filter for the replicates. Than I select only the numeric columns. I go through those columns row by row and leave the values untouched if they are 0. If not, a certain amount is added (here +- 1 %). Later on, I combine this data set with the original data set and have my oversampled data.frame.

# add percentage of noise to non-zero values in numerical columns
noised_copies <- train_oversampled %>% 
  rownames_to_column(var = "rowname") %>%
  filter(grepl("\.", row.names(train_oversampled))) %>% 
  rowwise() %>%
  mutate_if(~ is.numeric(.), ~ if_else(. == 0, 0,. + (. * flip() * 0.01 ))) %>%
  ungroup() %>%
  column_to_rownames(var = "rowname")
# combine original and oversampled, noised data set
train_noised <- rbind(noised_copies, train_set)

I assume there are faster ways using e.g. data.table, but it was already tough work to get this code running and I have no idea how to improve its performance.

EDIT:

The solution is working perfectly fine with fixed values, but called within a for loop I receive "Error in paste(Sample, n, sep = ".") : object 'Sample' not found"

Code to replicate:

library(data.table)

train_set <- data.frame(
  x = c(rep(0, 10)), 
  y = c(0:9), 
  z = c(rep("Factor1", 10)))

# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = train_set, cc = train_set)

for(current_table in train_list) {
  setDT(current_table, keep.rownames="Sample")
  cols <- names(current_table)[sapply(current_table, is.numeric)]
  noised_copies <- lapply(c(1,2), function(n) {
    copy(current_table)[,
      c("Sample", cols) := c(.(paste(Sample, n, sep=".")), 
        .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
      .SDcols=cols]
  })
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
# As this is an example, I did not write anything to actually 
# store the results, so I have to remove the object
rm(train_noised)
}

Any ideas why the column Sample can't be found now?

chinsoon12 · Accepted Answer

Here is a more vectorized approach using data.table:

library(data.table)
setDT(train_set, keep.rownames="Sample")
cols <- names(train_set)[sapply(train_set, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
    copy(train_set)[,
        c("Sample", cols) := c(.(paste(Sample, n, sep=".")), 
            .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
        .SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)

With data.table version >= 1.12.9, you can pass is.numeric directly to .SDcols argument and maybe a shorter way (e.g. (.SD) or names(.SD)) to pass to the left hand side of :=

address OP's updated post:

The issue is that although each data.frame within the list is converted to a data.table, the train_list is not updated. You can update the list with a left bind before the for loop:

library(data.table)

train_set <- data.frame(
    x = c(rep(0, 10)), 
    y = c(0:9), 
    z = c(rep("Factor1", 10)))

# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = copy(train_set), cc = copy(train_set))

train_list <- lapply(train_list, setDT, keep.rownames="Sample")

for(current_table in train_list) {
    cols <- names(current_table)[sapply(current_table, is.numeric)]
    noised_copies <- lapply(c(1,2), function(n) {
        copy(current_table)[,
            c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
                .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
            .SDcols=cols]
    })
    train_noised <- rbindlist(c(noised_copies, train_list), use.names=FALSE)
    # As this is an example, I did not write anything to actually
    # store the results, so I have to remove the object
    rm(train_noised)
}

Speeding up dplyr pipe including checks with mutate_if and if_else on larger tables

Answers (1)

Related Questions