Reputation: 109

R data table: replace subset of row values across multiple columns using conditional with another column

This is my first post in stack overflow so forgive any mistakes. I'm also very new to R syntax and data tables.

Specifically for a data table, I want to conditionally test and replace row values across four columns in comparison with values in a fifth column. Example data is the following:

head(loadProfiles)
    load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
 1:   8469.231    2317.895        36700.00        220200.000   8808
 2:   8768.000    2609.524        36533.33         36533.333   8768
 3:   8744.000    3168.116        27325.00         10409.524   8744
 4:   7006.452    3810.526        24133.33          3620.000   8688
 5:   5794.595    4660.870        19490.91          2144.000   8576
 6:   6057.143    5888.889        16307.69          2208.333   8480
 7:   7036.667    7279.310        14073.33          2814.667   8444
 8:   8107.692    8107.692        14053.33          3634.483   8432
 9:   8138.462    9200.000        11755.56          3992.453   8464
10:   8173.077   10625.000        10119.05          4427.083   8500

What I would like to do is loop the following action over each of the first 4 columns, comparing each column to values in the fifth column.

loadProfiles[load_ev_ag >= maxICA, load_ev_ag := maxICA]

The result I want should look like the following:

head(loadProfiles)
    load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
 1:   8469.231    2317.895            8808              8808   8808
 2:   8768.000    2609.524            8768              8768   8768
 3:   8744.000    3168.116            8744              8744   8744
 4:   7006.452    3810.526            8688          3620.000   8688
 5:   5794.595    4660.870            8576          2144.000   8576
 6:   6057.143    5888.889            8480          2208.333   8480
 7:   7036.667    7279.310            8444          2814.667   8444
 8:   8107.692    8107.692            8432          3634.483   8432
 9:   8138.462        8464            8464          3992.453   8464
10:   8173.077        8500            8500          4427.083   8500

I've tried the following with no luck:

loadProfileNames <- colnames(loadProfiles)[1:4]
loadProfiles[i = (loadProfileNames) >= maxICA,j = (loadProfileNames) := maxICA]

This produces the following warning and also changes all values in the first four columns equal to values in the fifth column

Warning message:
In (loadProfileNames) >= maxICA :
  longer object length is not a multiple of shorter object length

I've also tried the following which changes the subset of x rows that meet the criteria i = (loadProfileNames) >= maxICA to the first x entries in maxICA rather than to the value in maxICA corresponding to row i in the subset of x rows

for(j in loadProfileNames) { set(loadProfiles,i=which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]),j=j,value=loadProfiles[["maxICA"]]) }

and produces the following warning

Warning messages:
1: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]),  :
  Supplied 288 items to be assigned to 24 items of column 'load_ev_ag' (264 unused)
2: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]),  :
  Supplied 288 items to be assigned to 108 items of column 'load_ev_res' (180 unused)
3: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]),  :
  Supplied 288 items to be assigned to 156 items of column 'load_ev_res_tou' (132 unused)
4: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]),  :
  Supplied 288 items to be assigned to 156 items of column 'load_ev_workplace' (132 unused)

I'm pretty much stuck at this point. Any guidance would be much appreciated.

Upvotes: 0

Answers (3)

Uwe

Reputation: 42544

A more "data.table-way" than using get()and eval() modifies loadProfiles by reference. It uses lapply(.SD, ...) together with .SDcols to identify the columns to operate on. pmin() is used instead of ifelse().

    cols_to_change <- stringr::str_subset(names(loadProfiles), "^load_ev")
    loadProfiles[, (cols_to_change) := lapply(.SD, function(x) pmin(x, maxICA)),
                 .SDcols = cols_to_change]
    loadProfiles
#    load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
# 1:   8469.231    2317.895            8808          8808.000   8808
# 2:   8768.000    2609.524            8768          8768.000   8768
# 3:   8744.000    3168.116            8744          8744.000   8744
# 4:   7006.452    3810.526            8688          3620.000   8688
# 5:   5794.595    4660.870            8576          2144.000   8576
# 6:   6057.143    5888.889            8480          2208.333   8480
# 7:   7036.667    7279.310            8444          2814.667   8444
# 8:   8107.692    8107.692            8432          3634.483   8432
# 9:   8138.462    8464.000            8464          3992.453   8464
#10:   8173.077    8500.000            8500          4427.083   8500

The above code could be rewritten to use the set() function:

for (j in cols_to_change) { 
  set(loadProfiles, ,j = j, value = pmin(loadProfiles[[j]], loadProfiles[["maxICA"]])) 
}

Benchmark

Inspired by Frank's comment I was wondering what the best approach is in terms of performance. For benchmarking, a data.table with 100000 rows is created by replicating the OP's data.

# create data.table with 100 000 rows
lp <- copy(loadProfiles0)
dummy <- lapply(1:4, function(x) lp <<- 
                  rbindlist(list(lp, lp, lp, lp, lp, lp, lp, lp, lp, lp)))
nrow(lp)
#100000

As all approaches modify the loadProfiles in place, we need to take a copy before each run. The copy operation is also benchmarked for comparison.

microbenchmark::microbenchmark(
  copy = loadProfiles <- copy(lp),
  chris = {
    loadProfiles <- copy(lp)
    for (i in cols_to_change) { 
      loadProfiles[get(i) >= maxICA, eval(i) := as.double(maxICA)]
    }
  },
  frank = {
    loadProfiles <- copy(lp)
    for (i in cols_to_change) { 
      loadProfiles[get(i) >= maxICA, (i) := as.double(maxICA)]
    }
  },
  uwe = {
    loadProfiles <- copy(lp)
    loadProfiles[, (cols_to_change) := lapply(.SD, function(x) pmin(x, maxICA)),
                 .SDcols = cols_to_change]
  },
  set = {
    loadProfiles <- copy(lp)
    for (j in cols_to_change) { 
      set(loadProfiles, , j = j, value = pmin(loadProfiles[[j]], loadProfiles[["maxICA"]])) 
    }
  }
)

Results:

#Unit: microseconds
#  expr      min        lq      mean    median        uq        max neval
#  copy  592.427  1007.012  1170.425  1111.224  1238.281   3977.826   100
# chris 8525.045 10614.394 12704.450 11499.447 12152.475 140577.520   100
# frank 4972.000  6799.118  8566.945  7339.060  7819.344 133202.589   100
#   uwe 4201.354  6297.689  6711.409  6585.595  6914.846  10546.996   100
#   set 3716.539  5580.662  7138.738  5907.836  6264.840 127311.557   100

Frank's suggestion to remove eval() from christoph's solution has gained a remarkable speed increase. However, the other two solutions are still faster with set slightly ahead.

Data

loadProfiles0 <- fread("load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
         8469.231    2317.895        36700.00        220200.000   8808
         8768.000    2609.524        36533.33         36533.333   8768
         8744.000    3168.116        27325.00         10409.524   8744
         7006.452    3810.526        24133.33          3620.000   8688
         5794.595    4660.870        19490.91          2144.000   8576
         6057.143    5888.889        16307.69          2208.333   8480
         7036.667    7279.310        14073.33          2814.667   8444
         8107.692    8107.692        14053.33          3634.483   8432
         8138.462    9200.000        11755.56          3992.453   8464
         8173.077   10625.000        10119.05          4427.083   8500")

Upvotes: 3

setempler

Reputation: 1751

You could also solve this with lapply and ifelse, even valid for data.frames:

loadProfiles[loadProfileNames] <- lapply(loadProfiles[loadProfileNames],
  function (i) ifelse (i >= loadProfiles$maxICA, loadProfiles$maxICA, i))

And for data.tables, the .SD variable is a good resource:

loadProfile[, lapply(.SD, function(i) ifelse(i >= maxICA, maxICA, i)), .SDcols = loadProfileNames]

Upvotes: 0

maccruiskeen

Reputation: 2808

Your first attempt was almost right:

profilenames <- names(loadProfiles)[1:4]
for (i in profilenames) { 
  loadProfiles[get(i) >= maxICA, eval(i) := as.double(maxICA)]
}

Upvotes: 1

R data table: replace subset of row values across multiple columns using conditional with another column

Answers (3)

Benchmark

Data

Related Questions