Reputation: 109
This is my first post in stack overflow so forgive any mistakes. I'm also very new to R syntax and data tables.
Specifically for a data table, I want to conditionally test and replace row values across four columns in comparison with values in a fifth column. Example data is the following:
head(loadProfiles)
load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
1: 8469.231 2317.895 36700.00 220200.000 8808
2: 8768.000 2609.524 36533.33 36533.333 8768
3: 8744.000 3168.116 27325.00 10409.524 8744
4: 7006.452 3810.526 24133.33 3620.000 8688
5: 5794.595 4660.870 19490.91 2144.000 8576
6: 6057.143 5888.889 16307.69 2208.333 8480
7: 7036.667 7279.310 14073.33 2814.667 8444
8: 8107.692 8107.692 14053.33 3634.483 8432
9: 8138.462 9200.000 11755.56 3992.453 8464
10: 8173.077 10625.000 10119.05 4427.083 8500
What I would like to do is loop the following action over each of the first 4 columns, comparing each column to values in the fifth column.
loadProfiles[load_ev_ag >= maxICA, load_ev_ag := maxICA]
The result I want should look like the following:
head(loadProfiles)
load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
1: 8469.231 2317.895 8808 8808 8808
2: 8768.000 2609.524 8768 8768 8768
3: 8744.000 3168.116 8744 8744 8744
4: 7006.452 3810.526 8688 3620.000 8688
5: 5794.595 4660.870 8576 2144.000 8576
6: 6057.143 5888.889 8480 2208.333 8480
7: 7036.667 7279.310 8444 2814.667 8444
8: 8107.692 8107.692 8432 3634.483 8432
9: 8138.462 8464 8464 3992.453 8464
10: 8173.077 8500 8500 4427.083 8500
I've tried the following with no luck:
loadProfileNames <- colnames(loadProfiles)[1:4]
loadProfiles[i = (loadProfileNames) >= maxICA,j = (loadProfileNames) := maxICA]
This produces the following warning and also changes all values in the first four columns equal to values in the fifth column
Warning message:
In (loadProfileNames) >= maxICA :
longer object length is not a multiple of shorter object length
I've also tried the following which changes the subset of x rows that meet the criteria i = (loadProfileNames) >= maxICA
to the first x entries in maxICA rather than to the value in maxICA corresponding to row i in the subset of x rows
for(j in loadProfileNames) { set(loadProfiles,i=which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]),j=j,value=loadProfiles[["maxICA"]]) }
and produces the following warning
Warning messages:
1: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]), :
Supplied 288 items to be assigned to 24 items of column 'load_ev_ag' (264 unused)
2: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]), :
Supplied 288 items to be assigned to 108 items of column 'load_ev_res' (180 unused)
3: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]), :
Supplied 288 items to be assigned to 156 items of column 'load_ev_res_tou' (132 unused)
4: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]), :
Supplied 288 items to be assigned to 156 items of column 'load_ev_workplace' (132 unused)
I'm pretty much stuck at this point. Any guidance would be much appreciated.
Upvotes: 0
Views: 1452
Reputation: 42544
A more "data.table
-way" than using get()
and eval()
modifies loadProfiles
by reference. It uses lapply(.SD, ...)
together with .SDcols
to identify the columns to operate on. pmin()
is used instead of ifelse()
.
cols_to_change <- stringr::str_subset(names(loadProfiles), "^load_ev")
loadProfiles[, (cols_to_change) := lapply(.SD, function(x) pmin(x, maxICA)),
.SDcols = cols_to_change]
loadProfiles
# load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
# 1: 8469.231 2317.895 8808 8808.000 8808
# 2: 8768.000 2609.524 8768 8768.000 8768
# 3: 8744.000 3168.116 8744 8744.000 8744
# 4: 7006.452 3810.526 8688 3620.000 8688
# 5: 5794.595 4660.870 8576 2144.000 8576
# 6: 6057.143 5888.889 8480 2208.333 8480
# 7: 7036.667 7279.310 8444 2814.667 8444
# 8: 8107.692 8107.692 8432 3634.483 8432
# 9: 8138.462 8464.000 8464 3992.453 8464
#10: 8173.077 8500.000 8500 4427.083 8500
The above code could be rewritten to use the set()
function:
for (j in cols_to_change) {
set(loadProfiles, ,j = j, value = pmin(loadProfiles[[j]], loadProfiles[["maxICA"]]))
}
Inspired by Frank's comment I was wondering what the best approach is in terms of performance. For benchmarking, a data.table with 100000 rows is created by replicating the OP's data.
# create data.table with 100 000 rows
lp <- copy(loadProfiles0)
dummy <- lapply(1:4, function(x) lp <<-
rbindlist(list(lp, lp, lp, lp, lp, lp, lp, lp, lp, lp)))
nrow(lp)
#100000
As all approaches modify the loadProfiles
in place, we need to take a copy before each run. The copy operation is also benchmarked for comparison.
microbenchmark::microbenchmark(
copy = loadProfiles <- copy(lp),
chris = {
loadProfiles <- copy(lp)
for (i in cols_to_change) {
loadProfiles[get(i) >= maxICA, eval(i) := as.double(maxICA)]
}
},
frank = {
loadProfiles <- copy(lp)
for (i in cols_to_change) {
loadProfiles[get(i) >= maxICA, (i) := as.double(maxICA)]
}
},
uwe = {
loadProfiles <- copy(lp)
loadProfiles[, (cols_to_change) := lapply(.SD, function(x) pmin(x, maxICA)),
.SDcols = cols_to_change]
},
set = {
loadProfiles <- copy(lp)
for (j in cols_to_change) {
set(loadProfiles, , j = j, value = pmin(loadProfiles[[j]], loadProfiles[["maxICA"]]))
}
}
)
Results:
#Unit: microseconds
# expr min lq mean median uq max neval
# copy 592.427 1007.012 1170.425 1111.224 1238.281 3977.826 100
# chris 8525.045 10614.394 12704.450 11499.447 12152.475 140577.520 100
# frank 4972.000 6799.118 8566.945 7339.060 7819.344 133202.589 100
# uwe 4201.354 6297.689 6711.409 6585.595 6914.846 10546.996 100
# set 3716.539 5580.662 7138.738 5907.836 6264.840 127311.557 100
Frank's suggestion to remove eval()
from christoph's solution has gained a remarkable speed increase. However, the other two solutions are still faster with set
slightly ahead.
loadProfiles0 <- fread("load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
8469.231 2317.895 36700.00 220200.000 8808
8768.000 2609.524 36533.33 36533.333 8768
8744.000 3168.116 27325.00 10409.524 8744
7006.452 3810.526 24133.33 3620.000 8688
5794.595 4660.870 19490.91 2144.000 8576
6057.143 5888.889 16307.69 2208.333 8480
7036.667 7279.310 14073.33 2814.667 8444
8107.692 8107.692 14053.33 3634.483 8432
8138.462 9200.000 11755.56 3992.453 8464
8173.077 10625.000 10119.05 4427.083 8500")
Upvotes: 3
Reputation: 1751
You could also solve this with lapply
and ifelse
, even valid for data.frames
:
loadProfiles[loadProfileNames] <- lapply(loadProfiles[loadProfileNames],
function (i) ifelse (i >= loadProfiles$maxICA, loadProfiles$maxICA, i))
And for data.tables
, the .SD
variable is a good resource:
loadProfile[, lapply(.SD, function(i) ifelse(i >= maxICA, maxICA, i)), .SDcols = loadProfileNames]
Upvotes: 0
Reputation: 2808
Your first attempt was almost right:
profilenames <- names(loadProfiles)[1:4]
for (i in profilenames) {
loadProfiles[get(i) >= maxICA, eval(i) := as.double(maxICA)]
}
Upvotes: 1