Reputation: 85
I have a dataset with 79003 rows and 97 columns. My dataset looks like as follows:
col1 col2 1 2 3 4 5 6 7 8
str_11 str_44 0.2064191 0 0.6061358 0.92798677 2.7899374 1.098612 1.395511 0.000000
str_11 str_22 0.9044563 0 1.7917595 0.00000000 1.1412787 1.504077 1.008228 0.000000
str_11 str_18 0.8266786 0 0.5389965 0.81676114 0.2787134 0.000000 3.295837 0.000000
str_11 str_1 0.8176492 0 5.0673306 4.45461768 0.8664189 6.549293 1.686399 2.079442
I am trying to iterate through each row and each column. I want to calculate the minimum and maximum value of column-wise and do the following calculation:
for (i in 1:nrow(log_trans2)){
for (j in 3:ncol(log_trans2)){
log_trans2[i, j] = log_trans2[i, ..j] -
min(log_trans2[i, 3:ncol(log_trans2)]) /
(max(log_trans2[i, 3:ncol(log_trans2)]) - min(log_trans2[i, 3:ncol(log_trans2)]))
}}
I added ..j after getting the error as
"Error in
[.data.table
(log_trans2, i, j) : j (the 2nd argument inside [...]) is a single symbol but column name 'j' is not found. Perhaps you intended DT[, ..j]. This difference to data.frame is deliberate and explained in FAQ 1.1
.
but it took more execution (like hours) . How do I reduce the timing with foreach or apply function?
The formula:
=(r-min(col))/(max(col)-min(col))
The expected outcome would be
col1 col2 1 2 3 4 5 6 7 8
Str_11 Str_44 0.029847796820572 0 0.080259104746805 0.11295123566895 0.405795371744574 0.138441206009843 0.167481921848205 0
Str_11 Str_22 0.130782597207229 0 0.237248831160936 0 0.165998575836442 0.189535736027761 0.121002270272179 0
Str_11 Str_18 0.119536094709514 0 0.071369116220582 0.099413248590107 0.040538762756246 0 0.39554907557078 0
Str_11 Str_1 0.118230460268521 0 0.670970792433184 0.54220015449667 0.126020313332003 0.825306647460321 0.202392768285567 0.251126401405454
Upvotes: 0
Views: 443
Reputation: 102261
Maybe you can use the code below with base R
.
The idea is to first get the range of each columns, and then treat the columns as a matrix for manipulation. I believe it would be much faster than using for
loops (benefits from matrix treatment),i.e.,
r <- apply(d,2,range)
df[-c(1:2)] <- t((t(df[-c(1:2)]) - r[1,])/as.vector(diff(r)))
such that
> df
ol1 col2 1 2 3 4 5 6 7 8
1 str_11 str_44 0.0000000 NaN 0.01482649 0.2083202 1.0000000 0.1677451 0.1692960 0
2 str_11 str_22 1.0000000 NaN 0.27664986 0.0000000 0.3434840 0.2296549 0.0000000 0
3 str_11 str_18 0.8885766 NaN 0.00000000 0.1833516 0.0000000 0.0000000 1.0000000 0
4 str_11 str_1 0.8756412 NaN 1.00000000 1.0000000 0.2340315 1.0000000 0.2964541 1
DATA
> dput(df)
structure(list(ol1 = structure(c(1L, 1L, 1L, 1L), .Label = "str_11", class = "factor"),
col2 = structure(4:1, .Label = c("str_1", "str_18", "str_22",
"str_44"), class = "factor"), `1` = c(0.2064191, 0.9044563,
0.8266786, 0.8176492), `2` = c(0L, 0L, 0L, 0L), `3` = c(0.6061358,
1.7917595, 0.5389965, 5.0673306), `4` = c(0.92798677, 0,
0.81676114, 4.45461768), `5` = c(2.7899374, 1.1412787, 0.2787134,
0.8664189), `6` = c(1.098612, 1.504077, 0, 6.549293), `7` = c(1.395511,
1.008228, 3.295837, 1.686399), `8` = c(0, 0, 0, 2.079442)), class = "data.frame", row.names = c(NA,
-4L))
Upvotes: 2
Reputation: 389135
Here is one way to do this avoiding loops :
#Exclude columns which are not required for calculation
temp <- as.matrix(df[, -c(1:2)])
#Get column-wise minimum
min_vals <- matrixStats::colMins(temp)
#Get column-wise maximum
max_vals <- matrixStats::colMaxs(temp)
#Subtract minimum value of column from each element
s1 <- sweep(temp, 2, min_vals, `-`)
#Divide it by max - min
sweep(s1, 2, (max_vals - min_vals), `/`)
Upvotes: 2