Reputation: 163
I have a large matrix (thousands of rows and hundreds of columns) that I'd like to normalize column-wise to the range -1 to 1. This is the code I wrote:
normalize <- function(x) {
  for(j in 1:length(x[1,])){
    print(j)
    min <- min(x[,j])
    max <- max(x[,j])
    for(i in 1:length(x[,j])){
      x[i,j] <- 2 * (x[i,j] - min)/( max - min) - 1
    }
  }
  return(x)
}
Unfortunately it's way too slow. I've seen this:
normalize <- function(x) {
x <- sweep(x, 2, apply(x, 2, min))
sweep(x, 2, apply(x, 2, max), "/")
}
It's fast, but it normalizes between 0 and 1. Could you please help me modify it for my purpose? I'm sorry, I'm just beginning to learn R.
Upvotes: 2
Views: 5281
Reputation: 132706
Benchmarks:
normalize2 <- function(A) {
scale(A,center=TRUE,scale=apply(A,2,function(x) 0.5*(max(x)-min(x))))
}
normalize3 <- function(mat) {
apply(mat,2,function(x) {xmin <- min(x); 2*(x-xmin)/(max(x)-xmin)-1})
}
normalize4 <- function(x) {
aa <- colMeans(x)
x <- sweep(x, 2, aa) # subtract the mean from each column
2* sweep(x, 2, apply(x, 2, function(y) max(y)-min(y)), "/")
}
set.seed(42)
mat <- matrix(sample(1:10,1e5,TRUE),1e3)
erg2 <- normalize2(mat)
attributes(erg2) <- attributes(normalize3(mat))
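# Note: all.equal() compares only its first two arguments; a third
# positional argument is matched to 'tolerance', not compared.
all.equal(
erg2,
normalize3(mat),
normalize4(mat)
)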
[1] TRUE
library(microbenchmark)
microbenchmark(normalize4(mat),normalize3(mat),normalize2(mat))
Unit: milliseconds
expr min lq median uq max
1 normalize2(mat) 4.846551 5.486845 5.597799 5.861976 30.46634
2 normalize3(mat) 4.191677 4.862655 4.980571 5.153438 28.94257
3 normalize4(mat) 4.960790 5.648666 5.766207 5.972404 30.08334
set.seed(42)
mat <- matrix(sample(1:10,1e4,TRUE),10)
microbenchmark(normalize4(mat),normalize3(mat),normalize2(mat))
Unit: milliseconds
expr min lq median uq max
1 normalize2(mat) 4.319131 4.445384 4.556756 4.821512 9.116263
2 normalize3(mat) 5.743305 5.927829 6.098392 6.454875 13.439526
3 normalize4(mat) 3.955712 4.102306 4.175394 4.402710 5.773221
The apply solution is slightly faster when there are fewer columns (the 1000 x 100 matrix) and slightly slower when there are many (the 10 x 1000 matrix). Overall, the three approaches perform within the same order of magnitude.
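Before comparing timings, it can help to check what each variant actually produces. normalize3 shifts by the column minimum, while normalize2 and normalize4 centre on the column mean, so the results should coincide only when a column's mean equals its midrange. A minimal sketch on deliberately skewed toy data follows; the names normalize_minmax, normalize_mean, col_ranges and skewed are made up for illustration.

```r
# Min-max rescaling of each column to [-1, 1] (same formula as normalize3).
normalize_minmax <- function(mat) {
  apply(mat, 2, function(x) {
    xmin <- min(x)
    2 * (x - xmin) / (max(x) - xmin) - 1
  })
}

# Mean-centred variant (same formula as normalize4).
normalize_mean <- function(x) {
  x <- sweep(x, 2, colMeans(x))
  2 * sweep(x, 2, apply(x, 2, function(y) max(y) - min(y)), "/")
}

# Hypothetical helper: per-column minimum and maximum as a 2 x ncol matrix.
col_ranges <- function(m) apply(m, 2, range)

# Skewed toy data, so each column mean differs from the midrange.
skewed <- cbind(c(0, 0, 0, 10), c(1, 2, 3, 10))

col_ranges(normalize_minmax(skewed))  # every column spans -1 to 1
col_ranges(normalize_mean(skewed))    # width 2, but shifted away from [-1, 1]

# Compare results pairwise: all.equal() compares exactly two objects.
isTRUE(all.equal(normalize_minmax(skewed), normalize_mean(skewed)))
```

On the benchmark data above, which is drawn uniformly from 1:10, the column means are close to the midrange, so the variants differ only slightly there.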
Upvotes: 4
Reputation: 121568
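This rescales each column by subtracting its mean and dividing by half its range. The result spans exactly -1 to 1 when the column mean coincides with the midrange, as in the example below.
normalize <- function(x) {
  x <- sweep(x, 2, apply(x, 2, mean)) # subtract the mean from each column
  2 * sweep(x, 2, apply(x, 2, function(y) max(y) - min(y)), "/")
}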
Edit: using colMeans, as suggested in the comments, is of course faster:
normalize <- function(x) {
  aa <- colMeans(x)
  x <- sweep(x, 2, aa) # subtract the mean from each column
  2 * sweep(x, 2, apply(x, 2, function(y) max(y) - min(y)), "/")
}
A <- matrix(1:24, ncol=3)
> normalize(A)
[,1] [,2] [,3]
[1,] -1.0000000 -1.0000000 -1.0000000
[2,] -0.7142857 -0.7142857 -0.7142857
[3,] -0.4285714 -0.4285714 -0.4285714
[4,] -0.1428571 -0.1428571 -0.1428571
[5,] 0.1428571 0.1428571 0.1428571
[6,] 0.4285714 0.4285714 0.4285714
[7,] 0.7142857 0.7142857 0.7142857
[8,] 1.0000000 1.0000000 1.0000000
Edit: the same result using the scale function from base R:
scale(A,center=TRUE,scale=apply(A,2,function(x) 0.5*(max(x)-min(x))))
[,1] [,2] [,3]
[1,] -1.0000000 -1.0000000 -1.0000000
[2,] -0.7142857 -0.7142857 -0.7142857
[3,] -0.4285714 -0.4285714 -0.4285714
[4,] -0.1428571 -0.1428571 -0.1428571
[5,] 0.1428571 0.1428571 0.1428571
[6,] 0.4285714 0.4285714 0.4285714
[7,] 0.7142857 0.7142857 0.7142857
[8,] 1.0000000 1.0000000 1.0000000
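With center=TRUE, scale() subtracts the column mean, which matches the -1 to 1 target only when the mean sits at the midrange (as it does for A). A possible variation, sketched here under the assumption that an exact -1 to 1 range is wanted for any numeric matrix, passes the midrange as the centre instead. The names normalize_scale and B are hypothetical.

```r
# Rescale each column of a numeric matrix to [-1, 1] with scale(),
# centring on the midrange (max + min) / 2 rather than on the mean.
normalize_scale <- function(A) {
  col_min <- apply(A, 2, min)
  col_max <- apply(A, 2, max)
  scale(A,
        center = (col_max + col_min) / 2,
        scale  = (col_max - col_min) / 2)
}

B <- cbind(c(0, 0, 0, 10), c(1, 2, 3, 10))  # skewed toy data
res <- normalize_scale(B)
apply(res, 2, range)  # each column runs from -1 to 1
```

For A <- matrix(1:24, ncol=3) this gives the same values as the output shown above, since each column of A is symmetric around its mean.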
Upvotes: 2
Reputation: 3965
How about rescaling the matrix x at the end of your own function?
normalize <- function(x) {
x <- sweep(x, 2, apply(x, 2, min))
x <- sweep(x, 2, apply(x, 2, max), "/")
2*x - 1
}
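The same shift, divide, stretch pattern extends to any target interval. The sketch below is one way to write it, not part of the original answer: the function name rescale_cols and its lo/hi parameters are made up, and sending constant columns (where max equals min and the division would yield NaN) to the midpoint of the interval is an assumption rather than a convention.

```r
# Sketch: column-wise min-max rescaling to an arbitrary interval [lo, hi].
rescale_cols <- function(x, lo = -1, hi = 1) {
  col_min <- apply(x, 2, min)
  span <- apply(x, 2, max) - col_min
  constant <- span == 0
  span[constant] <- 1              # avoid 0/0 for constant columns
  x <- sweep(x, 2, col_min)        # shift each column to start at 0
  x <- sweep(x, 2, span, "/")      # scale each column to [0, 1]
  out <- lo + (hi - lo) * x        # stretch to [lo, hi]
  out[, constant] <- (lo + hi) / 2 # assumption: constants go to the midpoint
  out
}

m <- cbind(c(1, 2, 3, 4), c(5, 5, 5, 5), c(10, 0, 5, 20))
rescale_cols(m)          # columns 1 and 3 span -1 to 1; column 2 is all 0
rescale_cols(m, 0, 1)    # same data rescaled to [0, 1]
```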
Upvotes: 4
Reputation: 17412
How about just:
x[,1] <- (x[,1]-mean(x[,1]))/(max(x[,1])-min(x[,1]))
Most basic functions in R are vectorized, so there's no need for a for loop in your code. This snippet rescales all of column 1 at once, centering it on its mean and dividing by its range (you could also use the function scale(), although it has no option for min/max values).
To rescale an entire dataset, you can do something like this:
Scale <- function(y) (y - mean(y)) / (max(y) - min(y))
DataFrame.Scaled <- apply(DataFrame, 2, Scale)
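One detail worth knowing: apply() converts a data frame to a matrix, so the result above is a matrix. If a data frame is wanted back, assigning into df[] with lapply() is a common idiom. The sketch below assumes every column is numeric and swaps in a min-max formula that targets the -1 to 1 range from the question; df, its column names and scale_pm1 are hypothetical.

```r
# Hypothetical example data; the column names are made up.
df <- data.frame(a = c(1, 2, 3, 10), b = c(0, 5, 5, 20))

# Min-max rescaling of one numeric vector to [-1, 1].
scale_pm1 <- function(y) 2 * (y - min(y)) / (max(y) - min(y)) - 1

as_matrix <- apply(df, 2, scale_pm1)   # apply() returns a matrix
df_scaled <- df
df_scaled[] <- lapply(df, scale_pm1)   # keeps the data.frame class and names

is.matrix(as_matrix)
is.data.frame(df_scaled)
sapply(df_scaled, range)               # per-column minimum and maximum
```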
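Edit: It's also worth pointing out that it's best not to name a variable after a function. After min <- min(x), the name min also refers to a number in that scope. R can still find the function when you call min(...), because it skips non-function objects when looking up a call, but the code becomes harder to read and easier to get wrong.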
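A small sketch of what happens when a variable shares a name with a function; f and x are made-up names for illustration.

```r
x <- c(3, 1, 4, 1, 5)

f <- function(v) {
  min <- min(v)      # 'min' is now also a local numeric variable
  max <- max(v)
  # Calls still resolve to the functions, because R skips non-function
  # objects when it looks up the name in a call such as min(...).
  c(is.function(min), min(v) == min, max(v) == max)
}

f(x)  # FALSE TRUE TRUE
```

Using distinct names such as xmin and xmax avoids the ambiguity altogether.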
Upvotes: 1