endamaco

Reputation: 163

Normalizing columns of matrix between -1 and 1

I have a large matrix (thousands of rows and hundreds of columns) which I'd like to normalize column-wise between -1 and 1. This is the code I wrote:

normalize <- function(x) { 
    for(j in 1:length(x[1,])){
        print(j)
        min <- min(x[,j])
        max <- max(x[,j])
        for(i in 1:length(x[,j])){
            x[i,j] <- 2 * (x[i,j] - min)/( max - min) - 1
        }
    }
    return(x)
}

Unfortunately it's way too slow. I've seen this:

normalize <- function(x) { 
    x <- sweep(x, 2, apply(x, 2, min)) 
    sweep(x, 2, apply(x, 2, max), "/") 
}

It's fast, but it normalizes between 0 and 1. Could you please help me modify it for my purpose? I'm sorry, but I'm just beginning to learn R.

Upvotes: 2

Views: 5281

Answers (4)

Roland

Reputation: 132706

Benchmarks:

normalize2 <- function(A) { 
  scale(A,center=TRUE,scale=apply(A,2,function(x) 0.5*(max(x)-min(x))))
}

normalize3 <- function(mat) { 
  apply(mat,2,function(x) {xmin <- min(x); 2*(x-xmin)/(max(x)-xmin)-1})
}

normalize4 <- function(x) { 
  aa <- colMeans(x)
  x <- sweep(x, 2, aa)           # subtract the mean from each column

  2* sweep(x, 2, apply(x, 2, function(y) max(y)-min(y)), "/") 
}


set.seed(42)
mat <- matrix(sample(1:10,1e5,TRUE),1e3)
erg2 <- normalize2(mat)
attributes(erg2) <- attributes(normalize3(mat))
all.equal(  
  erg2,  
  normalize3(mat),   
  normalize4(mat)
  )

[1] TRUE
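
One caveat about the check above: `all.equal(target, current, ...)` compares only its first two arguments, and for numeric input a third positional argument is matched to `tolerance`, so `normalize4(mat)` is not actually compared there. Below is a minimal sketch of a pairwise check. The helper `all_equal_to_first` and the matrices `sym` and `skew` are hypothetical names introduced here; the two function bodies restate normalize3 and normalize4 from the benchmark.

```r
# Min-max rescaling to [-1, 1], as in normalize3 above
normalize3 <- function(mat) {
  apply(mat, 2, function(x) { xmin <- min(x); 2 * (x - xmin) / (max(x) - xmin) - 1 })
}

# Mean-centred rescaling, as in normalize4 above
normalize4 <- function(x) {
  x <- sweep(x, 2, colMeans(x))
  2 * sweep(x, 2, apply(x, 2, function(y) max(y) - min(y)), "/")
}

# Hypothetical helper: TRUE only if every matrix in the list matches the first one
all_equal_to_first <- function(mats) {
  all(vapply(mats[-1], function(m) {
    isTRUE(all.equal(mats[[1]], m, check.attributes = FALSE))
  }, logical(1)))
}

sym  <- matrix(1:24, ncol = 3)                   # column means sit halfway between min and max
skew <- matrix(c(0, 0, 10, 1, 1, 5), ncol = 2)   # made-up skewed columns

all_equal_to_first(list(normalize3(sym),  normalize4(sym)))   # TRUE
all_equal_to_first(list(normalize3(skew), normalize4(skew)))  # FALSE
```

On skewed columns the mean-centred version no longer agrees with the min-max version, so the two are interchangeable only when each column's mean lies at the midpoint of its range.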

library(microbenchmark)
microbenchmark(normalize4(mat),normalize3(mat),normalize2(mat))

Unit: milliseconds
             expr      min       lq   median       uq      max
1 normalize2(mat) 4.846551 5.486845 5.597799 5.861976 30.46634
2 normalize3(mat) 4.191677 4.862655 4.980571 5.153438 28.94257
3 normalize4(mat) 4.960790 5.648666 5.766207 5.972404 30.08334

set.seed(42)
mat <- matrix(sample(1:10,1e4,TRUE),10)
microbenchmark(normalize4(mat),normalize3(mat),normalize2(mat))

Unit: milliseconds
             expr      min       lq   median       uq       max
1 normalize2(mat) 4.319131 4.445384 4.556756 4.821512  9.116263
2 normalize3(mat) 5.743305 5.927829 6.098392 6.454875 13.439526
3 normalize4(mat) 3.955712 4.102306 4.175394 4.402710  5.773221

The apply solution (normalize3) is slightly faster on the first matrix (1,000 rows, 100 columns) but slightly slower on the second (10 rows, 1,000 columns). Overall, the three versions perform within the same order of magnitude.
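
For reference, the shapes of the two benchmark matrices can be checked directly. The names `tall` and `wide` are introduced here for illustration only. Since `apply(mat, 2, f)` calls `f` once per column, its per-call overhead grows with the number of columns, which is one plausible reason the timings shift between the two runs.

```r
set.seed(42)
tall <- matrix(sample(1:10, 1e5, TRUE), 1e3)  # first benchmark matrix
wide <- matrix(sample(1:10, 1e4, TRUE), 10)   # second benchmark matrix

dim(tall)  # 1000 rows, 100 columns
dim(wide)  # 10 rows, 1000 columns
```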

Upvotes: 4

agstudy

Reputation: 121568

This rescales the matrix using the same sweep-based approach:

normalize <- function(x) { 
  x <- sweep(x, 2, apply(x, 2, mean))           # subtract the mean from each column
  2* sweep(x, 2, apply(x, 2, function(y) max(y)-min(y)), "/") 
}

Edit

Using colMeans, as suggested in the comments, is faster:

normalize <- function(x) { 
  aa <- colMeans(x)
  x <- sweep(x, 2, aa)           # subtract the mean from each column

  2* sweep(x, 2, apply(x, 2, function(y) max(y)-min(y)), "/") 
}
A <- matrix(1:24, ncol=3)

> normalize(A)
           [,1]       [,2]       [,3]
[1,] -1.0000000 -1.0000000 -1.0000000
[2,] -0.7142857 -0.7142857 -0.7142857
[3,] -0.4285714 -0.4285714 -0.4285714
[4,] -0.1428571 -0.1428571 -0.1428571
[5,]  0.1428571  0.1428571  0.1428571
[6,]  0.4285714  0.4285714  0.4285714
[7,]  0.7142857  0.7142857  0.7142857
[8,]  1.0000000  1.0000000  1.0000000

EDIT with the scale function of the base package

scale(A,center=TRUE,scale=apply(A,2,function(x) 0.5*(max(x)-min(x))))
           [,1]       [,2]       [,3]
[1,] -1.0000000 -1.0000000 -1.0000000
[2,] -0.7142857 -0.7142857 -0.7142857
[3,] -0.4285714 -0.4285714 -0.4285714
[4,] -0.1428571 -0.1428571 -0.1428571
[5,]  0.1428571  0.1428571  0.1428571
[6,]  0.4285714  0.4285714  0.4285714
[7,]  0.7142857  0.7142857  0.7142857
[8,]  1.0000000  1.0000000  1.0000000
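
Note that `center=TRUE` subtracts the column mean, so the result reaches exactly -1 and 1 only when the mean lies halfway between the column's minimum and maximum, as it does for `1:24`. A minimal sketch of a variant that centres on the midrange instead, assuming a hypothetical helper `normalize_midrange` and a made-up skewed matrix `B`:

```r
# Hypothetical helper: centre each column on (min + max) / 2, then divide by half the range
normalize_midrange <- function(A) {
  lo <- apply(A, 2, min)
  hi <- apply(A, 2, max)
  out <- scale(A, center = (lo + hi) / 2, scale = (hi - lo) / 2)
  # drop the bookkeeping attributes that scale() attaches
  attr(out, "scaled:center") <- NULL
  attr(out, "scaled:scale") <- NULL
  out
}

B <- matrix(c(0, 0, 10, 1, 1, 5), ncol = 2)  # skewed columns
normalize_midrange(B)
#      [,1] [,2]
# [1,]   -1   -1
# [2,]   -1   -1
# [3,]    1    1
```

`scale()` accepts a numeric vector with one entry per column for both `center` and `scale`, which is what makes this one-call form possible.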

Upvotes: 2

Theodore Lytras

Reputation: 3965

How about rescaling the matrix x at the end of your own function?

normalize <- function(x) { 
    x <- sweep(x, 2, apply(x, 2, min)) 
    x <- sweep(x, 2, apply(x, 2, max), "/") 
    2*x - 1
}
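
A quick check of this function on made-up data (the matrices `m` and `const` are illustrative, not from the question). It also shows one edge case worth knowing about: a constant column has zero range, so the division yields NaN.

```r
normalize <- function(x) {
  x <- sweep(x, 2, apply(x, 2, min))        # shift each column so its minimum is 0
  x <- sweep(x, 2, apply(x, 2, max), "/")   # divide by the shifted maximum, giving [0, 1]
  2 * x - 1                                 # stretch [0, 1] onto [-1, 1]
}

m <- matrix(c(0, 0, 10, 1, 1, 5), ncol = 2)
apply(normalize(m), 2, range)
#      [,1] [,2]
# [1,]   -1   -1
# [2,]    1    1

const <- cbind(c(1, 2, 3), c(7, 7, 7))     # second column is constant
normalize(const)[, 2]                      # NaN NaN NaN, because 0 / 0
```

If constant columns can occur in the data, they would need separate handling before or after the call.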

Upvotes: 4

Señor O

Reputation: 17412

How about just:

x[,1] <- (x[,1]-mean(x[,1]))/(max(x[,1])-min(x[,1]))

Most basic functions in R are vectorized, so there's no need to include a for loop in your code. This snippet will scale all of column 1 (you can also use the function scale(), although it doesn't have an option for min/max values).

To do an entire dataset, you can do something like this:

Scale <- function(y) (y-mean(y))/(max(y)-min(y))
DataFrame.Scaled <- apply(DataFrame, 2, Scale)
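
Note that Scale as written centres each column on its mean and produces values spanning a range of width 1. A minimal sketch of a variant that maps each column onto [-1, 1], assuming a hypothetical function `ScaleMinMax` and a made-up data frame:

```r
# Hypothetical variant: min-max rescaling of one vector onto [-1, 1]
ScaleMinMax <- function(y) 2 * (y - min(y)) / (max(y) - min(y)) - 1

# Made-up data for illustration
DataFrame <- data.frame(a = c(0, 0, 10), b = c(1, 1, 5))

# apply() coerces the data frame to a matrix and returns a matrix
DataFrame.Scaled <- apply(DataFrame, 2, ScaleMinMax)

# lapply() works column by column and keeps a data frame
DataFrame.Scaled.df <- as.data.frame(lapply(DataFrame, ScaleMinMax))
DataFrame.Scaled.df
#    a  b
# 1 -1 -1
# 2 -1 -1
# 3  1  1
```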

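Edit: It's also worth pointing out that it's best not to name a variable after a function. After min <- min(x), a call such as min(x) still works, because R skips non-function objects when it looks up a function being called, but the code becomes harder to read, and anything that uses min as a value (for example, passing it to sapply) will get the number instead of the function.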
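
A small illustration of the naming point, using a hypothetical function `f`. R looks specifically for a function when a name appears in call position, so the second `min(...)` call still resolves to the base function even though a local numeric called `min` exists:

```r
f <- function(x) {
  min <- min(x)          # local numeric variable that shadows the name `min`
  max <- max(x)
  second <- min(x + 1)   # still calls base min(): non-function bindings are skipped
  c(min, max, second)
}

f(c(3, 1, 2))  # 1 3 2
```

The code runs, but using distinct names such as `xmin` and `xmax` avoids the ambiguity for the reader.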

Upvotes: 1
