Reputation: 5044
I have a dataset called spam which contains 58 columns and approximately 3500 rows of data related to spam messages.
I plan on running some linear regression on this dataset in the future, but I'd like to do some pre-processing beforehand and standardize the columns to have zero mean and unit variance.
I've been told the best way to go about this is with R, so I'd like to ask: how can I standardize the columns with R? I've already got the data properly loaded and I'm just looking for some packages or methods to perform this task.
Upvotes: 267
Views: 650795
Reputation: 21
The code below could be the shortest way to achieve this.
dataframe <- as.data.frame(apply(dataframe, 2, scale))  # apply() returns a matrix, so convert back
Upvotes: 0
Reputation: 1369
The collapse package provides a fast scale function, fscale, implemented in C++ using Welford's online algorithm. In the benchmark below it is roughly an order of magnitude faster than scale:
dat <- data.frame(x = rnorm(1e6, 30, .2),
y = runif(1e6, 3, 5),
z = runif(1e6, 10, 20))
library(collapse)
library(microbenchmark)
microbenchmark(fscale(dat), scale(dat))
Unit: milliseconds
expr min lq mean median uq max neval cld
fscale(dat) 27.86456 29.5864 38.96896 30.80421 43.79045 313.5729 100 a
scale(dat) 357.07130 391.0914 489.93546 416.33626 625.38561 793.2243 100 b
Furthermore, fscale is an S3 generic for vectors, matrices and data frames, and it also supports grouped and/or weighted scaling operations, as well as scaling to arbitrary means and standard deviations.
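A short sketch of the grouped and arbitrary-mean/sd options, assuming the collapse package is installed. The argument names (g, mean, sd) follow the fscale documentation as I understand it, and the data and the grouping variable grp are made up for illustration:

```r
library(collapse)

set.seed(1)
dat <- data.frame(x = rnorm(100, 30, 0.2),
                  y = runif(100, 3, 5))
grp <- rep(c("a", "b"), each = 50)   # hypothetical grouping variable

# Scale every column to mean 10 and standard deviation 2
scaled_10_2 <- fscale(dat, mean = 10, sd = 2)

# Scale within each group, so each group has mean 0 and sd 1
scaled_by_grp <- fscale(dat, g = grp)

sapply(scaled_10_2, mean)
sapply(scaled_10_2, sd)
tapply(scaled_by_grp$x, grp, mean)
```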
Upvotes: 4
Reputation: 53
@BBKim pretty much gave the best answer, but it can be written more briefly. I'm surprised no one has come up with it yet.
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
dat <- apply(dat, 2, function(x) (x - mean(x)) / sd(x))
Upvotes: 1
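One thing to watch with apply(): it returns a matrix rather than a data frame. A minimal base-R sketch that keeps the data frame structure, using made-up data of the same shape as above:

```r
set.seed(42)
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))

# apply() coerces the data frame to a matrix
as_matrix <- apply(dat, 2, function(x) (x - mean(x)) / sd(x))
class(as_matrix)

# Assigning into dat[] keeps the data.frame class and the column names
dat[] <- lapply(dat, function(x) (x - mean(x)) / sd(x))
class(dat)
colMeans(dat)
sapply(dat, sd)
```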
Reputation: 3635
Realizing that the question is old and one answer is accepted, I'll provide another answer for reference.
scale is limited by the fact that it scales all variables. The solution below scales only the named variables while leaving the others unchanged (and the variable names could be dynamically generated):
library(dplyr)
set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2),
y = runif(10, 3, 5),
z = runif(10, 10, 20))
dat
dat2 <- dat %>% mutate_at(c("y", "z"), ~(scale(.) %>% as.vector))
dat2
which gives me this:
> dat
x y z
1 29.75859 3.633225 14.56091
2 30.05549 3.605387 12.65187
3 30.21689 3.318092 13.04672
4 29.53086 3.079992 15.07307
5 30.08582 3.437599 11.81096
6 30.10121 4.621197 17.59671
7 29.88505 4.051395 12.01248
8 29.89067 4.829316 12.58810
9 29.88711 4.662690 19.92150
10 29.82199 3.091541 18.07352
and
> dat2 <- dat %>% mutate_at(c("y", "z"), ~(scale(.) %>% as.vector))
> dat2
x y z
1 29.75859 -0.3004815 -0.06016029
2 30.05549 -0.3423437 -0.72529604
3 30.21689 -0.7743696 -0.58772361
4 29.53086 -1.1324181 0.11828039
5 30.08582 -0.5946582 -1.01827752
6 30.10121 1.1852038 0.99754666
7 29.88505 0.3283513 -0.94806607
8 29.89067 1.4981677 -0.74751378
9 29.88711 1.2475998 1.80753470
10 29.82199 -1.1150515 1.16367556
EDIT 1 (2016): Addressed Julian's comment: the output of scale is an Nx1 matrix, so ideally we should add an as.vector to convert the matrix type back into a vector type. Thanks Julian!
EDIT 2 (2019): Quoting Duccio A.'s comment: For the latest dplyr (version 0.8) you need to change dplyr::funs to list, like dat %>% mutate_each_(list(~scale(.) %>% as.vector), vars=c("y","z"))
EDIT 3 (2020): Thanks to @mj_whales: the old solution is deprecated and now we need to use mutate_at.
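In dplyr 1.0.0 and later, mutate_at() is itself superseded by across(). A minimal sketch of the same operation, assuming dplyr >= 1.0.0 is installed:

```r
library(dplyr)

set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2),
                  y = runif(10, 3, 5),
                  z = runif(10, 10, 20))

# Scale only y and z; as.vector() drops the n x 1 matrix returned by scale()
dat2 <- dat %>%
  mutate(across(c(y, z), ~ as.vector(scale(.x))))

dat2
```

Column x is left untouched, and y and z come back as plain numeric vectors.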
Upvotes: 127
Reputation: 3631
The normalize function from the BBmisc package was the right tool for me since it can deal with NA values.
Here is how to use it:
Given the following dataset,
library(data.table)
library(BBmisc)
ASR_API <- c("CV", "F", "IER", "LS-c", "LS-o")
Human <- c(NA, 5.8, 12.7, NA, NA)
Google <- c(23.2, 24.2, 16.6, 12.1, 28.8)
GoogleCloud <- c(23.3, 26.3, 18.3, 12.3, 27.3)
IBM <- c(21.8, 47.6, 24.0, 9.8, 25.3)
Microsoft <- c(29.1, 28.1, 23.1, 18.8, 35.9)
Speechmatics <- c(19.1, 38.4, 21.4, 7.3, 19.4)
Wit_ai <- c(35.6, 54.2, 37.4, 19.2, 41.7)
dt <- data.table(ASR_API,Human, Google, GoogleCloud, IBM, Microsoft, Speechmatics, Wit_ai)
> dt
ASR_API Human Google GoogleCloud IBM Microsoft Speechmatics Wit_ai
1: CV NA 23.2 23.3 21.8 29.1 19.1 35.6
2: F 5.8 24.2 26.3 47.6 28.1 38.4 54.2
3: IER 12.7 16.6 18.3 24.0 23.1 21.4 37.4
4: LS-c NA 12.1 12.3 9.8 18.8 7.3 19.2
5: LS-o NA 28.8 27.3 25.3 35.9 19.4 41.7
Normalized values can be obtained like this:
> dtn <- normalize(dt, method = "standardize", range = c(0, 1), margin = 1L, on.constant = "quiet")
> dtn
ASR_API Human Google GoogleCloud IBM Microsoft Speechmatics Wit_ai
1: CV NA 0.3361245 0.2893457 -0.28468670 0.3247336 -0.18127203 -0.16032655
2: F -0.7071068 0.4875320 0.7715885 1.59862532 0.1700986 1.55068347 1.31594762
3: IER 0.7071068 -0.6631646 -0.5143923 -0.12409420 -0.6030768 0.02512682 -0.01746131
4: LS-c NA -1.3444981 -1.4788780 -1.16064578 -1.2680075 -1.24018782 -1.46198764
5: LS-o NA 1.1840062 0.9323361 -0.02919864 1.3762521 -0.15435044 0.32382788
whereas a hand-calculated method yields only NAs for columns containing NAs:
> dt %>% mutate(normalizedHuman = (Human - mean(Human))/sd(Human)) %>%
+ mutate(normalizedGoogle = (Google - mean(Google))/sd(Google)) %>%
+ mutate(normalizedGoogleCloud = (GoogleCloud - mean(GoogleCloud))/sd(GoogleCloud)) %>%
+ mutate(normalizedIBM = (IBM - mean(IBM))/sd(IBM)) %>%
+ mutate(normalizedMicrosoft = (Microsoft - mean(Microsoft))/sd(Microsoft)) %>%
+ mutate(normalizedSpeechmatics = (Speechmatics - mean(Speechmatics))/sd(Speechmatics)) %>%
+ mutate(normalizedWit_ai = (Wit_ai - mean(Wit_ai))/sd(Wit_ai))
ASR_API Human Google GoogleCloud IBM Microsoft Speechmatics Wit_ai normalizedHuman normalizedGoogle
1 CV NA 23.2 23.3 21.8 29.1 19.1 35.6 NA 0.3361245
2 F 5.8 24.2 26.3 47.6 28.1 38.4 54.2 NA 0.4875320
3 IER 12.7 16.6 18.3 24.0 23.1 21.4 37.4 NA -0.6631646
4 LS-c NA 12.1 12.3 9.8 18.8 7.3 19.2 NA -1.3444981
5 LS-o NA 28.8 27.3 25.3 35.9 19.4 41.7 NA 1.1840062
normalizedGoogleCloud normalizedIBM normalizedMicrosoft normalizedSpeechmatics normalizedWit_ai
1 0.2893457 -0.28468670 0.3247336 -0.18127203 -0.16032655
2 0.7715885 1.59862532 0.1700986 1.55068347 1.31594762
3 -0.5143923 -0.12409420 -0.6030768 0.02512682 -0.01746131
4 -1.4788780 -1.16064578 -1.2680075 -1.24018782 -1.46198764
5 0.9323361 -0.02919864 1.3762521 -0.15435044 0.32382788
(normalizedHuman becomes a column of NAs ...)
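For comparison, base R can also cope with the NA values: scale() drops NAs when computing the mean and standard deviation, and the hand calculation works once na.rm = TRUE is passed. A small sketch using the Human column from above:

```r
Human <- c(NA, 5.8, 12.7, NA, NA)

# scale() ignores NAs when computing the mean and sd
z_scale <- as.vector(scale(Human))

# The hand calculation needs na.rm = TRUE, otherwise mean() and sd() return NA
z_manual <- (Human - mean(Human, na.rm = TRUE)) / sd(Human, na.rm = TRUE)

z_scale
z_manual
```

Both give the same values as the Human column in the normalize() output, with NA kept where the input was NA.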
Regarding the selection of specific columns for the calculation, a generic approach like this one can be used:
data_vars <- df_full %>% dplyr::select(-ASR_API,-otherVarNotToBeUsed)
meta_vars <- df_full %>% dplyr::select(ASR_API,otherVarNotToBeUsed)
data_varsn <- normalize(data_vars, method = "standardize", range = c(0, 1), margin = 1L, on.constant = "quiet")
dtn <- cbind(meta_vars,data_varsn)
Upvotes: -1
Reputation: 197
The dplyr package has two functions that do this.
> require(dplyr)
To mutate specific columns of a data table, you can use the function mutate_at(). To mutate all columns, you can use mutate_all().
The following is a brief example for using these functions to standardize data.
Mutate specific columns:
dt = data.table(a = runif(3500), b = runif(3500), c = runif(3500))
dt = data.table(dt %>% mutate_at(vars("a", "c"), scale)) # can also index columns by number, e.g., vars(c(1,3))
> apply(dt, 2, mean)
a b c
1.783137e-16 5.064855e-01 -5.245395e-17
> apply(dt, 2, sd)
a b c
1.0000000 0.2906622 1.0000000
Mutate all columns:
dt = data.table(a = runif(3500), b = runif(3500), c = runif(3500))
dt = data.table(dt %>% mutate_all(scale))
> apply(dt, 2, mean)
a b c
-1.728266e-16 9.291994e-17 1.683551e-16
> apply(dt, 2, sd)
a b c
1 1 1
Upvotes: 2
Reputation: 151
Again, even though this is an old question, it is very relevant! And I have found a simple way to normalise certain columns without the need of any packages:
normFunc <- function(x){(x-mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE)}
For example
x<-rnorm(10,14,2)
y<-rnorm(10,7,3)
z<-rnorm(10,18,5)
df<-data.frame(x,y,z)
df[2:3] <- apply(df[2:3], 2, normFunc)
You will see that the y and z columns have been normalised. No packages needed :-)
Upvotes: 11
Reputation: 6302
With dplyr v0.7.4 all variables can be scaled by using mutate_all():
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
set.seed(1234)
dat <- tibble(x = rnorm(10, 30, .2),
y = runif(10, 3, 5),
z = runif(10, 10, 20))
dat %>% mutate_all(scale)
#> # A tibble: 10 x 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 -0.827 -0.300 -0.0602
#> 2 0.663 -0.342 -0.725
#> 3 1.47 -0.774 -0.588
#> 4 -1.97 -1.13 0.118
#> 5 0.816 -0.595 -1.02
#> 6 0.893 1.19 0.998
#> 7 -0.192 0.328 -0.948
#> 8 -0.164 1.50 -0.748
#> 9 -0.182 1.25 1.81
#> 10 -0.509 -1.12 1.16
Specific variables can be excluded using mutate_at():
dat %>% mutate_at(scale, .vars = vars(-x))
#> # A tibble: 10 x 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 29.8 -0.300 -0.0602
#> 2 30.1 -0.342 -0.725
#> 3 30.2 -0.774 -0.588
#> 4 29.5 -1.13 0.118
#> 5 30.1 -0.595 -1.02
#> 6 30.1 1.19 0.998
#> 7 29.9 0.328 -0.948
#> 8 29.9 1.50 -0.748
#> 9 29.9 1.25 1.81
#> 10 29.8 -1.12 1.16
Created on 2018-04-24 by the reprex package (v0.2.0).
Upvotes: 18
Reputation: 395
scale can be used both for a full data frame and for specific columns. For specific columns, the following code can be used:
trainingSet[, 3:7] = scale(trainingSet[, 3:7]) # For column 3 to 7
trainingSet[, 8] = scale(trainingSet[, 8]) # For column 8
For the full data frame:
trainingSet <- scale(trainingSet)
Upvotes: 9
Reputation: 1573
Before I happened to find this thread, I had the same problem. I had user-dependent column types, so I wrote a for loop that goes through the columns and scales the ones that need it. There are probably better ways to do it, but this solved the problem just fine:
for (i in seq_along(df)) {
  # is.numeric() is TRUE for both double and integer columns
  if (is.numeric(df[[i]])) {
    df[[i]] <- as.vector(scale(df[[i]]))
  }
}
as.vector is a needed part, because it turned out that scale returns an n x 1 matrix (one row per observation), which is usually not what you want to have in your data.frame.
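The same idea can be written without an explicit loop. A sketch in base R with a made-up data frame; the column names are hypothetical:

```r
df <- data.frame(id    = letters[1:5],
                 count = c(1L, 4L, 2L, 8L, 5L),
                 value = c(2.5, 3.1, 0.7, 4.4, 1.9),
                 stringsAsFactors = FALSE)

# Logical index of the numeric (double or integer) columns
num_cols <- vapply(df, is.numeric, logical(1))

# Scale only those columns; as.vector() turns each n x 1 matrix back into a vector
df[num_cols] <- lapply(df[num_cols], function(col) as.vector(scale(col)))

str(df)
```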
Upvotes: 1
Reputation: 456
The caret package provides methods for preprocessing data (e.g. centering and scaling). You could use the following code:
library(caret)
# Assuming goal class is column 10
preObj <- preProcess(data[, -10], method=c("center", "scale"))
newData <- predict(preObj, data[, -10])
More details: http://www.inside-r.org/node/86978
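The point of the preProcess()/predict() split is that the centering and scaling values are estimated once and can then be applied to new data. A base-R sketch of the same idea, using the attributes that scale() attaches to its result; the train and test objects here are made up for illustration:

```r
set.seed(1)
train <- data.frame(a = rnorm(50, 10, 2), b = runif(50, 0, 100))
test  <- data.frame(a = rnorm(10, 10, 2), b = runif(10, 0, 100))

# Estimate the center and scale on the training data only
train_scaled <- scale(train)
ctr <- attr(train_scaled, "scaled:center")
scl <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test data
test_scaled <- scale(test, center = ctr, scale = scl)

colMeans(train_scaled)  # approximately 0
colMeans(test_scaled)   # close to 0, but generally not exactly 0
```

This matters for the regression use case in the question: scaling the test set with its own mean and standard deviation would put it on a slightly different scale from the data the model was fitted on.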
Upvotes: 28
Reputation: 359
When I used the solution stated by Dason, instead of getting a data frame as a result, I got a matrix of numbers (the scaled values of my df).
In case someone is having the same trouble, you have to add as.data.frame() to the code, like this:
df.scaled <- as.data.frame(scale(df))
I hope this will be useful for people having the same issue!
Upvotes: 25
Reputation: 909
This is 3 years old. Still, I feel I have to add the following:
The most common normalization is the z-transformation, where you subtract the mean and divide by the standard deviation of your variable. The result will have mean=0 and sd=1.
For that, you don't need any package.
zVar <- (myVar - mean(myVar)) / sd(myVar)
That's it.
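A tiny worked example, with made-up values for myVar, showing that the result has mean 0 and standard deviation 1 and matches the built-in scale(). Note that sd() is the sample standard deviation (divisor n - 1):

```r
myVar <- c(2, 4, 4, 4, 5, 5, 7, 9)

zVar <- (myVar - mean(myVar)) / sd(myVar)

mean(zVar)   # 0 up to floating-point error
sd(zVar)     # 1 up to floating-point error

# Same values as scale(), which returns an n x 1 matrix
all.equal(zVar, as.vector(scale(myVar)))
```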
Upvotes: 75
Reputation: 473
You can also easily normalize the data using the data.Normalization function in the clusterSim package. It provides several different methods of data normalization.
data.Normalization (x,type="n0",normalization="column")
Arguments:
x - vector, matrix or dataset
type - type of normalization:
n0 - without normalization
n1 - standardization ((x-mean)/sd)
n2 - positional standardization ((x-median)/mad)
n3 - unitization ((x-mean)/range)
n3a - positional unitization ((x-median)/range)
n4 - unitization with zero minimum ((x-min)/range)
n5 - normalization in range <-1,1> ((x-mean)/max(abs(x-mean)))
n5a - positional normalization in range <-1,1> ((x-median)/max(abs(x-median)))
n6 - quotient transformation (x/sd)
n6a - positional quotient transformation (x/mad)
n7 - quotient transformation (x/range)
n8 - quotient transformation (x/max)
n9 - quotient transformation (x/mean)
n9a - positional quotient transformation (x/median)
n10 - quotient transformation (x/sum)
n11 - quotient transformation (x/sqrt(SSQ))
n12 - normalization ((x-mean)/sqrt(sum((x-mean)^2)))
n12a - positional normalization ((x-median)/sqrt(sum((x-median)^2)))
n13 - normalization with zero being the central point ((x-midrange)/(range/2))
normalization - "column": normalization by variable, "row": normalization by object
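To make a few of the listed formulas concrete, here is a base-R sketch that transcribes n1, n2 and n4 directly. These are not calls to clusterSim. R's mad() applies a consistency constant of 1.4826 by default, and whether clusterSim uses the same constant is an assumption on my part:

```r
x <- c(3, 7, 8, 12, 15, 40)

n1 <- (x - mean(x)) / sd(x)            # n1: standardization
n2 <- (x - median(x)) / mad(x)         # n2: positional standardization
n4 <- (x - min(x)) / diff(range(x))    # n4: unitization with zero minimum

round(cbind(x, n1, n2, n4), 3)
```

The positional variant (n2) is less affected by the outlier 40 than n1, and n4 maps the data onto the interval from 0 to 1.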
Upvotes: 20
Reputation: 27
Use the package "recommenderlab". Download and install the package. It has a built-in function, normalize, which lets you choose between normalization methods, namely "center" or "Z-score". Consider the following example:
library(recommenderlab)
## create a matrix with ratings
m <- matrix(sample(c(NA,0:5),50, replace=TRUE, prob=c(.5,rep(.5/6,6))),nrow=5, ncol=10, dimnames = list(users=paste('u', 1:5, sep=''), items=paste('i', 1:10, sep='')))
## do normalization
r <- as(m, "realRatingMatrix")
# here, "center" is the default method
r_n1 <- normalize(r)
# here, "Z-score" is the method used
r_n2 <- normalize(r, method="Z-score")
r
r_n1
r_n2
## show normalized data
image(r, main="Raw Data")
image(r_n1, main="Centered")
image(r_n2, main="Z-Score Normalization")
Upvotes: 0
Reputation: 61933
I have to assume you meant to say that you wanted a mean of 0 and a standard deviation of 1. If your data is in a dataframe and all the columns are numeric, you can simply call the scale function on the data to do what you want.
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)
# check that we get mean of 0 and sd of 1
colMeans(scaled.dat) # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)
Using built-in functions is classy.
Upvotes: 642