Reputation: 103
My data frame, my.data, contains both numeric and factor variables. I want to standardise just the numeric variables in this data frame.
> mydata2=data.frame(scale(my.data, center=T, scale=T))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Could the standardising work by doing this? I want to standardise the columns 8,9,10,11 and 12 but I think I have the wrong code.
mydata=data.frame(scale(flowdis3[,c(8,9,10,11,12)], center=T, scale=T,))
Thanks in advance
Upvotes: 9
Views: 10448
Reputation: 351
Here are some options to consider, even though this answer comes late:
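# Optional housekeeping (not needed for the scaling itself)
rm(list = ls(all.names = TRUE))  # clears every object in the workspace
gc()
# memory.limit(size = 64935)  # Windows only; a no-op stub since R 4.2.0
# Set working directory (replace "path" with a real directory before running)
# setwd("path")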
# Example data frame
df <- data.frame("Age" = c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39),
                 "Name" = c("Christine", "Kim", "Kevin", "Aishwarya", "Rafel", "Bettina", "Joshua", "Afreen", "Wang", "Kerubo"),
                 "Salary in $" = c(2137.52, 1515.79, 2212.81, 2500.28, 2660, 4567.45, 2733, 3314, 5757.11, 4435.99),
                 "Gender" = c("Female", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
                 "Height in cm" = c(172, 166, 191, 169, 179, 177, 181, 155, 154, 183),
                 "Weight in kg" = c(60, 70, 88, 48, 71, 51, 65, 44, 53, 91),
                 stringsAsFactors = TRUE)  # needed in R >= 4.0 so Name and Gender become factors
Let us check the structure of df:
str(df)
'data.frame': 10 obs. of 6 variables:
$ Age : num 21 19 25 34 45 63 39 28 50 39
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num 2138 1516 2213 2500 2660 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num 172 166 191 169 179 177 181 155 154 183
$ Weight.in.kg: num 60 70 88 48 71 51 65 44 53 91
We see that Age, Salary, Height and Weight are numeric and Name and Gender are categorical (factor variables).
Let us scale just the numeric variables using only base R:
1) Option: (slight modification of what akrun has proposed here)
start_time1 <- Sys.time()
df1 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
(x-mean(x))/sd(x)
} else x))
end_time1 <- Sys.time()
end_time1 - start_time1
Time difference of 0.02717805 secs
str(df1)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
2) Option: (akrun's approach)
start_time2 <- Sys.time()
df2 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
scale(x, center=TRUE, scale=TRUE)
} else x))
end_time2 <- Sys.time()
end_time2 - start_time2
Time difference of 0.02599907 secs
str(df2)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
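Options 1 and 2 print the same numbers, and a quick check can confirm that they agree. The following is only a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the `df` above):

```r
# Small illustrative data frame (hypothetical values)
toy <- data.frame(
  age    = c(21, 19, 25, 34, 45),
  gender = factor(c("F", "F", "M", "F", "M")),
  weight = c(60, 70, 88, 48, 71)
)

# Option 1: manual z-score, (x - mean) / sd
manual <- as.data.frame(lapply(toy, function(x) {
  if (is.numeric(x)) (x - mean(x)) / sd(x) else x
}))

# Option 2: scale() inside lapply()
scaled <- as.data.frame(lapply(toy, function(x) {
  if (is.numeric(x)) scale(x, center = TRUE, scale = TRUE) else x
}))

# Compare the numeric columns of both results
same <- isTRUE(all.equal(manual$age, as.vector(scaled$age))) &&
  isTRUE(all.equal(manual$weight, as.vector(scaled$weight)))
print(same)
```

Both approaches divide by the sample standard deviation (`sd()`), which is why they match.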
3) Option:
start_time3 <- Sys.time()
indices <- sapply(df, is.numeric)
df3 <- df
df3[indices] <- lapply(df3[indices], scale)
end_time3 <- Sys.time()
end_time3 - start_time3
str(df3)
'data.frame': 10 obs. of 6 variables:
$ Age : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
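The "scaled:center" and "scaled:scale" attributes seen in this output hold the mean and standard deviation that `scale()` used, so they can be used to get the original values back. A minimal sketch, assuming a made-up vector `x` (not a column of `df`):

```r
# Hypothetical numeric vector
x <- c(21, 19, 25, 34, 45)

# scale() returns a one-column matrix carrying two attributes
z <- scale(x)
centre <- attr(z, "scaled:center")  # the mean that was subtracted
spread <- attr(z, "scaled:scale")   # the standard deviation used as divisor

# Undo the standardisation: multiply by the sd, then add the mean back
restored <- as.vector(z) * spread + centre
print(all.equal(restored, x))
```

This is one reason you might prefer to keep the matrix form: the attributes are lost once the result is turned into a plain vector.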
4) Option (using tidyverse and invoking dplyr):
library(tidyverse)
start_time4 <- Sys.time()
df4 <-df %>% dplyr::mutate_if(is.numeric, scale)
end_time4 <- Sys.time()
end_time4 - start_time4
Time difference of 0.012043 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
$ Age : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
Which option to choose depends on the output structure you need and on speed. Suppose your data is unbalanced and you want to balance it before doing classification. After scaling with option 3 or 4, the numeric variables (Age, Salary, Height and Weight) are stored as one-column matrices rather than plain numeric vectors, and that will cause problems. I mean,
str(df4$Age)
num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
- attr(*, "scaled:center")= num 36.3
- attr(*, "scaled:scale")= num 13.8
For example, the ROSE package (which balances data) accepts only int, factor and num columns, so it will throw an error on these matrix columns.
To avoid this issue, the numeric variables after scaling can be saved as vectors instead of a column matrix by:
library(tidyverse)
start_time4 <- Sys.time()
df4 <-df %>% dplyr::mutate_if(is.numeric, ~scale (.) %>% as.vector)
end_time4 <- Sys.time()
end_time4 - start_time4
which gives:
Time difference of 0.01400399 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
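The same fix also works without dplyr, if you would rather stay in base R as in option 3. This is just a sketch on a small made-up data frame (`toy`, `num_cols` and `toy_scaled` are hypothetical names):

```r
# Small illustrative data frame (hypothetical values)
toy <- data.frame(
  age    = c(21, 19, 25, 34, 45),
  gender = factor(c("F", "F", "M", "F", "M")),
  weight = c(60, 70, 88, 48, 71)
)

# Logical index of the numeric columns
num_cols <- vapply(toy, is.numeric, logical(1))

# Scale each numeric column and drop the matrix shape and attributes
toy_scaled <- toy
toy_scaled[num_cols] <- lapply(toy_scaled[num_cols],
                               function(x) as.vector(scale(x)))

str(toy_scaled)
```

`as.vector()` removes the dimensions and the "scaled:center" / "scaled:scale" attributes, so the columns are ordinary `num` vectors again.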
Upvotes: 0
Reputation: 99
You can use the dplyr package to do this:
library(dplyr)
mydata2 %>% mutate_if(is.numeric, scale)
Upvotes: 4
Reputation: 887951
Here is one option to standardize the numeric columns while leaving the others unchanged:
mydata[] <- lapply(mydata, function(x) if(is.numeric(x)){
scale(x, center=TRUE, scale=TRUE)
} else x)
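If you only want particular columns by position (the question mentions columns 8 to 12), the same idea can be restricted to an index vector. A minimal sketch, assuming a made-up data frame `toy` and hypothetical positions in `cols`; with the real data you would use something like `cols <- 8:12`:

```r
# Small illustrative data frame (hypothetical values)
toy <- data.frame(
  id     = 1:5,
  age    = c(21, 19, 25, 34, 45),
  gender = factor(c("F", "F", "M", "F", "M")),
  weight = c(60, 70, 88, 48, 71)
)

# Positions of the columns to standardise (hypothetical)
cols <- c(2, 4)

# Guard: stop early if any selected column is not numeric
stopifnot(all(vapply(toy[cols], is.numeric, logical(1))))

# Scale only the selected columns; every other column is left as it is
toy[cols] <- lapply(toy[cols], function(x) as.vector(scale(x)))

str(toy)
```

Here `id` is numeric but is not scaled, because it is not among the selected positions.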
Upvotes: 10