Reputation: 10979
I have about 30 lines of code that do just this (getting Z scores):
data$z_col1 <- (data$col1 - mean(data$col1, na.rm = TRUE)) / sd(data$col1, na.rm = TRUE)
data$z_col2 <- (data$col2 - mean(data$col2, na.rm = TRUE)) / sd(data$col2, na.rm = TRUE)
data$z_col3 <- (data$col3 - mean(data$col3, na.rm = TRUE)) / sd(data$col3, na.rm = TRUE)
data$z_col4 <- (data$col4 - mean(data$col4, na.rm = TRUE)) / sd(data$col4, na.rm = TRUE)
data$z_col5 <- (data$col5 - mean(data$col5, na.rm = TRUE)) / sd(data$col5, na.rm = TRUE)
Is there some way, maybe using apply() or something, that I can just essentially do (python):
for col in ['col1', 'col2', 'col3']:
    data[col] = ...  # z score code here
Thanks R friends.
Upvotes: 3
Views: 8860
Reputation: 138
Check this out: I iterate over the columns of the data frame and print how many rows have NA in each one.
for (i in names(houseDF)) {
  print(i)
  print(nrow(houseDF[is.na(houseDF[i]), ]))
  print("---------------------")
}
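The same per-column NA counts could also be obtained without a loop. This is only a sketch: the houseDF below is a made-up stand-in (the original data isn't shown), and na_counts is a name introduced for illustration.

```r
# Hypothetical stand-in for houseDF; columns and values are invented.
houseDF <- data.frame(price = c(100, NA, 300), rooms = c(NA, NA, 4))

# is.na() on a data frame gives a logical matrix; colSums() counts the
# TRUE values (the NAs) in each column.
na_counts <- colSums(is.na(houseDF))
print(na_counts)
```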
Upvotes: 0
Reputation: 115515
A data.frame is a list, thus you can use lapply. Don't use apply on a data.frame as this will coerce to a matrix.
lapply(data, function(x) (x - mean(x,na.rm = TRUE))/sd(x, na.rm = TRUE))
Or you could use scale, which performs this calculation on a vector.
lapply(data, scale)
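Note that lapply returns a list rather than modifying data. A minimal sketch of how that list could be written back as the z_ columns from the question, assuming a small made-up data frame (the values and the zscore helper are hypothetical, for illustration only):

```r
# Hypothetical example data; column names mirror the question.
data <- data.frame(col1 = c(1, 2, 3, NA), col2 = c(10, 20, 30, 40))

# Illustrative helper: z-score of one numeric vector, ignoring NAs.
zscore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

cols <- c("col1", "col2")

# lapply returns one list element per column; assigning the list to
# several new column names adds them all in one step.
data[paste0("z_", cols)] <- lapply(data[cols], zscore)
print(data)
```

Restricting to cols leaves any non-numeric columns untouched.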
You can translate the Python-style approach directly:
for (col in names(data)) {
  data[[col]] <- scale(data[[col]])
}
Note that this approach is not memory efficient in R, as [[<-.data.frame copies the entire data.frame each time.
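One more thing to be aware of with the loop: scale() returns a one-column matrix (carrying scaled:center and scaled:scale attributes) even when given a plain vector, so each column ends up stored as a matrix. A sketch of one way to keep ordinary numeric columns, using made-up data:

```r
# Hypothetical example data.
data <- data.frame(col1 = c(1, 2, 3, NA), col2 = c(10, 20, 30, 40))

for (col in names(data)) {
  # as.numeric() drops the matrix dimensions and attributes added by scale()
  data[[col]] <- as.numeric(scale(data[[col]]))
}
str(data)
```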
Upvotes: 6
Reputation: 3561
I think you're right, apply() may be the way to go here.
For example:
data <- array(1:20, dim=c(4, 5))
data.zscores <- apply(data, 2, function(x)
(x-mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE))
The function apply() takes a matrix or array as its first argument. The "2" is the dimension the function is iterated over, which in our case is columns. If we wanted to do it by row, we'd use "1". Lastly, we have the function we want to apply to each column. See ?apply for more details.
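To make the difference between the two margins concrete, here is a small sketch reusing the same array (zscore, by_col and by_row are names introduced for illustration):

```r
data <- array(1:20, dim = c(4, 5))

# Illustrative helper: z-score of one numeric vector, ignoring NAs.
zscore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Margin 2: each column is standardized; the result is again 4 x 5.
by_col <- apply(data, 2, zscore)

# Margin 1: each row is standardized. apply() stores each row's result
# as a column, so transpose to get back the original 4 x 5 layout.
by_row <- t(apply(data, 1, zscore))

print(round(colMeans(by_col), 10))
print(round(rowMeans(by_row), 10))
```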
Upvotes: 2