Phil Lewis

Reputation: 33

Handling a scale()-d dataset in R

Apologies in advance for this incredibly dumb question, but after scaling a dataset, I'm a bit baffled by the way the column sums behave. Anyone have a quick answer for me?

data("USArrests")
df <- USArrests
df  <-  scale(df)
sum(df[,1])
# -3.833739e-15
sum(df[1:50,1])
# -3.833739e-15

sum(df[1:49,1])
# 0.2268391
sum(df[50,1])
# -0.2268391

sum(df[2:50,1])
# -1.242564
sum(df[1,1])
# 1.242564

Something similar happens with mean(): the mean of a whole column gives me an insane value, but the mean after removing one row doesn't. I'm feeling incredibly dumb this morning and need a hand to get past this.

Upvotes: 0

Views: 90

Answers (1)

tomasu

Reputation: 1438

It's important to understand what scale() is doing to your data. I've pulled an example from https://stackoverflow.com/a/20256272/11167644 to explain:

set.seed(1)
x <- runif(6)

x
#> [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819 0.8983897

(x - mean(x)) / sd(x)
#> [1] -0.8717643 -0.5287394  0.1170895  1.1960620 -1.0771210  1.1644732

scale(x)[1:6]
#> [1] -0.8717643 -0.5287394  0.1170895  1.1960620 -1.0771210  1.1644732

Your data is being scaled and centered around zero - we can further verify this by looking at the summary() of both the unscaled and scaled data sets:

data("USArrests")

df <- USArrests

summary(df)
#>      Murder          Assault         UrbanPop          Rape      
#>  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
#>  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
#>  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
#>  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
#>  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
#>  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

summary(scale(df))
#>      Murder           Assault           UrbanPop             Rape        
#>  Min.   :-1.6044   Min.   :-1.5090   Min.   :-2.31714   Min.   :-1.4874  
#>  1st Qu.:-0.8525   1st Qu.:-0.7411   1st Qu.:-0.76271   1st Qu.:-0.6574  
#>  Median :-0.1235   Median :-0.1411   Median : 0.03178   Median :-0.1209  
#>  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
#>  3rd Qu.: 0.7949   3rd Qu.: 0.9388   3rd Qu.: 0.84354   3rd Qu.: 0.5277  
#>  Max.   : 2.2069   Max.   : 1.9948   Max.   : 1.75892   Max.   : 2.6444
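The same check can be done without summary(). This is a minimal sketch: the names z, col_means and col_sds are just illustrative. It computes the column means and standard deviations of the scaled data directly, and reads back the centering and scaling values that scale() stores as attributes on its result.

```r
data("USArrests")
z <- scale(USArrests)

# Column means of the scaled data: zero up to floating-point rounding
col_means <- colMeans(z)

# Column standard deviations of the scaled data: one, since scale()
# divided each centered column by its own sd()
col_sds <- apply(z, 2, sd)

round(col_means, 10)
round(col_sds, 10)

# scale() keeps the values it subtracted and divided by as attributes
attr(z, "scaled:center")  # the original column means
attr(z, "scaled:scale")   # the original column standard deviations
```

Rounding to 10 digits is only there to hide the tiny rounding residue in the means.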

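Again, note the mean of zero - since the sum is just the mean times the number of rows, each scaled column sums to zero. The -3.833739e-15 in your output is zero up to floating-point rounding error, not an insane value. It also follows that dropping one row leaves a sum equal to minus that row's value, which is exactly what your 1:49 vs. 50 and 2:50 vs. 1 results show.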
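To check this numerically, here is a small sketch (the variable names murder and total are illustrative only). It compares the column sum to zero within a tolerance using all.equal() rather than with ==, and confirms that leaving out row 50 gives a sum equal to minus the value in row 50.

```r
data("USArrests")
z <- scale(USArrests)
murder <- z[, "Murder"]

total <- sum(murder)

# An exact comparison is likely FALSE because of rounding error
total == 0

# Comparing within a tolerance treats the tiny residue as zero
isTRUE(all.equal(total, 0))

# Dropping row 50 leaves a sum equal to minus the value in row 50
isTRUE(all.equal(sum(murder[-50]), -murder[[50]]))
```

all.equal() uses a default tolerance of about 1.5e-8, so a residue on the order of 1e-15 counts as equal to zero.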

Finally, we can compare the scaled and unscaled data visually with some histograms:

library(tidyverse)

df %>% 
  select(Murder) %>% 
  # scale() returns a one-column matrix; as.numeric() turns it into a plain vector
  mutate(Scaled_Murder = as.numeric(scale(Murder))) %>% 
  pivot_longer(everything()) %>% 
  ggplot(aes(value, fill = name)) +
  geom_histogram(alpha = 0.75, position = "identity", bins = 20)

Created on 2021-03-02 by the reprex package (v0.3.0)

Upvotes: 1
