Reputation: 33
Apologies in advance for this incredibly dumb question, but after scaling a dataset, I'm a bit baffled by the way the column sums behave. Anyone have a quick answer for me?
data("USArrests")
df <- USArrests
df <- scale(df)
sum(df[,1])
# -3.833739e-15
sum(df[1:50,1])
# -3.833739e-15
sum(df[1:49,1])
# 0.2268391
sum(df[50,1])
# -0.2268391
sum(df[2:50,1])
# -1.242564
sum(df[1,1])
# 1.242564
Something similar happens with mean(): taking the mean of a whole column gives me an insane value, whereas removing one row doesn't. I'm feeling incredibly dumb this morning and need a hand to get past this.
Upvotes: 0
Views: 90
Reputation: 1438
It's important to understand what scale() is doing to your data. I've pulled an example from https://stackoverflow.com/a/20256272/11167644 to explain:
set.seed(1)
x <- runif(6)
x
#> [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819 0.8983897
(x - mean(x)) / sd(x)
#> [1] -0.8717643 -0.5287394 0.1170895 1.1960620 -1.0771210 1.1644732
scale(x)[1:6]
#> [1] -0.8717643 -0.5287394 0.1170895 1.1960620 -1.0771210 1.1644732
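The same check can be run on the whole data set. The following is a minimal sketch, assuming the default scale() arguments (center = TRUE, scale = TRUE); the names scaled and manual are made up for illustration. It reads the column means and standard deviations that scale() stores as attributes and rebuilds the result by hand:

```r
data("USArrests")

# scale() returns a matrix and records what it subtracted and divided by
scaled <- scale(USArrests)
attr(scaled, "scaled:center")  # the column means that were subtracted
attr(scaled, "scaled:scale")   # the column standard deviations used as divisors

# Rebuild the same result by hand: subtract each column mean, divide by each column sd
centered <- sweep(as.matrix(USArrests), 2, colMeans(USArrests), "-")
manual   <- sweep(centered, 2, apply(USArrests, 2, sd), "/")

# Compare values only, ignoring the extra attributes scale() attaches
all.equal(manual, scaled, check.attributes = FALSE)
```

The comparison uses all.equal() rather than identical() because the two computations can differ in the last few bits.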
Your data is being scaled and centered around zero - we can further verify this by looking at the summary() of both the unscaled and scaled data sets:
data("USArrests")
df <- USArrests
summary(df)
#>      Murder          Assault         UrbanPop          Rape
#>  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30
#>  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07
#>  Median : 7.250   Median :159.0   Median :66.00   Median :20.10
#>  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23
#>  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18
#>  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
summary(scale(df))
#>      Murder           Assault           UrbanPop             Rape
#>  Min.   :-1.6044   Min.   :-1.5090   Min.   :-2.31714   Min.   :-1.4874
#>  1st Qu.:-0.8525   1st Qu.:-0.7411   1st Qu.:-0.76271   1st Qu.:-0.6574
#>  Median :-0.1235   Median :-0.1411   Median : 0.03178   Median :-0.1209
#>  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000
#>  3rd Qu.: 0.7949   3rd Qu.: 0.9388   3rd Qu.: 0.84354   3rd Qu.: 0.5277
#>  Max.   : 2.2069   Max.   : 1.9948   Max.   : 1.75892   Max.   : 2.6444
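Again, note the mean of zero: the mean is the sum divided by the number of rows, so a column with a mean of zero must sum to zero. The -3.833739e-15 in your output is zero up to floating-point rounding error. Any subset of rows then sums to the negative of the rows you left out, which is why dropping row 1 or row 50 gives a non-zero total.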
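To make the rounding point concrete, here is a small sketch rather than a definitive test; the variable names and the 1e-12 tolerance are arbitrary choices for illustration. It checks that every column sum is zero within tolerance, and that the first 49 rows of Murder cancel against row 50:

```r
data("USArrests")
scaled <- scale(USArrests)

# Column sums are tiny non-zero numbers left over from floating-point arithmetic
col_sums <- colSums(scaled)
col_sums

# Compare against zero with a tolerance instead of using ==
all(abs(col_sums) < 1e-12)

# A partial sum is the negative of the rows that were left out
partial  <- sum(scaled[1:49, "Murder"])
left_out <- scaled[50, "Murder"]
abs(partial + left_out) < 1e-12
```

The same applies to mean(): a result on the order of 1e-17 is just zero as far as double-precision arithmetic is concerned.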
Finally we can look visually at what the scaled vs. unscaled data looks like with some histograms:
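library(tidyverse)
# Overlay histograms of the raw and scaled Murder values
df %>%
  select(Murder) %>%
  # scale() returns a one-column matrix; as.numeric() keeps it a plain vector for pivot_longer()
  mutate(Scaled_Murder = as.numeric(scale(Murder))) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(value, fill = name)) +
  geom_histogram(alpha = 0.75, position = "identity", bins = 20)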
Created on 2021-03-02 by the reprex package (v0.3.0)
Upvotes: 1