Reputation: 3228
Output of dput() for the data frame:
structure(list(aptitude = c(78, 85, 69, 80, 60, 72, 77, 65, 70,
80, 75, 83, 81, 65, 77, 76, 64, 68, 74, 85, 83, 80, 62, 69, 66,
75, 68, 70), performance = c(74, 59, 59, 60, 55, 62, 59, 64,
50, 64, 60, 59, 51, 64, 58, 49, 43, 62, 49, 59, 59, 60, 43, 62,
49, 64, 38, 74)), class = "data.frame", row.names = c(NA, -28L
))
I have run a correlation on this dataset using the following command:
# Run correlation of aptitude and performance:
library(dplyr)        # provides the %>% pipe
library(correlation)  # provides correlation()
hw %>%
  correlation() # r = .28, p value = .145
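For anyone who wants to reproduce the figures without the extra packages, here is a minimal base-R sketch. It assumes the data frame is named hw, as in the pipe above, and rebuilds it from the dput() output; cor.test() is used in place of correlation(), so the printed layout will differ even though the Pearson estimate should be the same.

```r
# Rebuild the data frame from the dput() output (named hw here, as in the question)
hw <- data.frame(
  aptitude = c(78, 85, 69, 80, 60, 72, 77, 65, 70, 80, 75, 83, 81, 65,
               77, 76, 64, 68, 74, 85, 83, 80, 62, 69, 66, 75, 68, 70),
  performance = c(74, 59, 59, 60, 55, 62, 59, 64, 50, 64, 60, 59, 51, 64,
                  58, 49, 43, 62, 49, 59, 59, 60, 43, 62, 49, 64, 38, 74)
)

# Pearson correlation with a two-sided t-test of H0: rho = 0
res <- cor.test(hw$aptitude, hw$performance, method = "pearson")

r <- unname(res$estimate)  # sample correlation
p <- res$p.value           # p value of the test

print(round(c(r = r, p = p), 3))

# The smallest observed aptitude score sits at the cutoff described below
print(min(hw$aptitude))
```

This only reproduces the uncorrected coefficient; it does not apply any range-restriction correction.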
However, the aptitude variable has a cutoff of 60: the minimum possible value of aptitude is 60, and no scores below it can occur. Given this, I would like to correct the correlation for that restriction in some way.
I looked for R packages or functions that correct for range restriction, but I can't find anything that matches. RDocumentation lists rCCr and rangeCorrection, but as far as I can tell they are no longer available.
Any help would be great!
Upvotes: 0
Views: 138
Reputation: 10375
The interval your data sit in does not, in itself, matter when computing a correlation coefficient. If one variable lies in [0, 100] while the other lies in [0, Inf) or [100, 200], or some other range, shifting either one to a different interval won't affect the coefficient.
This may be easier to see with some made-up data, where Y and X both lie roughly in the range [1, 100].
y <- rnorm(100) + seq(1, 100, 1)
x <- rnorm(100) + seq(1, 100, 1)
plot(y ~ x)
cor(y, x)
[1] 0.9988158
The relationship is almost perfectly linear, so the Pearson correlation is very high. Now shift one of the variables, for example Y, so that it ranges over roughly [101, 200], while keeping the other as is.
cor(y + 100, x)
[1] 0.9988158
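It makes no difference. Why? Because you are just adding a constant to a random variable, and that changes neither its variance, Var(a + Y) = Var(Y), nor its covariance with the other variable, Cov(a + Y, X) = Cov(Y, X). Those are the only quantities that enter the correlation coefficient, Cor(Y, X) = Cov(Y, X) / sqrt(Var(Y) * Var(X)), so the coefficient stays the same.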
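The invariance argument can also be checked numerically. The sketch below regenerates made-up data like the example above, with an arbitrary seed (my addition, so the run is repeatable) and a shift a = 100. It compares the variance, covariance and correlation before and after the shift, using all.equal() to allow for floating-point rounding.

```r
set.seed(1)  # arbitrary seed, only so the check is repeatable

# Made-up data, as in the example above: a linear trend plus noise
x <- rnorm(100) + seq(1, 100, 1)
y <- rnorm(100) + seq(1, 100, 1)

a <- 100  # constant added to y

# Compare each quantity before and after the shift
var_same <- isTRUE(all.equal(var(y + a), var(y)))
cov_same <- isTRUE(all.equal(cov(y + a, x), cov(y, x)))
cor_same <- isTRUE(all.equal(cor(y + a, x), cor(y, x)))

print(c(var = var_same, cov = cov_same, cor = cor_same))
```

All three comparisons should come back TRUE. Note that this checks only a shift by a constant, which is the case discussed here.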
Upvotes: 1