Reputation: 531
New to R, and I have two data sets -- they have the same x-axis values, but the y values differ.
I'm trying to find the correlation between the two. When I use R to draw the ablines through the scatter plot, it gives me two lines of best fit, with one data set seemingly higher than the other -- but I'd really like the p-value for the difference between the two data sets, to know the size of the effect.
After looking it up, it seems like I should use t.test -- but I'm unsure how to run the two data sets against each other.
For example, if I run:
t.test(t1$xaxis,t1$yaxis1)
t.test(t2$xaxis,t2$yaxis2)
It gives me the correct means of x and y (t1: 16.84, 88.58 and t2: 14.79, 86.14) -- but I'm not sure what to make of the rest:
t1: t = -43.8061, df = 105.994, p-value < 2.2e-16
t2: t = -60.1593, df = 232.742, p-value < 2.2e-16
Obviously the p-values given are microscopic, but they describe each data set individually -- I don't know how to make the test tell me about the data sets' relationship with each other.
Any help is greatly appreciated -- thanks!
Upvotes: 0
Views: 5237
Reputation: 132706
Since you asked for it, here is how I understand your problem.
You have two groups of y values corresponding to identical x values. Here I assume that the relationship between y and x is linear; if it isn't, you could transform your variables, use a non-linear model, an additive model, etc. (see the sketch at the end of this answer).
First let's simulate some data since you don't provide any:
set.seed(42)
x <- 1:20
# two groups sharing the same x, with different intercepts and slopes
y1 <- 2.5 + 3 * x + rnorm(20)
y2 <- 4 + 2.5 * x + rnorm(20)
plot(y1 ~ x, col = "blue", ylab = "y")
points(y2 ~ x, col = "red")
legend("topleft", legend = c("y1", "y2"), col = c("blue", "red"), pch = 1)
Now, we want to know if the two samples differ. We can find out by fitting a model:
# stack y1 and y2 into long format (40 rows) and attach x (recycled)
DF <- cbind(stack(cbind.data.frame(y1, y2)), x)
names(DF) <- c("y", "group", "x")
# interaction model: a separate intercept and slope for each group
fit <- lm(y ~ x * group, data = DF)
summary(fit)
Call:
lm(formula = y ~ x * group, data = DF)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2585 -0.4603 -0.1899  0.9008  2.2127 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.51769    0.55148   6.379 2.17e-07 ***
x            2.92136    0.04604  63.457  < 2e-16 ***
groupy2      0.67218    0.77991   0.862    0.394    
x:groupy2   -0.46525    0.06511  -7.146 2.11e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.187 on 36 degrees of freedom
Multiple R-squared: 0.9949,  Adjusted R-squared: 0.9945
F-statistic: 2333 on 3 and 36 DF,  p-value: < 2.2e-16
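Instead of reading the p-values off the printout, you can also pull them out of the fit programmatically; a small sketch using base R's coefficient table (the row names match the summary above):
# coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
ctab <- coef(summary(fit))
ctab["x:groupy2", "Pr(>|t|)"]  # p-value for the slope difference
ctab["groupy2", "Pr(>|t|)"]    # p-value for the intercept difference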
The intercepts are not significantly different, but the slopes are. Whether group has a significant effect overall is best tested by comparing this fit against a model that ignores group:
# reduced model without group, compared via an F-test
fit0 <- lm(y ~ x, data = DF)
anova(fit0, fit)
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ x * group
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     38 300.196                                  
2     36  50.738  2    249.46 88.498 1.267e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As you see, the samples are different.
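If the relationship were not linear, the same model-comparison idea carries over; here is a rough sketch of the alternatives mentioned at the top. The formulas are illustrative, not tuned to any real data, and the last line needs the mgcv package:
# transformed response (y must be positive for the log)
fit_log <- lm(log(y) ~ x * group, data = DF)
# quadratic trend, separately per group
fit_poly <- lm(y ~ poly(x, 2) * group, data = DF)
# additive model: one smooth of x per group, plus a group intercept
library(mgcv)
fit_gam <- gam(y ~ s(x, by = group) + group, data = DF)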
Upvotes: 2
Reputation: 1013
Judging by your comments above, it looks like you are after a two-sample test of means. If so:
set.seed(1)
y1 <- rnorm(100)
y2 <- rnorm(120, mean = 0.1)
# Welch two-sample t-test for a difference in means
results <- t.test(y1, y2)
results$p.value
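Since your two y vectors line up on the same x values, a paired test may be more appropriate than the unpaired one above; a minimal sketch reusing the column names from your question (paired tests require equal-length vectors in matching order):
# paired t-test: compares the y1 - y2 difference at each x
t.test(t1$yaxis1, t2$yaxis2, paired = TRUE)$p.value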
Upvotes: 1
Reputation: 99331
You can easily find the correlation between variables with the cor function. In this case, I use a data frame first, then a matrix. We can easily see the strength of the relationships between the variables.
> d <- data.frame(y1 = runif(10), y2 = rnorm(10), y3 = rexp(10))
> cor(d)
##            y1         y2         y3
## y1  1.0000000 -0.3319495 -0.4013154
## y2 -0.3319495  1.0000000  0.1370312
## y3 -0.4013154  0.1370312  1.0000000
Using a matrix,
> m <- matrix(c(runif(10), rnorm(10), rexp(10)), 10, 3)
> cor(m)
##            [,1]       [,2]      [,3]
## [1,]  1.0000000 -0.1971826 0.3622307
## [2,] -0.1971826  1.0000000 0.4973368
## [3,]  0.3622307  0.4973368 1.0000000
Please see example(cor) for more.
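Note that cor only gives the coefficient; for the p-value the original question asks about, cor.test (also in base R's stats package) tests whether a correlation differs from zero:
# correlation test between two columns of d
ct <- cor.test(d$y1, d$y2)
ct$estimate  # the correlation coefficient
ct$p.value   # p-value for H0: correlation = 0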
Upvotes: 1
Reputation: 1041
Did you think about merging the data sets on their x-axis values, so that your data structure becomes:
X Y1 Y2
Then you can find the correlation between any of the columns you want.
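A minimal sketch of that merge, reusing the t1/t2 object and column names from the question (this assumes each x value appears once per data set):
# join the two data sets on their shared x values
merged <- merge(t1, t2, by = "xaxis")
# correlation between the two y columns, with a p-value
cor(merged$yaxis1, merged$yaxis2)
cor.test(merged$yaxis1, merged$yaxis2)$p.value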
Upvotes: 1