vieplivee

Reputation: 121

How to calculate correlation of two variables in a huge data set in R?

I've got a huge data set with six columns (call them A, B, C, D, E, F) and about 450,000 rows. I tried to compute the correlation between columns A and B:

cor(A, B)

and I got

[1] NA

as a result. What can I do to fix this problem?

Upvotes: 6

Views: 2504

Answers (2)

Iain

Reputation: 1638

You might consider using the rcorr function in the Hmisc package.

It is very fast and uses only pairwise complete observations. The returned object contains three matrices:

  1. the correlation coefficients
  2. the number of observations used for each correlation
  3. the p-value for each correlation
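As a rough illustration of those three matrices, here is a minimal sketch. It assumes the Hmisc package is installed, and the data are made-up stand-ins for the question's columns A and B, not the asker's actual data.

```r
# Assumes Hmisc is installed: install.packages("Hmisc")
library(Hmisc)

# Hypothetical stand-in data for columns A and B
set.seed(3)
A <- rnorm(200)
B <- A + rnorm(200)
A[1:5] <- NA                 # a few missing values

# rcorr() takes a numeric matrix; rows with an NA are dropped per pair
res <- rcorr(cbind(A, B))    # default type = "pearson"

res$r["A", "B"]   # correlation coefficient
res$n["A", "B"]   # number of pairs actually used (195 here)
res$P["A", "B"]   # p-value for that correlation
```

With all six columns in the matrix, the same call would return the full 6 x 6 versions of r, n and P.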

Some example code is available here:

Upvotes: 4

Iterator

Reputation: 20570

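Try cor(A, B, use = "pairwise.complete.obs"). By default, cor returns NA as soon as either vector contains an NA; with this option the correlation is computed only from the rows where both A and B are present.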
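A small sketch of the difference, using made-up vectors in place of the real columns A and B (the data and the helper names ok, r_pair and r_manual are only for illustration):

```r
# Hypothetical stand-in data; A and B mimic the question's columns
set.seed(1)
A <- rnorm(1000)
B <- 0.5 * A + rnorm(1000)
A[c(3, 50)] <- NA            # introduce a few missing values
B[c(7, 50)] <- NA

cor(A, B)                    # NA, because the default is use = "everything"

# Use only the rows where both values are present
r_pair <- cor(A, B, use = "pairwise.complete.obs")

# For two vectors this matches dropping incomplete rows by hand
ok <- complete.cases(A, B)
r_manual <- cor(A[ok], B[ok])
all.equal(r_pair, r_manual)  # TRUE
```

For a single pair of vectors, "complete.obs" gives the same number; the two options only differ when cor is applied to a matrix or data frame with more than two columns.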

To be statistically rigorous, you should also check how many entries are missing in your data and whether the missing-at-random assumption is plausible.
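One possible way to start that check, sketched on a made-up data frame df (hypothetical; substitute your own). The last step is only a crude diagnostic, not a formal test of the missing-at-random assumption.

```r
# Hypothetical data frame standing in for the real one
set.seed(2)
df <- data.frame(A = rnorm(500), B = rnorm(500))
df$A[sample(500, 20)] <- NA
df$B[sample(500, 35)] <- NA

# Number of missing entries per column
n_missing <- colSums(is.na(df))
n_missing

# Number of rows usable for cor(A, B) after pairwise deletion
n_complete <- sum(complete.cases(df))
n_complete

# Crude check: is the mean of A similar whether or not B is missing?
# A large gap would suggest the missingness depends on A.
tapply(df$A, is.na(df$B), mean, na.rm = TRUE)
```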

Edit 1: Take a look at ?cor to see other options for the use parameter.

Upvotes: 13
