japem

Reputation: 1111

cor() function in R with a subset

I have a table in R with three columns. I want the correlation between the first two columns, computed only over the rows where the third column meets a condition (all values are numeric, and I want the rows where the third column is greater than a certain number). The cor() function doesn't seem to have an argument for defining such a subset.

I know that I could use summary(lm()) and take the square root of the R^2, but I'm doing this inside a for loop and appending each correlation to a separate list. I can't easily append part of the regression summary to a list.
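For reference, here is a minimal sketch of the regression route mentioned above. It uses the built-in airquality data as a stand-in (an assumption, not the asker's table), with Temp, Wind and Month playing the roles of the three columns. The R^2 can be pulled out of the summary as a plain number, but its square root is always non-negative, so the sign has to be restored from the slope.

# Assumption: airquality stands in for the asker's data.
fit <- lm(Temp ~ Wind, data = airquality, subset = Month > 6)

r2 <- summary(fit)$r.squared                        # a single number, easy to append
r_from_lm <- sign(coef(fit)[["Wind"]]) * sqrt(r2)   # sqrt() alone drops the sign

# Direct computation on the same rows, for comparison
keep <- airquality$Month > 6
r_direct <- cor(airquality$Temp[keep], airquality$Wind[keep])

all.equal(r_from_lm, r_direct)

For a one-predictor regression with an intercept the two values should agree, so subsetting first and calling cor() directly is usually the simpler option.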

Here is what I am trying to do:

for (i in x) {list[i] = cor(data$column_a, data$column_b, subset = data$column_c > i)}

Obviously, though, I can't do that because cor() has no subset argument.

(Note: x = seq(1,100) and list = NULL)

Upvotes: 0

Views: 2156

Answers (3)

eipi10

Reputation: 93761

You can do this without a loop using lapply. Here's some code, using the built-in airquality data, that outputs a data frame with the month range in one column and the correlation in another. The do.call(rbind, ...) call just takes the list returned by lapply and turns it into a data frame.

corrs = do.call(rbind, lapply(min(airquality$Month):max(airquality$Month), function(x) {
  # Rows from month x onward with Temp below 80
  keep = airquality$Month >= x & airquality$Temp < 80
  data.frame(month_range = paste0(x, " - ", max(airquality$Month)),
             correlation = cor(airquality$Temp[keep], airquality$Wind[keep]))
}))

corrs 
  month_range correlation
1       5 - 9  -0.3519351
2       6 - 9  -0.2778532
3       7 - 9  -0.3291274
4       8 - 9  -0.3395647
5       9 - 9  -0.3823090

Upvotes: 1

rsoren

Reputation: 4206

Based on the pseudo-code you provided alone, here's something that should work:

for (i in x) {
    df <- subset(data, column_c > i)
    list[i] = cor(df$column_a, df$column_b)
}

However, I don't know why you would want your index in list[i] to be the same value that you use to subset column_c. That could be another source of problems.
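One way to sidestep that indexing concern is to keep the thresholds in their own vector and let sapply collect the results by position. This is only a sketch: airquality stands in for data (Temp, Wind and Month play the roles of column_a, column_b and column_c), and the thresholds vector is made up for illustration.

# Assumption: airquality stands in for `data`; `thresholds` is hypothetical.
thresholds <- 4:8

corrs <- sapply(thresholds, function(th) {
  keep <- airquality$Month > th      # logical mask for the subset
  cor(airquality$Temp[keep], airquality$Wind[keep])
})

# Label each result with the condition that produced it
names(corrs) <- paste0("Month > ", thresholds)
corrs

The result is a named numeric vector with one entry per threshold, so the position in the output no longer has to match the threshold value.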

Upvotes: 0

foxygen

Reputation: 1388

You can subset the data first, and then find the correlation.

a <- subset(airquality, Temp < 80 & Month > 7)
cor(a$Temp, a$Wind)

Edit: I don't really know what your list variable is, but here is an example of dynamically changing the subset based on i (note how the month requirement changes with each iteration):

list <- seq(1, 5)

# airquality covers months 5 to 9, so i = 1..4 keep every month
for (i in 1:5) {
  a <- subset(airquality, Temp < 80 & Month > i)
  list[i] <- cor(a$Temp, a$Wind)
}
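If the threshold can run past the largest value in the data (the question loops up to 100), some subsets will be empty or too small to give a meaningful correlation. Here is a hedged sketch of one way to guard against that. The helper name subset_cor and the min_n cutoff are assumptions for illustration, not part of the original question.

# Hypothetical helper: correlation of Temp and Wind among rows with
# Temp < 80 and Month > i, or NA when fewer than `min_n` rows remain.
subset_cor <- function(i, min_n = 3) {
  a <- subset(airquality, Temp < 80 & Month > i)
  if (nrow(a) < min_n) return(NA_real_)
  cor(a$Temp, a$Wind)
}

# vapply guarantees one numeric value per threshold
res <- vapply(1:10, subset_cor, numeric(1))
res

Thresholds of 9 and above leave no rows in airquality, so those entries come back as NA instead of stopping the loop.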

Upvotes: 0
