Calculate correlation between columns on multiple data frames

Question

I'm trying to write a function that reads multiple CSV files, each into a data frame, and counts the complete (no NA) observations in each one. For every file where that count is greater than a threshold passed to the function as an argument, it should compute the correlation between two columns, and finally return a vector of those correlations.

Right now I have the following code:

> dput(corr)
function (threshold = 0, directory = "/Users/marsh/datasciencecoursera/specdata/") 
{
setwd(directory)
data_files <- list.files()
output <- c()
for (i in data_files) {
    raw_data <- read.csv(data_files[i])
    raw_data_nona <- na.omit(raw_data)
    if (nrow(raw_data_nona) > threshold) {
        sulfate <- raw_data_nona[, "sulfate"]
        nitrate <- raw_data_nona[, "nitrate"]
        correlation <- cor(sulfate, y = nitrate)
        ouput <- c(ouput, correlation)
    }
}
ouput
}

When I run the function with a threshold of 150, 200, 400, etc., I get this error message:

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open file 'NA': No such file or directory

I'm not sure what is going wrong. I've checked countless times that the directory is right, and when I run the code piecemeal in the console, line by line, it sometimes works. Any help on why the function can't seem to open the files would be appreciated.

Marius · Accepted Answer

I think your problem is in these two lines:

for (i in data_files) {
    raw_data <- read.csv(data_files[i])

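I assume data_files is a vector of filenames like c("data1.csv", "data2.csv"). On each iteration of the for loop, i will then be a string like "data1.csv". It looks like you expected it to be a number, the index of the current position. Because data_files has no names, indexing it with a string (data_files["data1.csv"]) returns NA, which is why read.csv complains that it cannot open file 'NA'. You don't need to index back into data_files at all. You already have the string, so just do: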

for (i in data_files) {
    raw_data <- read.csv(i)
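
For reference, here is a sketch of how the whole function might look with that fix applied. It keeps your signature and the sulfate/nitrate columns from the question. Two things are my own choices rather than anything you asked for: I pass the directory to list.files() with full.names = TRUE instead of calling setwd(), and I filter on a ".csv" pattern on the assumption that the folder may contain other files. I also noticed that the loop assigns to ouput while the vector was created as output; unless an object called ouput already exists in your workspace, that would likely fail once the files start reading, so the sketch uses output throughout.

```r
corr <- function(threshold = 0, directory = "/Users/marsh/datasciencecoursera/specdata/") {
  # full.names = TRUE returns paths that include the directory,
  # so the working directory does not need to change.
  # The ".csv" pattern is an assumption about what the folder contains.
  data_files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)

  output <- numeric(0)
  for (f in data_files) {
    # f is already the file path, so it is passed to read.csv() directly
    complete <- na.omit(read.csv(f))

    # Only files with more complete rows than the threshold contribute
    if (nrow(complete) > threshold) {
      output <- c(output, cor(complete$sulfate, complete$nitrate))
    }
  }
  output
}
```

Calling corr(threshold = 150) should then return one correlation per qualifying file, or an empty numeric vector if no file has enough complete observations.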
